Re: Running two different Spark jobs vs multi-threading RDDs

2014-12-06 Thread Aaron Davidson
You can actually submit multiple jobs to a single SparkContext from different
threads. In the case you mentioned, with two stages that share a common parent,
both will wait for the parent stage to complete and then the two will
execute in parallel, sharing the cluster resources.

Solutions that submit multiple applications are also reasonable, but then
you have to manage the job dependencies yourself.
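The threaded-submission pattern Aaron describes can be sketched without a live cluster. The snippet below is a plain-Python model, not real Spark code: `CachedParent` stands in for a cached RDD, and the two report functions stand in for child jobs that would each call an action (e.g. `count()`) from their own thread. Both consumers block until the shared parent is materialized once, then proceed independently:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class CachedParent:
    """Stands in for a cached parent RDD: the expensive load/transform
    runs once, even when several consumer threads ask for it concurrently."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = None
        self.compute_count = 0  # how many times the "parent stage" ran

    def get(self):
        with self._lock:
            if self._data is None:  # first caller materializes the data
                self.compute_count += 1
                self._data = [x * 2 for x in range(10)]  # "load + transform"
            return self._data  # later callers reuse the cached result

parent = CachedParent()

def report_sum():  # stands in for one child job, e.g. an rdd.sum()
    return sum(parent.get())

def report_max():  # stands in for the other child job, e.g. an rdd.max()
    return max(parent.get())

# Submit both "jobs" from separate threads: each waits for the shared
# parent to be materialized exactly once, then the two run independently.
with ThreadPoolExecutor(max_workers=2) as pool:
    sum_future = pool.submit(report_sum)
    max_future = pool.submit(report_max)
    total, biggest = sum_future.result(), max_future.result()

print(total, biggest)  # -> 90 18
```

In real Spark code the lock-and-memoize step is what `rdd.cache()` plus the scheduler's shared parent stage give you for free; the only part you write yourself is the per-thread action call.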

On Sat, Dec 6, 2014 at 8:41 AM, Corey Nolet  wrote:

> Reading the documentation a little more closely, I realize I'm using the
> wrong terminology: I've been using "stages" to refer to what Spark calls a
> job. I guess an application (more than one SparkContext) is what I'm asking
> about.
> On Dec 5, 2014 5:19 PM, "Corey Nolet"  wrote:
>
>> I've read in the documentation that RDDs can be run concurrently when
>> submitted in separate threads. I'm curious how the scheduler would handle
>> propagating these down to the tasks.
>>
>> I have 3 RDDs:
>> - one RDD which loads some initial data, transforms it and caches it
>> - two RDDs which use the cached RDD to provide reports
>>
>> I'm trying to figure out how the resources will be scheduled to perform
>> these stages if I were to concurrently run the two RDDs that depend on the
>> first RDD. Would the two RDDs run sequentially? Will they both run at the
>> same time and be smart about sharing the cache?
>>
>> Would this be a time when I'd want to use Tachyon instead and run this as
>> 2 separate physical jobs: one to place the shared data in the RAMDISK and
>> one to run the two dependent RDDs concurrently? Or would it even be best in
>> that case to run 3 completely separate jobs?
>>
>> We're planning on using YARN, so there are two levels of scheduling going on.
>> We're trying to figure out the best way to utilize the resources so that we
>> are fully saturating the system and making sure there's constantly work
>> being done, rather than anything sitting idle waiting on upstream
>> processing to occur (in MapReduce, we'd just submit a ton of jobs and have
>> them wait in line).
>>
>


Re: Running two different Spark jobs vs multi-threading RDDs

2014-12-06 Thread Corey Nolet
Reading the documentation a little more closely, I realize I'm using the
wrong terminology: I've been using "stages" to refer to what Spark calls a
job. I guess an application (more than one SparkContext) is what I'm asking
about.


Running two different Spark jobs vs multi-threading RDDs

2014-12-05 Thread Corey Nolet
I've read in the documentation that RDDs can be run concurrently when
submitted in separate threads. I'm curious how the scheduler would handle
propagating these down to the tasks.

I have 3 RDDs:
- one RDD which loads some initial data, transforms it and caches it
- two RDDs which use the cached RDD to provide reports

I'm trying to figure out how the resources will be scheduled to perform
these stages if I were to concurrently run the two RDDs that depend on the
first RDD. Would the two RDDs run sequentially? Will they both run at the
same time and be smart about sharing the cache?

Would this be a time when I'd want to use Tachyon instead and run this as 2
separate physical jobs: one to place the shared data in the RAMDISK and one
to run the two dependent RDDs concurrently? Or would it even be best in
that case to run 3 completely separate jobs?

We're planning on using YARN, so there are two levels of scheduling going on.
We're trying to figure out the best way to utilize the resources so that we
are fully saturating the system and making sure there's constantly work
being done, rather than anything sitting idle waiting on upstream
processing to occur (in MapReduce, we'd just submit a ton of jobs and have
them wait in line).
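
One note on keeping the cluster saturated within a single SparkContext: the default FIFO scheduler lets the first submitted job take all available slots, so jobs submitted from other threads can still end up waiting in line. Spark's fair scheduler changes this so concurrently submitted jobs share executors. A minimal sketch of the configuration (the pool name `reports` below is illustrative, not something from this thread):

```properties
# spark-defaults.conf: let concurrently submitted jobs share executors
spark.scheduler.mode  FAIR
```

Individual threads can then opt into a named pool before submitting their jobs, e.g. `sc.setLocalProperty("spark.scheduler.pool", "reports")`, so the two report jobs are weighted independently of other work in the application.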