Reading the documentation a little more closely, I realize I'm using the wrong
terminology: I've been using "stages" to refer to what Spark calls a job. I
guess an application (more than one SparkContext) is what I'm asking about.
On Dec 5, 2014 5:19 PM, "Corey Nolet" <cjno...@gmail.com> wrote:

> I've read in the documentation that jobs on RDDs can run concurrently when
> their actions are submitted from separate threads. I'm curious how the
> scheduler would handle propagating these down to the tasks.
>
> I have 3 RDDs:
> - one RDD which loads some initial data, transforms it and caches it
> - two RDDs which use the cached RDD to provide reports
>
> I'm trying to figure out how the resources would be scheduled to perform
> these stages if I were to concurrently run the two RDDs that depend on the
> first RDD. Would the two RDDs run sequentially? Or would they both run at
> the same time and be smart about sharing the cached data?
>
> Would this be a time when I'd want to use Tachyon instead and run this as
> 2 separate physical jobs: one to place the shared data in the RAMDISK and
> one to run the two dependent RDDs concurrently? Or would it even be best in
> that case to run 3 completely separate jobs?
>
> We're planning on using YARN, so there are two levels of scheduling going
> on. We're trying to figure out the best way to utilize the resources so
> that we are fully saturating the system and making sure there's constantly
> work being done, rather than anything spinning its gears waiting on
> upstream processing to occur (in MapReduce, we'd just submit a ton of jobs
> and have them wait in line).
>
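For what it's worth, the threaded-submission pattern described in the question can be sketched roughly as below. This is a hypothetical illustration, not code from the thread: the input path, the transform, and the `reportA`/`reportB` actions are all made up for the example. The key point it shows is one cached RDD shared by two jobs whose actions are kicked off from separate threads (with `spark.scheduler.mode=FAIR`, Spark can interleave the two jobs' tasks rather than running them strictly FIFO):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConcurrentReports {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("concurrent-reports"))

    // One RDD that loads some data, transforms it, and caches the result.
    // Path and transform are placeholders for illustration only.
    val shared = sc.textFile("hdfs:///path/to/input")
      .map(_.toUpperCase)
      .cache()

    // Two actions submitted from separate threads. Each action triggers its
    // own Spark job; both jobs read the same cached partitions, so the load
    // and transform work is done once.
    val reportA = new Thread(() => println(s"count = ${shared.count()}"))
    val reportB = new Thread(() => println(s"distinct = ${shared.distinct().count()}"))
    reportA.start(); reportB.start()
    reportA.join(); reportB.join()

    sc.stop()
  }
}
```

Whichever job runs the cached RDD's partitions first will materialize them; the other job reuses what is already in the block store, so there is no need for a separate process or an external store like Tachyon just to share data within a single application.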
