Reading the documentation a little more closely, I realize I'm using the wrong terminology. I've been using "stages" to refer to what Spark calls a job. I guess an application (more than one Spark context) is what I'm asking about.
On Dec 5, 2014 5:19 PM, "Corey Nolet" <cjno...@gmail.com> wrote:
> I've read in the documentation that RDDs can be run concurrently when
> submitted in separate threads. I'm curious how the scheduler would handle
> propagating these down to the tasks.
>
> I have 3 RDDs:
> - one RDD which loads some initial data, transforms it, and caches it
> - two RDDs which use the cached RDD to provide reports
>
> I'm trying to figure out how the resources will be scheduled to perform
> these stages if I were to concurrently run the two RDDs that depend on the
> first RDD. Would the two RDDs run sequentially? Would they both run at the
> same time and be smart about how they are caching?
>
> Would this be a time when I'd want to use Tachyon instead and run this as
> 2 separate physical jobs: one to place the shared data in the RAMDISK and
> one to run the two dependent RDDs concurrently? Or would it even be best in
> that case to run 3 completely separate jobs?
>
> We're planning on using YARN, so there's 2 levels of scheduling going on.
> We're trying to figure out the best way to utilize the resources so that we
> are fully saturating the system and making sure there's constantly work
> being done rather than anything spinning gears waiting on upstream
> processing to occur (in MapReduce, we'd just submit a ton of jobs and have
> them wait in line).
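For what it's worth, the pattern in the quoted message — two report jobs submitted concurrently against one cached RDD — can be sketched roughly like this in Scala. This is only an illustration: the input path, the transform, and the two filter-based "reports" are placeholders, and whether the two jobs actually overlap depends on the scheduler mode (e.g. `spark.scheduler.mode=FAIR`) and on how many executors/cores are free:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ConcurrentReports {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("concurrent-reports"))

    // RDD #1: load some initial data, transform it, and cache it.
    val base = sc.textFile("hdfs:///path/to/input") // placeholder path
      .map(_.toUpperCase)                           // placeholder transform
      .cache()

    // Optionally force the cache to materialize once, so the two
    // concurrent jobs below don't both recompute partitions that
    // haven't been cached yet.
    base.count()

    // RDDs #2 and #3: two actions submitted from separate threads.
    // Each Future runs on its own thread, so the driver submits two
    // jobs to the scheduler at the same time.
    val report1 = Future { base.filter(_.startsWith("A")).count() }
    val report2 = Future { base.filter(_.startsWith("B")).count() }

    val c1 = Await.result(report1, Duration.Inf)
    val c2 = Await.result(report2, Duration.Inf)
    println(s"report1=$c1 report2=$c2")

    sc.stop()
  }
}
```

One caveat worth noting: if the two actions are launched before the cache has been populated, some partitions of `base` may be computed more than once, which is why the sketch materializes the cache with a `count()` first.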