Re: Running two different Spark jobs vs multi-threading RDDs
You can actually submit multiple jobs to a single SparkContext from different threads. In the case you mentioned, where two stages share a common parent, both will wait for the parent stage to complete and then execute in parallel, sharing the cluster's resources. Solutions that submit multiple applications are also reasonable, but then you have to manage the job dependencies yourself.

On Sat, Dec 6, 2014 at 8:41 AM, Corey Nolet wrote:
> Reading the documentation a little more closely, I see I'm using the wrong
> terminology: I've been using "stages" to refer to what Spark calls a job. I
> guess applications (more than one SparkContext) are what I'm asking about.
> On Dec 5, 2014 5:19 PM, "Corey Nolet" wrote:
>
>> I've read in the documentation that RDDs can be run concurrently when
>> submitted in separate threads. I'm curious how the scheduler would
>> propagate this down to the tasks.
>>
>> I have 3 RDDs:
>> - one RDD that loads some initial data, transforms it, and caches it
>> - two RDDs that use the cached RDD to produce reports
>>
>> I'm trying to figure out how resources will be scheduled for these stages
>> if I concurrently run the two RDDs that depend on the first one. Would the
>> two RDDs run sequentially, or would they both run at the same time and be
>> smart about sharing the cache?
>>
>> Would this be a case where I'd want to use Tachyon instead and run this as
>> two separate physical jobs: one to place the shared data in the RAMDISK
>> and one to run the two dependent RDDs concurrently? Or would it be best in
>> that case to run three completely separate jobs?
>>
>> We're planning on using YARN, so there are two levels of scheduling going
>> on. We're trying to figure out the best way to utilize the resources so
>> that we fully saturate the system and keep work constantly flowing, rather
>> than anything spinning its gears waiting on upstream processing (in
>> MapReduce, we'd just submit a ton of jobs and have them wait in line).
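One detail worth noting about jobs "sharing the cluster resources": within a single SparkContext, concurrently submitted jobs are scheduled FIFO by default, so the first report job can monopolize the executors. Spark also ships a fair scheduler that round-robins tasks across concurrent jobs. A minimal config sketch (set in spark-defaults.conf, or via --conf on spark-submit):

```properties
# Schedule concurrent jobs within one SparkContext fairly,
# instead of the default FIFO ordering.
spark.scheduler.mode  FAIR
```

With this set, the two report jobs submitted from separate threads share executors roughly evenly rather than queueing one behind the other.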
Re: Running two different Spark jobs vs multi-threading RDDs
Reading the documentation a little more closely, I see I'm using the wrong terminology: I've been using "stages" to refer to what Spark calls a job. I guess applications (more than one SparkContext) are what I'm asking about.

On Dec 5, 2014 5:19 PM, "Corey Nolet" wrote:
> I've read in the documentation that RDDs can be run concurrently when
> submitted in separate threads. I'm curious how the scheduler would
> propagate this down to the tasks.
>
> I have 3 RDDs:
> - one RDD that loads some initial data, transforms it, and caches it
> - two RDDs that use the cached RDD to produce reports
>
> I'm trying to figure out how resources will be scheduled for these stages
> if I concurrently run the two RDDs that depend on the first one. Would the
> two RDDs run sequentially, or would they both run at the same time and be
> smart about sharing the cache?
>
> Would this be a case where I'd want to use Tachyon instead and run this as
> two separate physical jobs: one to place the shared data in the RAMDISK
> and one to run the two dependent RDDs concurrently? Or would it be best in
> that case to run three completely separate jobs?
>
> We're planning on using YARN, so there are two levels of scheduling going
> on. We're trying to figure out the best way to utilize the resources so
> that we fully saturate the system and keep work constantly flowing, rather
> than anything spinning its gears waiting on upstream processing (in
> MapReduce, we'd just submit a ton of jobs and have them wait in line).
Running two different Spark jobs vs multi-threading RDDs
I've read in the documentation that RDDs can be run concurrently when submitted in separate threads. I'm curious how the scheduler would propagate this down to the tasks.

I have 3 RDDs:
- one RDD that loads some initial data, transforms it, and caches it
- two RDDs that use the cached RDD to produce reports

I'm trying to figure out how resources will be scheduled for these stages if I concurrently run the two RDDs that depend on the first one. Would the two RDDs run sequentially, or would they both run at the same time and be smart about sharing the cache?

Would this be a case where I'd want to use Tachyon instead and run this as two separate physical jobs: one to place the shared data in the RAMDISK and one to run the two dependent RDDs concurrently? Or would it be best in that case to run three completely separate jobs?

We're planning on using YARN, so there are two levels of scheduling going on. We're trying to figure out the best way to utilize the resources so that we fully saturate the system and keep work constantly flowing, rather than anything spinning its gears waiting on upstream processing (in MapReduce, we'd just submit a ton of jobs and have them wait in line).
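The "separate threads" shape described above can be sketched without Spark at all. This is a plain-Python analogy, not Spark API code: the three functions are hypothetical stand-ins for the three RDDs, the cached list plays the role of the cached parent RDD, and the thread pool plays the role of the driver threads each triggering a job.

```python
from concurrent.futures import ThreadPoolExecutor

def load_and_transform():
    # Stand-in for the parent RDD: load initial data and transform it.
    return [x * 2 for x in range(10)]

def report_sum(data):
    # Stand-in for the first dependent "report" job.
    return sum(data)

def report_max(data):
    # Stand-in for the second dependent "report" job.
    return max(data)

# Compute and "cache" the parent result once, as rdd.cache() plus a
# first action would materialize the shared data.
cached = load_and_transform()

# Submit both dependent computations from separate threads, the way two
# jobs can be submitted concurrently against one SparkContext.
with ThreadPoolExecutor(max_workers=2) as pool:
    total = pool.submit(report_sum, cached)
    biggest = pool.submit(report_max, cached)
    print(total.result(), biggest.result())  # -> 90 18
```

The key point of the analogy: the parent work happens once, and the two dependents overlap in time against the shared result instead of running back to back.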