I've read in the documentation that jobs on RDDs can run concurrently when
they are submitted from separate threads. I'm curious how the scheduler
handles propagating these down to the tasks.

I have 3 RDDs:
- one RDD which loads some initial data, transforms it and caches it
- two RDDs which use the cached RDD to provide reports

I'm trying to figure out how resources will be scheduled across these
stages if I concurrently run the two RDDs that depend on the first RDD.
Would the two RDDs run sequentially? Or would they both run at the same
time and be smart about sharing the cached data?
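
To make the setup concrete, here is a rough plain-Python analogy of what I
have in mind. It is not Spark code: `CachedDataset`, `load_and_transform`,
`report_sum` and `report_max` are hypothetical names, and the lock-based
caching is only my assumption about the behavior I'd like to see.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class CachedDataset:
    """Stand-in for the first RDD: load and transform once, then reuse."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._data = None
        self.load_count = 0  # how many times the loader actually ran

    def get(self):
        # The lock makes the second caller wait instead of reloading.
        with self._lock:
            if self._data is None:
                self._data = self._loader()
                self.load_count += 1
            return self._data


def load_and_transform():
    # Hypothetical "initial data" step; the sleep simulates slow loading.
    time.sleep(0.1)
    return [x * 2 for x in range(10)]


def report_sum(dataset):
    # First dependent report.
    return sum(dataset.get())


def report_max(dataset):
    # Second dependent report.
    return max(dataset.get())


def run_reports():
    shared = CachedDataset(load_and_transform)
    # Submit both reports from separate threads against the shared data.
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(report_sum, shared)
        second = pool.submit(report_max, shared)
        return first.result(), second.result(), shared.load_count


if __name__ == "__main__":
    print(run_reports())
```

In this sketch the second report simply blocks until the first has finished
loading, and the load happens once. Whether the Spark scheduler does
something equivalent for a cached RDD is exactly what I'm unsure about.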

Would this be a case where I'd want to use Tachyon instead and run this as
two separate physical jobs: one to place the shared data in the RAM disk
and one to run the two dependent RDDs concurrently? Or would it even be
best in that case to run three completely separate jobs?

We're planning on using YARN, so there are two levels of scheduling going
on. We're trying to figure out the best way to use the resources so that
we fully saturate the system and there is constantly work being done,
rather than anything sitting idle waiting on upstream processing to finish
(in MapReduce, we'd just submit a ton of jobs and have them wait in line).
