We have a somewhat complex pipeline which has multiple output files on
HDFS, and we'd like the materialization of those outputs to happen
concurrently.

Internal to Spark, any "save" call creates a new job, which runs
synchronously: the job executes the entire dependency DAG needed to
produce the output, and the line of code after your save() runs only
once the job completes. The same applies to foreach, collect, count, and
the other actions.
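
To make that concrete, here is a minimal sketch; the paths and
transformations are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object SequentialSaves {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sequential-saves"))

        // Shared upstream work feeding two outputs.
        val common = sc.textFile("hdfs:///input").map(_.toUpperCase)
        val outA   = common.filter(_.startsWith("A"))
        val outB   = common.filter(_.startsWith("B"))

        // Each save is its own synchronous job: the second save does not
        // start until the first job has finished, and without cache() the
        // `common` lineage is executed twice.
        outA.saveAsTextFile("hdfs:///out/a")
        outB.saveAsTextFile("hdfs:///out/b")

        sc.stop()
      }
    }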

The files we want to save have overlapping dependencies. For us to create
multiple outputs concurrently, we have a few options that I can see:
 1. Spawn a thread for each file we want to save, letting Spark run the
jobs somewhat independently (see the first sketch after this list). This
has the downside of various concurrency bugs (e.g. SPARK-4454
<https://issues.apache.org/jira/browse/SPARK-4454>, and more recently
SPARK-13631 <https://issues.apache.org/jira/browse/SPARK-13631>), and it
also causes RDDs up the dependency graph to be independently, uselessly
recomputed.
 2. Analyze the dependency graph ourselves, materialize (by checkpointing
or saving) the common dependencies, and then execute the saves in threads.
 3. Implement our own saves to HDFS as side effects inside mapPartitions
(which is how save works internally anyway, modulo the committing logic
that handles speculative execution), yield an empty dummy RDD for each
thing we want to save, and then run foreach or count on the union of all
the dummy RDDs, which causes Spark to schedule the entire DAG we're
interested in (see the second sketch after this list).

Currently we are doing a little of #1 and a little of #3, depending on who
originally wrote the code. #2 is probably closest to what we're supposed
to be doing, but IMO Spark is already able to produce a good execution
plan, and we shouldn't have to replicate that analysis ourselves.

AFAIK, there's no way to do what I *actually* want in Spark, which is to
have some control over which saves go into which jobs, and then execute
those jobs directly. I can envision a new version of the various save
functions that takes an extra job argument, or some way to defer and then
unblock job creation in the SparkContext.
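
Purely as a strawman (none of these calls exist in Spark today):

    // Hypothetical API: register saves against a deferred job, then
    // submit them together so the scheduler sees one DAG covering all
    // of the outputs.
    val job = sc.newDeferredJob()              // hypothetical
    outA.saveAsTextFile("hdfs:///out/a", job)  // hypothetical overload
    outB.saveAsTextFile("hdfs:///out/b", job)
    job.run()  // blocks until both saves complete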

Ideas?
