Hi Andy,

It's nice to see that we are not the only ones with the same issues. So
far we have not gone as far as you have. What we have done is cache
whatever DataFrames/RDDs are shared for computing different outputs.
This has brought us quite a speedup, but we still see that saving some
large output blocks all other computation, even though the save uses
only one executor and the rest of the cluster is just waiting.
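
Roughly, what we do looks like this (a simplified, untested sketch with
made-up paths):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object CachedMultiOutput {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("multi-output"))

        // Expensive shared intermediate; persist it so the second save
        // does not recompute the whole lineage from scratch.
        val shared = sc.textFile("hdfs:///in/events")   // made-up path
          .filter(_.nonEmpty)
          .persist(StorageLevel.MEMORY_AND_DISK)

        // Each save launches its own job and blocks until it finishes,
        // but the shared RDD is only materialized once.
        shared.filter(_.startsWith("A")).saveAsTextFile("hdfs:///out/a")
        shared.filter(_.startsWith("B")).saveAsTextFile("hdfs:///out/b")

        shared.unpersist()
        sc.stop()
      }
    }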

I was thinking about trying something similar to what you are
describing in 1), but I am sad to see it is riddled with bugs, and to
me it seems like going against Spark in a way.
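
The thread-per-save variant I had in mind is roughly the following
(just a sketch, assuming `shared` is the persisted RDD from the snippet
above; each Future submits an independent Spark job from its own driver
thread, and persist() keeps the shared lineage from being recomputed
twice):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Kick off both saves from separate driver threads.
    val saves = Seq(
      Future { shared.filter(_.startsWith("A")).saveAsTextFile("hdfs:///out/a") },
      Future { shared.filter(_.startsWith("B")).saveAsTextFile("hdfs:///out/b") }
    )

    // Block the driver until both jobs have finished.
    Await.result(Future.sequence(saves), Duration.Inf)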

Hope someone can help in resolving this.

Cheers.

Jan
--
Jan Sterba
https://twitter.com/honzasterba | http://flickr.com/honzasterba |
http://500px.com/honzasterba


On Wed, Mar 9, 2016 at 2:31 AM, Andy Sloane <a...@a1k0n.net> wrote:
> We have a somewhat complex pipeline which has multiple output files on HDFS,
> and we'd like the materialization of those outputs to happen concurrently.
>
> Internal to Spark, any "save" call creates a new "job", which runs
> synchronously -- that is, the line of code after your save() executes once
> the job completes, executing the entire dependency DAG to produce it. Same
> with foreach, collect, count, etc.
>
> The files we want to save have overlapping dependencies. For us to create
> multiple outputs concurrently, we have a few options that I can see:
>  - Spawn a thread for each file we want to save, letting Spark run the jobs
> somewhat independently. This has the downside of various concurrency bugs
> (e.g. SPARK-4454, and more recently SPARK-13631) and also causes RDDs up the
> dependency graph to get independently, uselessly recomputed.
>  - Analyze our own dependency graph, materialize (by checkpointing or
> saving) common dependencies, and then execute the two saves in threads.
>  - Implement our own saves to HDFS as side-effects inside mapPartitions
> (which is how save actually works internally anyway, modulo committing logic
> to handle speculative execution), yielding an empty dummy RDD for each thing
> we want to save, and then run foreach or count on the union of all the dummy
> RDDs, which causes Spark to schedule the entire DAG we're interested in.
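>
> To make #3 concrete, a stripped-down sketch (hypothetical paths and
> RDD names, and without any committer logic, so task retries and
> speculative execution are not handled here):
>
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.fs.Path
>     import org.apache.spark.rdd.RDD
>
>     // Write each partition to HDFS as a side effect and return an
>     // empty iterator, so the result is only a dummy placeholder RDD.
>     def sideEffectSave(rdd: RDD[String], path: String): RDD[Unit] =
>       rdd.mapPartitionsWithIndex[Unit] { (idx, iter) =>
>         val fs = new Path(path).getFileSystem(new Configuration())
>         val out = fs.create(new Path(s"$path/part-$idx"))
>         try iter.foreach(l => out.write((l + "\n").getBytes("UTF-8")))
>         finally out.close()
>         Iterator.empty
>       }
>
>     // A single action over the union of the dummies makes Spark
>     // schedule the whole combined DAG as one job.
>     val dummyA = sideEffectSave(outA, "hdfs:///out/a")  // outA, outB: our output RDDs
>     val dummyB = sideEffectSave(outB, "hdfs:///out/b")
>     dummyA.union(dummyB).count()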
>
> Currently we are doing a little of #1 and a little of #3, depending on who
> originally wrote the code. #2 is probably closer to what we're supposed to
> be doing, but IMO Spark is already able to produce a good execution plan and
> we shouldn't have to do that.
>
> AFAIK, there's no way to do what I *actually* want in Spark, which is to
> have some control over which saves go into which jobs, and then execute the
> jobs directly. I can envision a new version of the various save functions
> which take an extra job argument, or something, or some way to defer and
> unblock job creation in the spark context.
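>
> For concreteness, the kind of thing I'm imagining (purely
> hypothetical, none of these methods exist today):
>
>     val job = sc.deferredJob()                 // imaginary: collect saves without running them
>     rddA.saveAsTextFile("hdfs:///out/a", job)  // imaginary overload: enqueue, don't execute
>     rddB.saveAsTextFile("hdfs:///out/b", job)
>     job.run()                                  // submit everything as one job and block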
>
> Ideas?
>
