Spark will skip a stage if it has already been computed by another job. That
means the common parent RDD of each job only needs to be computed once. But
these are still multiple sequential jobs, not concurrent jobs.
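For example, a minimal sketch of that behaviour (assuming a SparkContext
`sc`, e.g. from spark-shell; all paths and names here are illustrative):

    // Two sequential jobs sharing one shuffled parent RDD. The second job
    // skips the reduceByKey stage because its shuffle output (or the cached
    // partitions) already exists, but the jobs still run one after another.
    val parent = sc.textFile("hdfs:///input")
      .map(line => (line.split("\t")(0), 1L))
      .reduceByKey(_ + _)   // shuffle stage, computed by the first job only
      .cache()              // optional: also keep the result in memory

    parent.saveAsTextFile("hdfs:///out/counts")                       // job 1
    parent.filter(_._2 > 100).saveAsTextFile("hdfs:///out/frequent")  // job 2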

On Wed, Mar 9, 2016 at 3:29 PM, Jan Štěrba <i...@jansterba.com> wrote:

> Hi Andy,
>
> It's nice to see that we are not the only ones with the same issues. So
> far we have not gone as far as you have. What we have done is that we
> cache whatever DataFrames/RDDs are shared for computing different
> outputs. This has brought us quite a speedup, but we still see that
> saving some large output blocks all other computation, even though the
> save uses only one executor and the rest of the cluster is just waiting.
>
> I was thinking about trying something similar to what you are
> describing in 1), but I am sad to see it is riddled with bugs, and to
> me it seems like going against Spark in a way.
>
> Hope someone can help in resolving this.
>
> Cheers.
>
> Jan
> --
> Jan Sterba
> https://twitter.com/honzasterba | http://flickr.com/honzasterba |
> http://500px.com/honzasterba
>
>
> On Wed, Mar 9, 2016 at 2:31 AM, Andy Sloane <a...@a1k0n.net> wrote:
> > We have a somewhat complex pipeline which has multiple output files
> > on HDFS, and we'd like the materialization of those outputs to
> > happen concurrently.
> >
> > Internal to Spark, any "save" call creates a new "job", which runs
> > synchronously -- that is, the line of code after your save() executes
> > once the job completes, executing the entire dependency DAG to
> > produce it. Same with foreach, collect, count, etc.
> >
> > The files we want to save have overlapping dependencies. For us to
> > create multiple outputs concurrently, we have a few options that I
> > can see:
> >  - Spawn a thread for each file we want to save, letting Spark run
> > the jobs somewhat independently. This has the downside of various
> > concurrency bugs (e.g. SPARK-4454, and more recently SPARK-13631)
> > and also causes RDDs up the dependency graph to get independently,
> > uselessly recomputed.
> >  - Analyze our own dependency graph, materialize (by checkpointing
> > or saving) common dependencies, and then execute the two saves in
> > threads.
> >  - Implement our own saves to HDFS as side effects inside
> > mapPartitions (which is how save actually works internally anyway,
> > modulo committing logic to handle speculative execution), yielding
> > an empty dummy RDD for each thing we want to save, and then run
> > foreach or count on the union of all the dummy RDDs, which causes
> > Spark to schedule the entire DAG we're interested in.
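A rough illustrative sketch of the shape of that third option (assuming
existing RDDs `rddA` and `rddB` and a SparkContext `sc`; the actual
per-partition write and the committing logic are deliberately left as a
comment):

    import org.apache.spark.rdd.RDD

    // Each "save" becomes a transformation that writes its partition as a
    // side effect and yields no elements; a single action on the union of
    // the dummy RDDs then schedules the whole DAG as one job.
    def saveAsSideEffect(rdd: RDD[String], dir: String): RDD[Unit] =
      rdd.mapPartitionsWithIndex { (idx, iter) =>
        // placeholder for the real write: stream `iter` to s"$dir/part-$idx"
        // via the HDFS client (committing logic for speculative execution
        // omitted, as noted above)
        iter.foreach(_ => ())   // consume the partition in place of the write
        Iterator.empty
      }

    val dummies = Seq(saveAsSideEffect(rddA, "hdfs:///out/a"),
                      saveAsSideEffect(rddB, "hdfs:///out/b"))
    sc.union(dummies).count()   // one job materialises both outputs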
> >
> > Currently we are doing a little of #1 and a little of #3, depending
> > on who originally wrote the code. #2 is probably closer to what we're
> > supposed to be doing, but IMO Spark is already able to produce a good
> > execution plan and we shouldn't have to do that.
> >
> > AFAIK, there's no way to do what I *actually* want in Spark, which is
> > to have some control over which saves go into which jobs, and then
> > execute the jobs directly. I can envision a new version of the
> > various save functions which take an extra job argument, or something,
> > or some way to defer and unblock job creation in the SparkContext.
> >
> > Ideas?
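For reference, a minimal sketch of the threaded approach from option 1
(the names `shared`, `derivedA` and `derivedB` are placeholders for an
already-defined common parent and two RDDs derived from it; caching the
parent up front avoids the duplicate recomputation mentioned above):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    // Materialise the shared dependency once so the concurrent jobs do not
    // recompute it independently.
    shared.cache().count()

    // SparkContext accepts job submissions from multiple threads; each save
    // below becomes its own concurrently running job. (The concurrency bugs
    // cited above, SPARK-4454 and SPARK-13631, affect some versions.)
    val saves = Seq(
      Future { derivedA.saveAsTextFile("hdfs:///out/a") },
      Future { derivedB.saveAsTextFile("hdfs:///out/b") }
    )
    Await.result(Future.sequence(saves), Duration.Inf)

With the default FIFO scheduler the first submitted job gets priority on
cluster resources; setting spark.scheduler.mode=FAIR lets concurrent jobs
share the cluster more evenly.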
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang
