Re: Saving multiple outputs in the same job

2016-03-09 Thread Jeff Zhang
Spark will skip a stage if it has already been computed by another job. That means the common parent RDD of the jobs only needs to be computed once. But the jobs still run as multiple sequential jobs, not concurrent jobs.
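
A minimal sketch of that stage-skipping behaviour, assuming a hypothetical word-count pipeline (the HDFS paths are placeholders): the shuffle produced by reduceByKey is materialized by the first job, so the second job reuses the map output and the Spark UI shows that stage as "skipped" -- but the two jobs still run one after the other.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("stage-skip-sketch"))
    val counts = sc.textFile("hdfs:///data/words")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)           // shuffle stage, computed by the first job only

    counts.saveAsTextFile("hdfs:///out/first")   // job 1: runs every stage
    counts.saveAsTextFile("hdfs:///out/second")  // job 2: shuffle stage is skipped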

Re: Saving multiple outputs in the same job

2016-03-08 Thread Jan Štěrba
Hi Andy, it's nice to see that we are not the only ones with the same issues. So far we have not gone as far as you have. What we have done is cache whatever DataFrames/RDDs are shared for computing the different outputs. This has brought us quite a speedup, but we still see that saving some
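
A sketch of that caching approach using the Spark 1.x DataFrame API (the input path, column names, and output paths are all hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("cached-outputs"))
    val sqlContext = new SQLContext(sc)

    // Cache the shared work so each subsequent save reads it from memory
    // instead of recomputing it from the source.
    val shared = sqlContext.read.parquet("hdfs:///data/events")
      .filter("status = 'ok'")
      .cache()

    // Each save() still launches its own sequential job, but only the
    // per-output aggregation is recomputed, not the shared parent.
    shared.groupBy("user").count().write.parquet("hdfs:///out/by_user")
    shared.groupBy("day").count().write.parquet("hdfs:///out/by_day")

    shared.unpersist()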

Saving multiple outputs in the same job

2016-03-08 Thread Andy Sloane
We have a somewhat complex pipeline which produces multiple output files on HDFS, and we'd like the materialization of those outputs to happen concurrently. Internally, any Spark "save" call creates a new "job", which runs synchronously -- that is, the line of code after your save() executes once the
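
One common workaround (not something this thread confirms Andy used) is to submit each save from its own thread: the Spark scheduler accepts jobs from multiple threads, so the saves can run concurrently on the cluster. A minimal sketch, reusing the hypothetical cached `shared` DataFrame from the earlier message:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Each Future blocks its own thread on a synchronous save(), so the
    // jobs can overlap instead of running back to back. Caching `shared`
    // first keeps the concurrent jobs from racing to recompute it.
    val saves = Seq(
      Future { shared.groupBy("user").count().write.parquet("hdfs:///out/by_user") },
      Future { shared.groupBy("day").count().write.parquet("hdfs:///out/by_day") }
    )

    // Wait for all of the concurrent saves to finish.
    Await.result(Future.sequence(saves), Duration.Inf)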