You can parallelize on the driver side. The way to do it is almost exactly what you have here, where you're iterating over a local Scala collection of dates and invoking a Spark operation for each. Simply write "dateList.par.map(...)" to make the local map run in parallel: each thread submits its Spark job without waiting for the others, so the jobs execute concurrently on the cluster.
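For concreteness, here is a minimal sketch of that idea. The input/output paths, the dates, and the transform are hypothetical stand-ins for whatever your real pipeline uses; sc.textFile/saveAsTextFile replace the pseudocode's sc.hdfsFile/saveAsHadoopFile, and foreach replaces map since we don't need the results:

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelJobs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parallel-jobs"))

        // Hypothetical local collection of dates and a placeholder transform.
        val dateList = Seq("2015-01-01", "2015-01-02", "2015-01-03")
        def transform(line: String): String = line

        // .par turns the local Seq into a parallel collection, so each
        // iteration runs on its own driver thread and submits its Spark
        // job immediately instead of waiting for the previous one.
        dateList.par.foreach { date =>
          sc.textFile(s"hdfs:///data/$date")      // hypothetical input path
            .map(transform)
            .saveAsTextFile(s"hdfs:///out/$date") // hypothetical output path
        }

        sc.stop()
      }
    }

This works because SparkContext is thread-safe for job submission. Note that parallel collections run on the default ForkJoinPool, sized to the number of driver cores, which caps how many jobs are submitted at once.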
On Fri, Jan 9, 2015 at 10:46 AM, Anders Arpteg <arp...@spotify.com> wrote:
> Hey,
>
> Let's say we have multiple independent jobs that each transform some data
> and store it in distinct HDFS locations. Is there a nice way to run them in
> parallel? See the following pseudo-code snippet:
>
>   dateList.map(date =>
>     sc.hdfsFile(date).map(transform).saveAsHadoopFile(date))
>
> It's unfortunate if they run in sequence, since the executors are not
> used efficiently. What's the best way to parallelize execution of these
> jobs?
>
> Thanks,
> Anders