Awesome, it actually seems to work. Amazing how simple it can be sometimes...
Thanks Sean!

On Fri, Jan 9, 2015 at 12:42 PM, Sean Owen <so...@cloudera.com> wrote:

> You can parallelize on the driver side. The way to do it is almost
> exactly what you have here, where you're iterating over a local Scala
> collection of dates and invoking a Spark operation for each. Simply
> write "dateList.par.map(...)" to make the local map proceed in
> parallel. It should invoke the Spark jobs simultaneously.
>
> On Fri, Jan 9, 2015 at 10:46 AM, Anders Arpteg <arp...@spotify.com> wrote:
> > Hey,
> >
> > Let's say we have multiple independent jobs that each transform some
> > data and store in distinct hdfs locations, is there a nice way to run
> > them in parallel? See the following pseudo code snippet:
> >
> > dateList.map(date =>
> >   sc.hdfsFile(date).map(transform).saveAsHadoopFile(date))
> >
> > It's unfortunate if they run in sequence, since all the executors are not
> > used efficiently. What's the best way to parallelize execution of these
> > jobs?
> >
> > Thanks,
> > Anders
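
For anyone finding this thread later, here is a minimal sketch of the driver-side parallelism discussed above. It is an illustration under stated assumptions, not the exact code from this thread: `runJob`, `runAll` and the `output/` path are hypothetical names, and the Spark call is replaced by a placeholder so the snippet runs without a cluster. It uses `scala.concurrent.Future` rather than `.par`, since parallel collections live in a separate module as of Scala 2.13; on older Scala versions `dateList.par.map(runJob)` should behave similarly.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelJobs {
  // Hypothetical stand-in for one Spark job. In a real driver this body would
  // hold the blocking Spark action for one date (read, transform, save).
  def runJob(date: String): String = {
    Thread.sleep(100) // simulate a blocking action
    s"output/$date"   // assumed output location, for illustration only
  }

  // Start one Future per date so the blocking calls are issued from separate
  // driver threads, then wait for all of them. Results keep the input order.
  def runAll(dates: Seq[String]): Seq[String] = {
    val jobs = dates.map(date => Future(runJob(date)))
    Await.result(Future.sequence(jobs), 10.minutes)
  }

  def main(args: Array[String]): Unit = {
    val dateList = Seq("2015-01-07", "2015-01-08", "2015-01-09")
    runAll(dateList).foreach(println)
  }
}
```

Note that the global execution context sizes its thread pool from the number of available cores by default, so the number of jobs in flight at once may be lower than the number of dates. A dedicated execution context is one way to control that.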