Re: Queue independent jobs
You can parallelize on the driver side. The way to do it is almost exactly what you have here, where you're iterating over a local Scala collection of dates and invoking a Spark operation for each. Simply write dateList.par.map(...) to make the local map proceed in parallel. It should then invoke the Spark jobs simultaneously.

On Fri, Jan 9, 2015 at 10:46 AM, Anders Arpteg arp...@spotify.com wrote:

Hey, Let's say we have multiple independent jobs that each transform some data and store it in distinct HDFS locations. Is there a nice way to run them in parallel? See the following pseudo code snippet:

dateList.map(date => sc.hdfsFile(date).map(transform).saveAsHadoopFile(date))

It's unfortunate if they run in sequence, since not all the executors are used efficiently. What's the best way to parallelize execution of these jobs? Thanks, Anders
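To make the driver-side idea in the reply above concrete, here is a minimal sketch that uses only the Scala standard library. It does not use the `.par` call itself (in Scala 2.13 and later, parallel collections live in a separate module), but swaps in plain `Future`s on a fixed thread pool, which gives the same effect of calling a blocking action from several driver threads at once. Everything named here is hypothetical: `runJob` is a stand-in for the real Spark chain, and `maxConcurrent` and the `/output/...` paths are made up for illustration.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelDriverSketch {
  // Hypothetical stand-in for one blocking Spark job. It only sleeps and
  // returns a made-up output path; no Spark API is called here.
  def runJob(date: String): String = {
    Thread.sleep(100) // simulate a job that blocks the calling thread
    s"/output/$date"
  }

  // Submit one job per date from separate driver threads and wait for all.
  // maxConcurrent (an assumed parameter) caps how many are in flight at once.
  def runAll(dates: Seq[String], maxConcurrent: Int): Seq[String] = {
    val pool = Executors.newFixedThreadPool(maxConcurrent)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = dates.map(date => Future(runJob(date)))
      // Future.sequence keeps the results in the same order as the input.
      Await.result(Future.sequence(futures), 10.minutes)
    } finally {
      pool.shutdown()
    }
  }

  def main(args: Array[String]): Unit = {
    val dateList = Seq("2015-01-07", "2015-01-08", "2015-01-09")
    runAll(dateList, maxConcurrent = 3).foreach(println)
  }
}
```

In an actual driver, the body of `runJob` would presumably be the per-date chain from the question, for example something like `sc.textFile(in).map(transform).saveAsTextFile(out)` with real input and output paths. This relies on Spark accepting jobs submitted from multiple threads of the same application; how much they really overlap depends on the available executors and the scheduler configuration.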
Re: Queue independent jobs
Awesome, it actually seems to work. Amazing how simple it can be sometimes... Thanks Sean!

On Fri, Jan 9, 2015 at 12:42 PM, Sean Owen so...@cloudera.com wrote:

You can parallelize on the driver side. The way to do it is almost exactly what you have here, where you're iterating over a local Scala collection of dates and invoking a Spark operation for each. Simply write dateList.par.map(...) to make the local map proceed in parallel. It should invoke the Spark jobs simultaneously.
Queue independent jobs
Hey, Let's say we have multiple independent jobs that each transform some data and store it in distinct HDFS locations. Is there a nice way to run them in parallel? See the following pseudo code snippet:

dateList.map(date => sc.hdfsFile(date).map(transform).saveAsHadoopFile(date))

It's unfortunate if they run in sequence, since not all the executors are used efficiently. What's the best way to parallelize execution of these jobs? Thanks, Anders