You can parallelize on the driver side. The way to do it is almost
exactly what you have here: you're iterating over a local Scala
collection of dates and submitting a Spark job for each. Write
"dateList.par.map(...)" to make the local map run in parallel.
Spark's scheduler is thread-safe, so the jobs submitted from those
threads run simultaneously and share the executors.
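
For example, here's a minimal, self-contained sketch of that pattern. The
paths, the contents of dateList, and the transform are placeholders, and I've
substituted textFile/saveAsTextFile for your pseudocode's
hdfsFile/saveAsHadoopFile:

import org.apache.spark.{SparkConf, SparkContext}

object ParallelDailyJobs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parallel-daily-jobs"))

    // Placeholder inputs: one HDFS directory per date plus a per-record transform.
    val dateList = Seq("2015-01-07", "2015-01-08", "2015-01-09")
    val transform = (line: String) => line.toUpperCase

    // .par turns dateList into a parallel collection, so each closure runs on
    // its own driver thread, and the Spark jobs those threads submit run
    // concurrently. (On Scala 2.13+ this also needs
    // import scala.collection.parallel.CollectionConverters._)
    dateList.par.foreach { date =>
      sc.textFile(s"hdfs:///input/$date")         // stands in for sc.hdfsFile(date)
        .map(transform)
        .saveAsTextFile(s"hdfs:///output/$date")  // stands in for saveAsHadoopFile(date)
    }

    sc.stop()
  }
}

I used foreach rather than map only because saveAsTextFile returns Unit;
par.map works the same way. By default the concurrent jobs are scheduled
FIFO; setting spark.scheduler.mode to FAIR lets them share the executors
more evenly.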

On Fri, Jan 9, 2015 at 10:46 AM, Anders Arpteg <arp...@spotify.com> wrote:
> Hey,
>
> Let's say we have multiple independent jobs that each transform some data and
> store it in distinct HDFS locations. Is there a nice way to run them in
> parallel? See the following pseudo-code snippet:
>
> dateList.map(date =>
>   sc.hdfsFile(date).map(transform).saveAsHadoopFile(date))
>
> It's unfortunate if they run in sequence, since the executors aren't being
> used efficiently. What's the best way to parallelize execution of these
> jobs?
>
> Thanks,
> Anders
