Re: Queue independent jobs

2015-01-09 Thread Sean Owen
You can parallelize on the driver side. The way to do it is almost
exactly what you have here, where you're iterating over a local Scala
collection of dates and invoking a Spark operation for each. Simply
write dateList.par.map(...) to make the local map proceed in
parallel. It should invoke the Spark jobs simultaneously.
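
For example, here's a minimal sketch of that approach (assuming dateList is a
plain Seq of date strings; I'm using sc.textFile / saveAsTextFile in place of
the hdfsFile / saveAsHadoopFile calls from your pseudo code, and the paths and
the transform are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object ParallelDateJobs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parallel-date-jobs"))

    // Placeholder inputs -- substitute your real dates and transform here.
    val dateList = Seq("2015-01-07", "2015-01-08", "2015-01-09")
    val transform = (line: String) => line.toUpperCase

    // .par turns the local Seq into a parallel collection, so each save
    // action is submitted from its own driver thread and the resulting Spark
    // jobs can run concurrently (SparkContext is thread-safe). foreach is
    // used instead of map since we only care about the side effect.
    dateList.par.foreach { date =>
      sc.textFile(s"hdfs:///input/$date")
        .map(transform)
        .saveAsTextFile(s"hdfs:///output/$date")
    }

    sc.stop()
  }
}

Note that the default scheduler is FIFO, so the first job submitted gets
resources first; setting spark.scheduler.mode=FAIR lets the concurrent jobs
share executors more evenly.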

On Fri, Jan 9, 2015 at 10:46 AM, Anders Arpteg arp...@spotify.com wrote:
 Hey,

 Let's say we have multiple independent jobs that each transform some data
 and store it in distinct HDFS locations. Is there a nice way to run them in
 parallel? See the following pseudo code snippet:

 dateList.map(date =>
   sc.hdfsFile(date).map(transform).saveAsHadoopFile(date))

 It's unfortunate if they run in sequence, since the executors are not used
 efficiently. What's the best way to parallelize execution of these jobs?

 Thanks,
 Anders




Re: Queue independent jobs

2015-01-09 Thread Anders Arpteg
Awesome, it actually seems to work. Amazing how simple it can be
sometimes...

Thanks Sean!

On Fri, Jan 9, 2015 at 12:42 PM, Sean Owen so...@cloudera.com wrote:

 You can parallelize on the driver side. The way to do it is almost
 exactly what you have here, where you're iterating over a local Scala
 collection of dates and invoking a Spark operation for each. Simply
 write dateList.par.map(...) to make the local map proceed in
 parallel. It should invoke the Spark jobs simultaneously.

 On Fri, Jan 9, 2015 at 10:46 AM, Anders Arpteg arp...@spotify.com wrote:
  Hey,
 
  Let's say we have multiple independent jobs that each transform some data
  and store it in distinct HDFS locations. Is there a nice way to run them in
  parallel? See the following pseudo code snippet:
 
  dateList.map(date =>
    sc.hdfsFile(date).map(transform).saveAsHadoopFile(date))
 
  It's unfortunate if they run in sequence, since the executors are not used
  efficiently. What's the best way to parallelize execution of these jobs?
 
  Thanks,
  Anders



Queue independent jobs

2015-01-09 Thread Anders Arpteg
Hey,

Let's say we have multiple independent jobs that each transform some data
and store it in distinct HDFS locations. Is there a nice way to run them in
parallel? See the following pseudo code snippet:

dateList.map(date =>
  sc.hdfsFile(date).map(transform).saveAsHadoopFile(date))

It's unfortunate if they run in sequence, since the executors are not used
efficiently. What's the best way to parallelize execution of these jobs?

Thanks,
Anders