Re: Parallelize independent tasks

2014-12-02 Thread Victor Tso-Guillen
dirs.par.foreach { case (src, dest) =>
  sc.textFile(src).process.saveAsFile(dest)
}

Is that sufficient for you?
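
For reference, here is a slightly fuller sketch of the same idea. It assumes dirs is a Seq[(String, String)] of (src, dest) pairs, and it replaces your process step with a placeholder map since that method isn't shown; the FAIR scheduler setting and the pool size of 4 are just illustrative choices, not requirements:

import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

import org.apache.spark.{SparkConf, SparkContext}

object ParallelDirs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("parallel-independent-jobs")
      // FAIR scheduling lets concurrent jobs share executors instead of
      // queueing strictly behind one another; helpful when some jobs only
      // have one partition.
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Placeholder pairs; in your case these come from scanning HDFS.
    val dirs: Seq[(String, String)] = Seq(
      ("hdfs:///input/a", "hdfs:///output/a"),
      ("hdfs:///input/b", "hdfs:///output/b"))

    // A parallel collection submits each job from its own thread, so the
    // scheduler can run the independent jobs concurrently.
    val parDirs = dirs.par
    // Cap how many jobs are submitted at once (4 is arbitrary).
    parDirs.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))

    parDirs.foreach { case (src, dest) =>
      sc.textFile(src)
        .map(identity)   // stand-in for your process step
        .saveAsTextFile(dest)
    }

    sc.stop()
  }
}

SparkContext is thread-safe for submitting jobs, so the same sc can be shared across those threads; if you want finer control you can do the same thing with Futures and your own ExecutionContext.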

On Tuesday, December 2, 2014, Anselme Vignon wrote:

> Hi folks,
>
>
> We have written a Spark job that scans multiple HDFS directories and
> performs transformations on them.
>
> For now, this is done with a simple for loop that starts one task at
> each iteration. This looks like:
>
> dirs.foreach { case (src, dest) =>
>   sc.textFile(src).process.saveAsFile(dest)
> }
>
>
> However, each iteration is independent, and we would like to optimize this
> by running them with Spark simultaneously (or in a chained fashion), so that
> we don't have idle executors at the end of each iteration (some directories
> sometimes only require one partition).
>
>
> Has anyone already done such a thing? How would you suggest we could do
> that?
>
> Cheers,
>
> Anselme
>


Parallelize independent tasks

2014-12-02 Thread Anselme Vignon
Hi folks,


We have written a Spark job that scans multiple HDFS directories and
performs transformations on them.

For now, this is done with a simple for loop that starts one task at
each iteration. This looks like:

dirs.foreach { case (src, dest) => sc.textFile(src).process.saveAsFile(dest) }


However, each iteration is independent, and we would like to optimize this
by running them with Spark simultaneously (or in a chained fashion), so that
we don't have idle executors at the end of each iteration (some directories
sometimes only require one partition).


Has anyone already done such a thing? How would you suggest we could do that?

Cheers,

Anselme
