Hi folks,

We have written a Spark job that scans multiple HDFS directories and
performs transformations on them.

For now, this is done with a simple for loop that submits one Spark job
per iteration. It looks like:

dirs.foreach { case (src, dest) => sc.textFile(src).process.saveAsTextFile(dest) }


However, each iteration is independent, and we would like to optimize this
by having Spark run the jobs simultaneously (or at least overlapping them),
so that executors do not sit idle at the end of each iteration (some
directories occasionally need only one partition).
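
One idea we are toying with is a rough sketch along the following lines
(assuming the same SparkContext can be shared across threads; the thread
pool size of 4, the directory names, the identity map standing in for our
real process step, and the FAIR scheduler setting are all just
placeholders):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}

object ConcurrentDirs {
  def main(args: Array[String]): Unit = {
    // FAIR scheduling lets concurrently submitted jobs share executors
    // instead of queuing strictly behind each other (FIFO is the default).
    val conf = new SparkConf()
      .setAppName("concurrent-dirs")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Placeholder (src, dest) pairs standing in for our real directory list.
    val dirs: Seq[(String, String)] = Seq(
      ("hdfs:///input/a", "hdfs:///output/a"),
      ("hdfs:///input/b", "hdfs:///output/b")
    )

    // A fixed thread pool bounds how many jobs are in flight at once.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

    // Each Future submits one independent job; Spark can then schedule
    // their tasks side by side, so a small directory no longer leaves
    // the rest of the executors idle.
    val jobs = dirs.map { case (src, dest) =>
      Future {
        sc.textFile(src)
          .map(line => line) // placeholder for our real `process` step
          .saveAsTextFile(dest)
      }
    }

    // Block until every job has finished before stopping the context.
    Await.result(Future.sequence(jobs), Duration.Inf)
    sc.stop()
  }
}

but we are not sure whether this is the recommended way to go about it.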


Has anyone already done such a thing? How would you suggest we could do that?

Cheers,

Anselme
