Try passing the shuffle=true parameter to coalesce, then it will do the map in 
parallel but still pass all the data through one reduce node for writing it 
out. That’s probably the fastest it will get. No need to cache if you do that.
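That suggestion, sketched in Scala against the pipeline from the quoted message below (the `toCsvLine` function and the output path are stand-ins for the app's actual logic; `coalesce`'s `shuffle` argument defaults to false):

```scala
// With shuffle = true, the expensive map still runs in parallel across
// all upstream partitions; only the final write funnels through a
// single task, producing one output file (part-00000).
rdd.map(toCsvLine _)              // toCsvLine: stand-in for the app's reduction to CSV
   .coalesce(1, shuffle = true)   // without shuffle = true, the map itself would run in 1 task
   .saveAsTextFile("output")      // illustrative path
```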

Matei

On Mar 21, 2014, at 4:04 PM, Aureliano Buendia <buendia...@gmail.com> wrote:

> Hi,
> 
> Our Spark app reduces a few hundred GB of data to a few hundred KB of CSV. 
> We found that a partition count of 1000 is a good number to speed the 
> process up. However, it does not make sense to end up with 1000 CSV files, 
> each less than 1 KB.
> 
> We used RDD.coalesce(1) to get only 1 csv file, but it's extremely slow, and 
> we are not properly using our resources this way. So this is very slow:
> 
> rdd.map(...).coalesce(1).saveAsTextFile()
> 
> How is it possible to use coalesce(1) simply for concatenating the 
> materialized output text files? Would something like this make sense?:
> 
> rdd.map(...).coalesce(100).coalesce(1).saveAsTextFile()
> 
> Or, would something like this achieve it?:
> 
> rdd.map(...).cache().coalesce(1).saveAsTextFile()
