Hi,
Our Spark app reduces a few hundred GB of data down to a few hundred KB of CSV. We
found that a partition count of 1000 is a good number to speed the process
up. However, it does not make sense to end up with 1000 CSV files, each
less than 1 KB.
We used RDD.coalesce(1) to get only one CSV file, but
Try passing the shuffle=true parameter to coalesce, then it will do the map in
parallel but still pass all the data through one reduce node for writing it
out. That’s probably the fastest it will get. No need to cache if you do that.
Matei
On Mar 21, 2014, at 4:04 PM, Aureliano Buendia wrote:
Good to know it's as simple as that! I wonder why shuffle=true is not the
default for coalesce().
On Fri, Mar 21, 2014 at 11:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Try passing the shuffle=true parameter to coalesce, then it will do the
map in parallel but still pass all the data
Ah, the reason is that coalesce is often used to deal with lots of small
input files on HDFS. In that case you don’t want to reshuffle them all across
the network; you just want each mapper to directly read multiple files (and you
want fewer mappers than files).
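That default use case might look like the sketch below. It is only illustrative: the input path is hypothetical, and `sc` is assumed to be an existing SparkContext.

```scala
// Many small HDFS files come in as many partitions. With the default
// shuffle = false, coalesce merges them locally: each of the 100
// resulting partitions reads several input files directly, and no
// data crosses the network.
val smallFiles = sc.textFile("hdfs:///input/many-small-files/*")
val merged = smallFiles.coalesce(100)
```

This is why shuffle defaults to false: for the small-files case, a full network shuffle would be pure overhead.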
Matei
.isSuccess).mapValues(value => value match { case Success(s) => s })
datatoSave.saveAsObjectFile("Outputall_new/outputall_RMS_ObjFiles")
Deenar
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-as-a-single-file-efficiently-tp3014p3021.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.