Hi All,

I have a large RDD of around 60-70 GB that I cannot send to the driver 
using collect(), so I first write it to disk using saveAsTextFile(). The 
data gets saved as multiple part files on each node of the cluster, and 
the driver then reads it back from that storage.

My question is about spark.local.dir: this is the directory Spark uses as 
scratch space, where map output files and RDDs may be written during 
shuffle operations and the like. For it, the documentation strongly 
recommends a fast, local disk to avoid failures and performance impact.
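For reference, spark.local.dir is usually set in spark-defaults.conf; the path below is only an illustrative example, to be adjusted for the actual cluster:

```
# spark-defaults.conf (illustrative path -- adjust to your cluster)
# Scratch space for shuffle spill and map output files; recommended on fast local disk
spark.local.dir    /mnt/fast-ssd/spark-scratch
```

A comma-separated list of directories can be given to spread scratch I/O across multiple disks.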

Is there a similar recommendation to store the multiple part files of a 
large dataset [or big RDD] on fast disk?
This would help me configure the right type of disk for the resulting 
part files.

Regards
Ranju
