Hi All,

I have a large RDD of around 60-70 GB, which is too big to send to the driver using collect(). So I first write it to disk using saveAsTextFile(); the data gets saved as multiple part files on the nodes of the cluster, and after that the driver reads the data back from that storage.
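To make the pattern concrete, here is a minimal PySpark sketch of what I am doing; the paths and app name below are placeholders, not my actual configuration:

```python
# Illustrative sketch only: write a large RDD out with saveAsTextFile()
# instead of collect()-ing it to the driver. Paths are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="save-large-rdd")

# ~60-70 GB input in my scenario
big_rdd = sc.textFile("hdfs:///data/input")

# Each partition is written as one part-NNNNN file under the output
# directory, by the executor that holds that partition.
big_rdd.saveAsTextFile("hdfs:///data/output")

# The driver (or a later job) can then read the part files back, e.g.:
# reloaded = sc.textFile("hdfs:///data/output")
```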
My question: spark.local.dir is the directory Spark uses as scratch space, where map output files and RDDs that spill to disk get written during shuffle operations etc., and the documentation strongly recommends putting it on a fast, local disk to avoid failures or performance impact. Is there any similar recommendation for the storage that holds the multiple part files of a large dataset [or big RDD]? This would help me choose the right type of disk for the resulting part files.

Regards
Ranju
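For reference, this is how I understand the scratch-space recommendation is usually applied; the mount points below are illustrative assumptions, not my actual setup:

```
# spark-defaults.conf (illustrative fragment)
# Scratch space for shuffle map outputs and spilled RDD blocks.
# The Spark docs recommend fast, node-local disks (e.g. local SSDs);
# multiple comma-separated directories spread I/O across devices.
spark.local.dir  /mnt/ssd1/spark-scratch,/mnt/ssd2/spark-scratch
```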