Hi! I would like to respond only to the first part of your mail:
> I have a large RDD dataset of around 60-70 GB which I cannot send to
> driver using *collect* so first writing that to disk using
> *saveAsTextFile* and then this data gets saved in the form of multiple
> part files on each node of the cluster and after that driver reads the
> data from that storage.

What is your use case here? As you mention *collect()*, I assume you have
to process the data outside of Spark, maybe with a 3rd-party tool, is that
right? If you have 60-70 GB of data and you write it to a text file and
then read it back within the same application, you still cannot call
*collect()* on it, as it is still 60-70 GB of data, right?

On the other hand, is your data really just a collection of strings without
any repetitions? I ask this because of the file format you are using: text
file. Even for text files you can at least pass a compression codec as the
2nd argument of *saveAsTextFile()*
<https://spark.apache.org/docs/3.1.1/api/scala/org/apache/spark/rdd/RDD.html#saveAsTextFile(path:String,codec:Class[_%3C:org.apache.hadoop.io.compress.CompressionCodec]):Unit>
(when you use this link you might need to scroll up a little bit... at
least my Chrome displays the *saveAsTextFile* method without the 2nd codec
arg).

As IO is slow, compressed data can be read back more quickly, simply
because there is less data on disk. Check the Snappy
<https://en.wikipedia.org/wiki/Snappy_(compression)> codec for example;
there is a small code sketch for this below the quoted mail.

But if your data does have structure and you plan to process it further
within Spark, then please consider something much better: a columnar
storage format, namely ORC or Parquet (see the second sketch below).

Best Regards,
Attila

On Sun, Mar 21, 2021 at 3:40 AM Ranju Jain <ranju.j...@ericsson.com.invalid>
wrote:

> Hi All,
>
> I have a large RDD dataset of around 60-70 GB which I cannot send to
> driver using *collect* so first writing that to disk using
> *saveAsTextFile* and then this data gets saved in the form of multiple
> part files on each node of the cluster and after that driver reads the
> data from that storage.
>
> I have a question like *spark.local.dir* is the directory which is used
> as a scratch space where map output files and RDDs might need to be
> written by Spark for shuffle operations etc.
>
> And there it is strongly recommended to use *local and fast disk* to
> avoid any failure or performance impact.
>
> *Do we have any such recommendation for storing multiple part files of
> large dataset [ or Big RDD ] in fast disk?*
>
> This will help me to configure the right type of disk for the resulting
> part files.
>
> Regards
> Ranju
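
As promised, a minimal sketch of the compressed write, assuming *rdd* is
your RDD[String] and *sc* is the usual SparkContext from spark-shell; the
HDFS paths are placeholders, use your own:

    import org.apache.hadoop.io.compress.SnappyCodec

    // Write the RDD as Snappy-compressed text part files instead of plain text.
    rdd.saveAsTextFile("hdfs:///tmp/output-snappy", classOf[SnappyCodec])

    // Reading it back needs no extra code: textFile() picks the codec from
    // the .snappy file extension and decompresses transparently.
    val readBack = sc.textFile("hdfs:///tmp/output-snappy")

One thing to keep in mind: Snappy-compressed text files are not splittable,
so you get one partition per part file, which is usually fine when there
are many part files.

And for the columnar route, a sketch of turning the RDD into a DataFrame
and writing Parquet, assuming the usual spark-shell *spark* session. The
comma-split parsing and the column names are invented for the example,
adapt them to your real schema:

    import spark.implicits._

    // Parse each line into typed columns (made-up schema: an id and a value).
    val df = rdd
      .map(_.split(","))
      .map(a => (a(0).toLong, a(1)))
      .toDF("id", "value")

    df.write.parquet("hdfs:///tmp/output-parquet")

    // Later jobs can read back only the columns they need, which is where
    // a columnar format pays off:
    val values = spark.read.parquet("hdfs:///tmp/output-parquet").select("value")

Parquet is compressed with Snappy by default in Spark, so it will typically
be both smaller on disk and faster to read back than plain text.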