
I would like to respond only to the first part of your mail:

> I have a large RDD dataset of around 60-70 GB which I cannot send to driver
> using *collect* so first writing that to disk using  *saveAsTextFile* and
> then this data gets saved in the form of multiple part files on each node
> of the cluster and after that driver reads the data from that storage.

What is your use case here?

As you mention *collect()*, I assume you have to process the data
outside of Spark, maybe with a 3rd party tool, is that right?

If you have 60-70 GB of data and you write it to a text file, then read it
back within the same application, you still cannot call *collect()* on
it, as it is still 60-70 GB of data, right?

On the other hand, is your data really just a collection of strings without
any repetitions? I ask this because of the file format you are using: text
file. Even for a text file you can at least pass a compression codec as the
2nd argument of *saveAsTextFile()* (if you use this link you might need to
scroll up a little bit; at least my Chrome displays the *saveAsTextFile*
method without the 2nd arg codec). As I/O is slow, compressed data could be
read back quicker, as there will be less data on disk. Check the Snappy
<https://en.wikipedia.org/wiki/Snappy_(compression)> codec for example.
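
To make the codec argument concrete, here is a minimal PySpark sketch. The
helper name *save_compressed* and the paths are made up for illustration. In
PySpark the codec is passed as the optional *compressionCodecClass* argument,
a string naming a Hadoop codec class. I default to Gzip here only because
whether the Snappy codec works depends on the native libraries available on
your cluster.

```python
# Fully qualified Hadoop codec class names (PySpark takes them as strings).
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"
SNAPPY_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"


def save_compressed(rdd, path, codec=GZIP_CODEC):
    """Write an RDD of strings as compressed part files under `path`.

    `rdd` is expected to be a pyspark RDD. `path` is a hypothetical
    output directory; Spark refuses to write if it already exists.
    """
    rdd.saveAsTextFile(path, compressionCodecClass=codec)
    return path
```

With a live SparkContext *sc* you could call something like
*save_compressed(sc.parallelize(["a", "b"]), "/tmp/out-gz")* and read the
result back with *sc.textFile("/tmp/out-gz")*, which picks the codec from the
part file extension.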

But if your data has a structure and you plan to process it further
within Spark, then please consider something way better: a columnar
storage format, namely ORC or Parquet.
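
A rough sketch of what that could look like in PySpark, assuming your RDD
holds tuples and you can name the columns. The helper names, column names and
paths below are hypothetical; *spark* is assumed to be an existing
SparkSession.

```python
def save_as_parquet(spark, rdd, columns, path):
    """Convert an RDD of tuples into a DataFrame and write it as Parquet.

    `columns` is a list of column names matching the tuple fields;
    `path` is a hypothetical output directory.
    """
    df = spark.createDataFrame(rdd, columns)
    df.write.parquet(path)  # use df.write.orc(path) for ORC instead
    return df


def read_columns(spark, path, wanted):
    """Read the Parquet data back, keeping only the columns in `wanted`."""
    return spark.read.parquet(path).select(*wanted)
```

The point of the second helper is that with a columnar format a later job can
select just the columns it needs instead of reading every line in full, as it
would have to with text part files.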

On Sun, Mar 21, 2021 at 3:40 AM Ranju Jain

> Hi All,
> I have a question like *spark.local.dir* is the directory which is used
> as a scratch space where mapoutputs files and RDDs might need to write by
> spark for shuffle operations etc.
> And there it is strongly recommended to use *local and fast disk *to
> avoid any failure or performance impact.
> *Do we have any such recommendation for storing multiple part files of
> large dataset [ or Big RDD ] in fast disk?*
> This will help me to configure the write type of disk for resulting part
> files.
