Hi!

I would like to respond only to the first part of your mail:

> I have a large RDD dataset of around 60-70 GB which I cannot send to driver
> using *collect* so first writing that to disk using  *saveAsTextFile* and
> then this data gets saved in the form of multiple part files on each node
> of the cluster and after that driver reads the data from that storage.


What is your use case here?

As you mention *collect()*, I assume you have to process the data
outside of Spark, maybe with a 3rd-party tool, is that right?

If you have 60-70 GB of data and you write it out as text files and then read
it back within the same application, you still cannot call *collect()* on it,
as it is still 60-70 GB of data, right?

On the other hand, is your data really just a collection of strings without
any repetitions? I ask this because of the file format you are using: text
files. Even for text files you can at least pass a compression codec as the
2nd argument of *saveAsTextFile()*
<https://spark.apache.org/docs/3.1.1/api/scala/org/apache/spark/rdd/RDD.html#saveAsTextFile(path:String,codec:Class[_%3C:org.apache.hadoop.io.compress.CompressionCodec]):Unit>
(when you use this link you might need to scroll up a little bit; at least my
Chrome displays the *saveAsTextFile* method without the 2nd codec argument).
As IO is slow, compressed data can be read back quicker, simply because there
is less data on disk. Check the Snappy
<https://en.wikipedia.org/wiki/Snappy_(compression)> codec for example.
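
For example, a minimal untested sketch (the paths and the RDD are just
placeholders standing in for your data, and Snappy needs the Hadoop native
libraries to be available on the executors):

    import org.apache.hadoop.io.compress.SnappyCodec
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("compressed-text-sketch").getOrCreate()
    val sc = spark.sparkContext

    // stand-in for your 60-70 GB RDD; the input path is hypothetical
    val bigRdd = sc.textFile("hdfs:///tmp/big-input")

    // each part-NNNNN file is written Snappy-compressed
    bigRdd.saveAsTextFile("hdfs:///tmp/big-output-snappy", classOf[SnappyCodec])

    // textFile() picks up the codec from the file extension when reading back
    val readBack = sc.textFile("hdfs:///tmp/big-output-snappy")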

But if your data has structure and you plan to process it further within
Spark, then please consider something way better: a columnar storage format,
namely ORC or Parquet.
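
Just to illustrate what I mean, another untested sketch (the Record case class
and the paths are made up for the example):

    import org.apache.spark.sql.SparkSession

    // made-up schema just for the example
    case class Record(id: Long, value: String)

    val spark = SparkSession.builder().appName("parquet-sketch").getOrCreate()
    import spark.implicits._

    val ds = spark.createDataset(Seq(Record(1L, "a"), Record(2L, "b")))

    // Parquet is columnar and compressed with Snappy by default,
    // so a later Spark job reads only the columns it actually needs
    ds.write.mode("overwrite").parquet("hdfs:///tmp/big-dataset-parquet")

    val back = spark.read.parquet("hdfs:///tmp/big-dataset-parquet")

With Parquet (or ORC) you also get predicate pushdown, so filters in later
jobs can skip whole blocks of data instead of reading everything back.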

Best Regards,
Attila


On Sun, Mar 21, 2021 at 3:40 AM Ranju Jain <ranju.j...@ericsson.com.invalid>
wrote:

> Hi All,
>
>
>
> I have a large RDD dataset of around 60-70 GB which I cannot send to
> driver using *collect* so first writing that to disk using
> *saveAsTextFile* and then this data gets saved in the form of multiple
> part files on each node of the cluster and after that driver reads the data
> from that storage.
>
>
>
> I have a question: *spark.local.dir* is the directory which is used
> as scratch space where map output files and RDDs that need to be written to
> disk by Spark for shuffle operations etc. are stored.
>
> And there it is strongly recommended to use *local and fast disk *to
> avoid any failure or performance impact.
>
>
>
> *Do we have any such recommendation for storing multiple part files of
> large dataset [ or Big RDD ] in fast disk?*
>
> This will help me to configure the right type of disk for the resulting
> part files.
>
>
>
> Regards
>
> Ranju
>
