caching a dataframe in Spark takes lot of time

Prem Sahoo Tue, 07 May 2024 13:30:58 -0700

Hello Folks,
In Spark I read a file, apply some transformations, and finally write the
result to HDFS.


Now I want to write the same dataframe to MapRFS as well, but for this
Spark will execute the full DAG again (recompute all the previous steps:
the read plus all the transformations).

I don't want this recomputation, so I decided to cache() the dataframe so
that the 2nd/nth write won't recompute all the steps.
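
Roughly, the pattern looks like the sketch below. It assumes PySpark and
Parquet output; write_to_both and the two path arguments are made-up names
for illustration, not the real job.

```python
def write_to_both(df, hdfs_path, maprfs_path):
    """Write one DataFrame to two targets, aiming to compute the DAG once.

    df is expected to be a pyspark.sql.DataFrame; both paths are
    placeholders supplied by the caller.
    """
    # cache() only marks the DataFrame for caching; nothing runs yet.
    cached = df.cache()
    try:
        # First action: runs the read + transformations and fills the
        # cache while producing the first output.
        cached.write.mode("overwrite").parquet(hdfs_path)
        # Second action: can reuse the cached partitions instead of
        # recomputing the earlier steps.
        cached.write.mode("overwrite").parquet(maprfs_path)
    finally:
        # Release the cached blocks once both writes are finished.
        cached.unpersist()
```

It would be called as write_to_both(df, "hdfs:///some/output",
"maprfs:///some/output"), where both URIs are again only examples.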

But here is the catch: cache() takes a long time to persist the data in
memory.

My question: if the dataframe is already in memory, why does it take so
long just to save it to another space in memory (3.2 GB of data, 6 minutes)?

May I know which operations in cache() take such a long time?

I would appreciate it if someone could share this information.
