Hello folks, in Spark I have read a file, applied some transformations, and finally written the result to HDFS.
Now I want to write the same DataFrame to MapRFS as well, but for this Spark will execute the full DAG again (recomputing the read and all the transformations). I don't want this recomputation, so I decided to cache() the DataFrame so that the 2nd/nth write won't recompute all the steps.

But here is the catch: cache() takes a long time to persist the data in memory. My question is: if the DataFrame is already in memory, why does it take so long just to save it to another space in memory (about 6 minutes for 3.2 GB of data)? Which operations inside cache() take such a long time? I would appreciate it if someone could share some information.
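To make the setup concrete, here is a minimal sketch of the pattern described above, assuming the PySpark DataFrame API and Parquet output (the post does not say which API or output format is actually used). The function name `write_to_both` and both path arguments are hypothetical. One thing worth keeping in mind when reading timings: in Spark, `cache()` is lazy, so it only marks the DataFrame for caching, and the data is materialized by the first action that runs afterwards.

```python
def write_to_both(df, first_path, second_path):
    """Write one DataFrame to two locations while caching it in between.

    `df` is expected to behave like a pyspark.sql.DataFrame.
    The Parquet format and overwrite mode are assumptions for illustration.
    """
    # cache() is lazy: nothing is computed or stored at this point.
    cached = df.cache()
    try:
        # First action: runs the read + transformations and populates the
        # cache as the partitions are computed.
        cached.write.mode("overwrite").parquet(first_path)
        # Second action: can reuse the cached partitions instead of
        # re-running the lineage (as long as they were not evicted).
        cached.write.mode("overwrite").parquet(second_path)
    finally:
        # Release the cached blocks once both writes are done.
        cached.unpersist()
```

With a real session this would be called with the transformed DataFrame and two destination URIs, for example one HDFS path and one MapRFS path (placeholders, not taken from the post). If the cache is filled by a separate action such as `count()` before the first write, the time of that action includes the whole read and transformation work, not only the copy into memory.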