I am trying to understand Spark execution in the case of Datasets.

For RDDs, I found the following in the Spark docs:

Shuffle also generates a large number of intermediate files on disk. As of 
Spark 1.3, these files are preserved until the corresponding RDDs are no longer 
used and are garbage collected. This is done so the shuffle files don't need to 
be re-created if the lineage is re-computed.

I tried running a similar thing with both an RDD and a Dataset, but I don't see
skipped stages in the Dataset execution. Is there any hint I need to add in the
code to preserve the shuffle? That is, I want the Dataset to share shuffle files
between jobs.
A code sample is available here:
https://stackoverflow.com/questions/54848119/dont-find-skipped-stages-in-spark-dataset
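
For reference, here is a minimal sketch of the kind of comparison I am running
(illustrative names and data, not the exact code from the link above):

    import org.apache.spark.sql.SparkSession

    object ShuffleReuseSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("ShuffleReuseSketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._
        val sc = spark.sparkContext

        // RDD case: groupByKey shuffles once; the second action reuses the
        // preserved shuffle files, and the Spark UI shows the stage as skipped.
        val rdd = sc.parallelize(1 to 1000000).map(x => (x % 100, x)).groupByKey()
        rdd.count() // job 1: executes the shuffle stage
        rdd.count() // job 2: shuffle stage appears as "skipped" in the UI

        // Dataset case: the same shape of computation, but in my runs each
        // action re-executes the shuffle, and no stages are skipped.
        val ds = (1 to 1000000).toDS()
          .map(x => (x % 100, x))
          .groupByKey(_._1)
          .mapGroups((k, it) => (k, it.size))
        ds.count() // job 1: executes the shuffle
        ds.count() // job 2: shuffle runs again, no skipped stages

        spark.stop()
      }
    }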

Regards,
Dhaval
