[ https://issues.apache.org/jira/browse/SPARK-19046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15795004#comment-15795004 ]
Assaf Mendelson commented on SPARK-19046:
-----------------------------------------

Currently, behind the scenes, checkpointing uses RDD serialization to disk. We should instead use the DataFrame writer for the serialization (and the DataFrame reader for the deserialization). Even a DataFrameWriter.save followed by a DataFrameReader.load would perform better; linking to the cached version when available would do better still, since it would save the loading time. The main reason to use checkpoint rather than a manual save/read is that Spark does a lot of the management for us (e.g. it automatically deletes old checkpoints when doing streaming).

> Dataset checkpoint consumes too much disk space
> -----------------------------------------------
>
> Key: SPARK-19046
> URL: https://issues.apache.org/jira/browse/SPARK-19046
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Assaf Mendelson
>
> Consider the following simple example:
>
> val df = spark.range(100000000)
> df.cache()
> df.count()
> df.checkpoint()
> df.write.parquet("/test1")
>
> Looking at the Storage tab of the UI, the cached dataframe takes 97.5 MB.
> Looking at the checkpoint directory, the checkpoint takes 3.3 GB (33 times larger!).
> Looking at the parquet directory, the dataframe takes 386 MB.
> Similar behavior can be seen on less synthetic examples.
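The manual save/read workaround the comment describes could be sketched as below. This is an illustrative sketch, not code from the ticket: the path, the app name, and the local master are assumptions, and the caller takes over the cleanup that checkpoint() would otherwise handle. It requires a Spark runtime on the classpath, so it is not runnable standalone.

```scala
import org.apache.spark.sql.SparkSession

object ParquetCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-checkpoint-sketch") // illustrative name
      .master("local[*]")                   // illustrative; any master works
      .getOrCreate()

    val df = spark.range(100000000L)

    // df.checkpoint() serializes the underlying RDD rows to disk, which is
    // far less compact than a columnar format. A manual alternative: write
    // with the DataFrame writer (Parquet: columnar, compressed), then read
    // the result back, which also truncates the lineage.
    val path = "/tmp/manual-checkpoint" // illustrative path, not from the ticket
    df.write.mode("overwrite").parquet(path)
    val reloaded = spark.read.parquet(path)

    // `reloaded` has no lineage back to `df`, much like a checkpointed
    // Dataset, but the on-disk footprint is the Parquet size rather than
    // the much larger serialized-RDD size. Unlike checkpoint(), the caller
    // must delete `path` when it is no longer needed.
    reloaded.count()
    spark.stop()
  }
}
```

The trade-off is exactly the one the comment names: the save/read pair gives the smaller on-disk footprint, while checkpoint() gives automatic lifecycle management (e.g. old-checkpoint deletion in streaming).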