[ https://issues.apache.org/jira/browse/SPARK-19046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15795004#comment-15795004 ]

Assaf Mendelson commented on SPARK-19046:
-----------------------------------------

Currently, behind the scenes, checkpointing uses RDD serialization to disk. I would say we 
should use the DataFrame writer to do the serialization instead (and the DataFrame 
reader for the deserialization).

Even a DataFrameWriter.save followed by a DataFrameReader.load would give better 
performance; however, I would imagine that linking to the cached version, if 
available, would do better still, since it would save the loading time.
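A minimal sketch of that manual alternative, assuming some DataFrame df and an
illustrative output path (neither is taken from the issue itself):

    // Write with the DataFrame writer (columnar, compressed parquet on disk),
    // then read it back with the DataFrame reader; the reloaded DataFrame no
    // longer carries the original lineage.
    val path = "/tmp/manual-checkpoint"      // hypothetical location
    df.write.mode("overwrite").parquet(path)
    val reloaded = spark.read.parquet(path)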

The main reason to use checkpoint, as opposed to doing the save/read manually, is 
that Spark does a lot of the management for us (e.g. it automatically deletes old 
checkpoints when doing streaming).
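For comparison, the built-in path under discussion looks roughly like this
(the checkpoint directory is an assumption for the example):

    // Dataset.checkpoint() requires a checkpoint directory and is eager by
    // default; today it persists the underlying RDD representation there.
    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
    val df = spark.range(100000000L)
    val checkpointed = df.checkpoint()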


> Dataset checkpoint consumes too much disk space
> -----------------------------------------------
>
>                 Key: SPARK-19046
>                 URL: https://issues.apache.org/jira/browse/SPARK-19046
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Assaf Mendelson
>
> Consider the following simple example:
> val df = spark.range(100000000)
> df.cache()
> df.count()
> df.checkpoint()
> df.write.parquet("/test1")
> Looking at the Storage tab of the UI, the dataframe takes 97.5 MB.
> Looking at the checkpoint directory, the checkpoint takes 3.3 GB (33 times larger!).
> Looking at the parquet directory, the dataframe takes 386 MB.
> Similar behavior can be seen on less synthetic examples.


