[ https://issues.apache.org/jira/browse/SPARK-19046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15791468#comment-15791468 ]
Assaf Mendelson commented on SPARK-19046:
-----------------------------------------

This is an easily constructed example. I see a factor of ~10 on real-life data as well, compared both to a Parquet file and to the in-memory size. In fact, I got better performance (as well as a smaller disk footprint) by simply saving the dataframe to Parquet and reloading it.

> Dataset checkpoint consumes too much disk space
> -----------------------------------------------
>
>             Key: SPARK-19046
>             URL: https://issues.apache.org/jira/browse/SPARK-19046
>         Project: Spark
>      Issue Type: Bug
>      Components: SQL
>        Reporter: Assaf Mendelson
>
> Consider the following simple example:
>
> val df = spark.range(100000000)
> df.cache()
> df.count()
> df.checkpoint()
> df.write.parquet("/test1")
>
> Looking at the storage tab of the UI, the dataframe takes 97.5 MB.
> Looking at the checkpoint directory, the checkpoint takes 3.3 GB (33 times larger!).
> Looking at the parquet directory, the dataframe takes 386 MB.
> Similar behavior can be seen on less synthetic examples.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
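The Parquet save-and-reload workaround mentioned in the comment can be sketched as follows. This is a minimal sketch, not code from the ticket: it assumes an existing `SparkSession` named `spark`, and the output path is a hypothetical placeholder. Like `Dataset.checkpoint()`, reading the data back yields a Dataset whose plan no longer carries the original lineage, but the on-disk representation uses Parquet's columnar encoding and compression rather than the checkpoint serialization format that the ticket reports as ~10-33x larger.

```scala
// Sketch of the workaround described above: instead of df.checkpoint(),
// write the Dataset out as Parquet and read it back.
// Assumes a running SparkSession `spark`; the path is a hypothetical example.
val df = spark.range(100000000L)

val path = "/tmp/df_parquet_snapshot" // hypothetical location
df.write.mode("overwrite").parquet(path)

// The reloaded Dataset's plan starts from the Parquet files,
// so downstream stages do not recompute df's lineage.
val dfReloaded = spark.read.parquet(path)
```

Note that `df.checkpoint()` in the reproduction above also returns a new Dataset rather than checkpointing `df` in place, so a caller would normally capture its result; the reproduction discards it, which is fine for measuring the checkpoint directory's size.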