[ https://issues.apache.org/jira/browse/SPARK-19046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15791128#comment-15791128 ]
Sean Owen commented on SPARK-19046:
-----------------------------------

I don't think that's a bug, because you're storing 100M integers. The fact that it's a range of integers is only available to the in-memory representation. Although optimizing things is nice, this seems like a toy example.

> Dataset checkpoint consumes too much disk space
> -----------------------------------------------
>
>                 Key: SPARK-19046
>                 URL: https://issues.apache.org/jira/browse/SPARK-19046
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Assaf Mendelson
>
> Consider the following simple example:
>
>     val df = spark.range(100000000)
>     df.cache()
>     df.count()
>     df.checkpoint()
>     df.write.parquet("/test1")
>
> Looking at the storage tab of the UI, the dataframe takes 97.5 MB.
> Looking at the checkpoint directory, the checkpoint takes 3.3 GB (33 times larger!)
> Looking at the parquet directory, the dataframe takes 386 MB.
> Similar behavior can be seen on less synthetic examples.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
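The size gap is consistent with the comment above: 100M 64-bit longs are ~800 MB raw, and a checkpoint that serializes each row individually (with per-record overhead, no columnar encoding or compression) can easily exceed that, while the in-memory cache and the parquet file encode/compress the values and come in well under it. A minimal back-of-the-envelope sketch, assuming 8 bytes per Long and using the sizes reported in the issue (the object and method names here are illustrative, not Spark APIs):

```scala
// Rough size arithmetic for spark.range(100000000), i.e. 100M Long rows.
// Assumption: each value is a 64-bit Long (8 bytes raw).
object SizeSketch {
  def mib(bytes: Double): Double = bytes / (1024.0 * 1024.0)

  def main(args: Array[String]): Unit = {
    val rows = 100000000L
    val rawBytes = rows * 8L                       // 800,000,000 bytes of raw values
    println(f"raw longs:            ${mib(rawBytes.toDouble)}%.0f MiB")

    // Reported sizes from the issue: 3.3 GB checkpoint vs 97.5 MB cache
    // vs 386 MB parquet. The checkpoint works out to ~35 bytes per row,
    // i.e. several times the 8-byte payload, consistent with serialized
    // per-row storage; cache and parquet land below 800 MB because both
    // encode/compress the data.
    val checkpointBytes = 3.3 * 1024 * 1024 * 1024
    println(f"checkpoint bytes/row: ${checkpointBytes / rows}%.1f")
  }
}
```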