Assaf Mendelson created SPARK-19046:
------------------------------------
Summary: Dataset checkpoint consumes too much disk space
Key: SPARK-19046
URL: https://issues.apache.org/jira/browse/SPARK-19046
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Assaf Mendelson

Consider the following simple example (a checkpoint directory is assumed to have been set beforehand via spark.sparkContext.setCheckpointDir):

    val df = spark.range(100000000)
    df.cache()
    df.count()
    df.checkpoint()
    df.write.parquet("/test1")

Looking at the storage tab of the UI, the cached DataFrame takes 97.5 MB.
Looking at the checkpoint directory, the checkpoint takes 3.3 GB (33 times larger!).
Looking at the parquet output directory, the same data takes 386 MB.
Similar behavior can be seen in less synthetic examples.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
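A gap of this magnitude would be consistent with the checkpoint writing serialized row data as fixed-width, uncompressed records, while parquet applies columnar encoding and compression. That explanation is an assumption, not something confirmed in the report; the effect itself can be sketched without Spark by storing the same sequence of longs both ways:

```python
# Spark-free sketch (assumption: checkpoint-like storage is row-by-row,
# fixed-width, and uncompressed; parquet-like storage is compressed).
import struct
import gzip

n = 1_000_000

# "Checkpoint-like": each value stored as an uncompressed 8-byte record.
raw = b"".join(struct.pack("<q", i) for i in range(n))

# "Parquet-like": the same bytes after general-purpose compression,
# standing in for columnar encoding plus a compression codec.
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
```

On highly regular data like a sequential range, the uncompressed form is several times larger, which mirrors the checkpoint-vs-parquet disparity seen above (the real ratio depends on the data and codec).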