Assaf Mendelson created SPARK-19046:
------------------------------------
Summary: Dataset checkpoint consumes too much disk space
Key: SPARK-19046
URL: https://issues.apache.org/jira/browse/SPARK-19046
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Assaf Mendelson

Consider the following simple example (a checkpoint directory is assumed to have been set beforehand via spark.sparkContext.setCheckpointDir):

    val df = spark.range(100000000)
    df.cache()
    df.count()
    df.checkpoint()
    df.write.parquet("/test1")

Looking at the storage tab of the UI, the cached DataFrame takes 97.5 MB.
Looking at the checkpoint directory, the checkpoint takes 3.3 GB (33 times larger!).
Looking at the parquet output directory, the same data takes 386 MB.
Similar behavior can be seen in less synthetic examples.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
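A gap of this magnitude would be consistent with the checkpoint writing serialized row data as fixed-width, uncompressed records, while parquet applies columnar encoding and compression. That explanation is an assumption, not something confirmed in the report; the effect itself can be sketched without Spark by storing the same sequence of longs both ways:

```python
# Spark-free sketch (assumption: checkpoint-like storage is row-by-row,
# fixed-width, and uncompressed; parquet-like storage is compressed).
import struct
import gzip

n = 1_000_000

# "Checkpoint-like": each value stored as an uncompressed 8-byte record.
raw = b"".join(struct.pack("<q", i) for i in range(n))

# "Parquet-like": the same bytes after general-purpose compression,
# standing in for columnar encoding plus a compression codec.
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
```

On highly regular data like a sequential range, the uncompressed form is several times larger, which mirrors the checkpoint-vs-parquet disparity seen above (the real ratio depends on the data and codec).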