[jira] [Commented] (SPARK-19046) Dataset checkpoint consumes too much disk space

2017-01-03 Thread Assaf Mendelson (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795004#comment-15795004 ] Assaf Mendelson commented on SPARK-19046: Currently, behind the scenes we use RDD serialization

[jira] [Commented] (SPARK-19046) Dataset checkpoint consumes too much disk space

2017-01-03 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15794791#comment-15794791 ] Sean Owen commented on SPARK-19046: Yes, Parquet should have some optimizations for serializing cases

[jira] [Commented] (SPARK-19046) Dataset checkpoint consumes too much disk space

2017-01-01 Thread Assaf Mendelson (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15791468#comment-15791468 ] Assaf Mendelson commented on SPARK-19046: This is an easily created example. I see a factor of

[jira] [Commented] (SPARK-19046) Dataset checkpoint consumes too much disk space

2017-01-01 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15791128#comment-15791128 ] Sean Owen commented on SPARK-19046: I don't think that's a bug, because you're storing 100M integers.