[jira] [Commented] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet
[ https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002529#comment-17002529 ] Xiao Li commented on SPARK-30316:
---
The compression ratio depends on your data layout, not on the number of rows.

> data size boom after shuffle writing dataframe save as parquet
> --
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, SQL
> Affects Versions: 2.4.4
> Reporter: Cesc
> Priority: Major
>
> When I read the same Parquet file and save it in two ways, with and without
> a shuffle, the sizes of the output Parquet files are quite different. For
> example, starting from an 800 MB Parquet file: if I save it without a
> shuffle, the output is still 800 MB, whereas if I repartition the dataframe
> and then save it in Parquet format, the size grows to 2.5 GB. The row
> counts, column counts, and contents of the two output files are identical.
> I wonder:
> firstly, why does the data size increase after a repartition/shuffle?
> secondly, if I need to shuffle the input dataframe, how can I save it as a
> Parquet file efficiently and avoid the size blow-up?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet
[ https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002049#comment-17002049 ] Cesc commented on SPARK-30316:
---
However, the row counts of the two results are the same.
[jira] [Commented] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet
[ https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001833#comment-17001833 ] Terry Kim commented on SPARK-30316:
---
This scenario is possible because when you repartition/shuffle the data, the
stored values can be reordered such that the compression ratio becomes worse.
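The effect Terry Kim describes can be reproduced outside Spark. The sketch below is only an illustration of the general principle (it uses zlib on a synthetic column, not Parquet's actual encodings): the same values compress far better when clustered than when randomly ordered, which is what a repartition can do to the row layout feeding the Parquet writer.

```python
import random
import zlib

# A low-cardinality column, initially clustered by value (as it might be
# after reading a sorted Parquet file).
values = sorted(f"user_{i % 100:03d}" for i in range(100_000))
clustered = "\n".join(values).encode()

# The same values after a random reordering, analogous to what a
# shuffle/repartition can do to row order within output files.
shuffled_values = values[:]
random.Random(42).shuffle(shuffled_values)
shuffled = "\n".join(shuffled_values).encode()

size_clustered = len(zlib.compress(clustered))
size_shuffled = len(zlib.compress(shuffled))

# Identical content, very different compressed sizes.
print(f"clustered: {size_clustered} bytes")
print(f"shuffled:  {size_shuffled} bytes")
```

If the shuffle is unavoidable, one commonly suggested mitigation is to restore locality before writing, e.g. `df.repartition(n, col).sortWithinPartitions(col)` so that similar values land next to each other again; whether this helps depends on which columns dominate the file size.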