[ https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002529#comment-17002529 ]
Xiao Li commented on SPARK-30316:
---------------------------------

The compression ratio depends on your data layout, not on the number of rows.

> data size boom after shuffle writing dataframe save as parquet
> --------------------------------------------------------------
>
>                 Key: SPARK-30316
>                 URL: https://issues.apache.org/jira/browse/SPARK-30316
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, SQL
>    Affects Versions: 2.4.4
>            Reporter: Cesc
>            Priority: Major
>
> When I read the same Parquet file and then save it in two ways, with a
> shuffle and without a shuffle, I find that the sizes of the output Parquet
> files are quite different. For example, given an original Parquet file of
> 800 MB, if I save it without a shuffle, the size is still 800 MB, whereas
> if I call repartition and then save it in Parquet format, the data size
> increases to 2.5 GB. The row count, column count, and content of the two
> output files are all the same.
> I wonder:
> first, why does the data size increase after repartition/shuffle?
> second, if I need to shuffle the input DataFrame, how can I save it as a
> Parquet file efficiently to avoid the data size boom?
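
The point about data layout can be illustrated with a small, self-contained sketch. It does not use Spark or Parquet: `zlib` stands in for a columnar codec, and the column values (`user_00000` and so on), row counts, and the `compressed_size` helper are all hypothetical. It only shows that the same rows compress very differently depending on whether equal values sit next to each other or are scattered, which is roughly what happens when a shuffle breaks up an ordering the input file had.

```python
import random
import zlib


def compressed_size(values):
    """Return the zlib-compressed size in bytes of newline-joined values."""
    payload = "\n".join(values).encode("utf-8")
    return len(zlib.compress(payload, 6))


# Hypothetical column: 200 distinct keys, each repeated 500 times (100,000 rows),
# laid out so that equal values are adjacent.
clustered = [f"user_{k:05d}" for k in range(200) for _ in range(500)]

# The same rows in random order, standing in for the layout after a shuffle.
rng = random.Random(42)  # fixed seed so the run is repeatable
shuffled = clustered[:]
rng.shuffle(shuffled)

clustered_bytes = compressed_size(clustered)
shuffled_bytes = compressed_size(shuffled)

print("rows:", len(clustered), "| same content:", sorted(shuffled) == clustered)
print("clustered layout:", clustered_bytes, "bytes")
print("shuffled layout: ", shuffled_bytes, "bytes")
```

If the size increase in the report has the same cause, one commonly tried mitigation is to restore locality before writing, for example by repartitioning on a column or by calling `sortWithinPartitions` on low-cardinality columns after `repartition`. Whether that recovers the original size depends on the actual data, so it should be treated as something to measure rather than a guaranteed fix.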