Cesc  created SPARK-30316:
-----------------------------

             Summary: data size balloons after shuffle when saving a DataFrame 
as Parquet
                 Key: SPARK-30316
                 URL: https://issues.apache.org/jira/browse/SPARK-30316
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle, SQL
    Affects Versions: 2.4.4
            Reporter: Cesc 


When I read the same Parquet file and then save it in two ways, with a shuffle 
and without one, the sizes of the output Parquet files are quite different. For 
example, given an input Parquet file of about 800 MB: saved without a shuffle, 
the output is still about 800 MB, whereas if I call repartition and then save 
it in Parquet format, the output grows to about 2.5 GB. The row counts, column 
counts, and contents of the two output files are identical.
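A minimal sketch of what I did (the paths and the partition count below are 
placeholders, not the exact values from my job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-size-after-shuffle")
  .getOrCreate()

val df = spark.read.parquet("/path/to/input.parquet")

// Save without a shuffle: output stays close to the input size (~800 MB).
df.write.mode("overwrite").parquet("/path/to/no_shuffle")

// Save after repartition (forces a round-robin shuffle): output grows to
// ~2.5 GB despite identical rows, columns, and contents.
df.repartition(200).write.mode("overwrite").parquet("/path/to/after_shuffle")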

I wonder:

firstly, why does the data size increase after a repartition/shuffle?

secondly, if I need to shuffle the input DataFrame, how can I save it as a 
Parquet file efficiently and avoid this size blow-up? (One guess is sketched 
below.)
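My current guess, and a possible workaround (a sketch only, not verified): 
Parquet's run-length and dictionary encodings compress best when similar 
values sit next to each other, and a round-robin repartition randomizes row 
order, so those encodings degrade. Re-sorting within each partition before 
the write may restore that locality; the column name here is hypothetical:

// Assumes the size growth comes from lost row ordering. sortWithinPartitions
// performs a local per-partition sort and does not trigger another shuffle.
df.repartition(200)
  .sortWithinPartitions("someLowCardinalityColumn")  // hypothetical column
  .write
  .mode("overwrite")
  .parquet("/path/to/after_shuffle_sorted")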



