[jira] [Commented] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet

2019-12-23, Xiao Li (Jira)


[ https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002529#comment-17002529 ]

Xiao Li commented on SPARK-30316:
---------------------------------

The compression ratio depends on your data layout, not on the number of rows.
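
A minimal way to see that layout effect (a local-mode Scala sketch, not from the ticket; the /tmp paths and the single low-cardinality "category" column are made up for illustration): write the same rows once sorted and once repartitioned, then compare the bytes on disk.

{code:scala}
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object LayoutDemo {
  // Total size in bytes of all regular files under a directory.
  def dirSize(path: String): Long =
    Files.walk(Paths.get(path))
      .filter(p => Files.isRegularFile(p))
      .mapToLong(p => Files.size(p))
      .sum()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("layout-demo").getOrCreate()

    // 10 million rows but only 10 distinct values: ideal for
    // Parquet's dictionary and run-length encodings.
    val df = spark.range(10000000L).select((col("id") % 10).as("category"))

    // Sorted: each output file holds long runs of identical values.
    df.sort("category").write.mode("overwrite").parquet("/tmp/parquet_sorted")

    // Repartitioned: the ten values are interleaved throughout every file.
    df.repartition(200).write.mode("overwrite").parquet("/tmp/parquet_shuffled")

    println(s"sorted:        ${dirSize("/tmp/parquet_sorted")} bytes")
    println(s"repartitioned: ${dirSize("/tmp/parquet_shuffled")} bytes")

    spark.stop()
  }
}
{code}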

> data size boom after shuffle writing dataframe save as parquet
> --------------------------------------------------------------
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Affects Versions: 2.4.4
>Reporter: Cesc 
>Priority: Major
>
> When I read the same parquet file and then save it in two ways, with shuffle
> and without shuffle, I found that the sizes of the output parquet files are
> quite different. For example, given an original parquet file of 800 MB, if I
> save it without shuffle the size stays 800 MB, whereas if I repartition it and
> then save it in parquet format, the data size increases to 2.5 GB. The row
> counts, column counts, and contents of the two output files are all the same.
> I wonder:
> firstly, why does the data size increase after repartition/shuffle?
> secondly, if I need to shuffle the input dataframe, how can I save it as a
> parquet file efficiently to avoid the data size boom?
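
One mitigation commonly used for the second question (not an official answer from this ticket; the partition count and the "category" column below are placeholders): do the shuffle you need, then sort within each output partition on the low-cardinality columns that dominate the data, so each file again stores similar values together.

{code:scala}
// Hedged sketch: keep the repartition, but restore value locality per file.
val df = spark.read.parquet("/path/to/input")

df.repartition(200)                  // the shuffle that is actually needed
  .sortWithinPartitions("category")  // cheap local sort, no extra shuffle
  .write
  .mode("overwrite")
  .parquet("/path/to/output")
{code}

sortWithinPartitions adds no extra exchange, so the only added cost over a plain repartition-and-write is the per-partition sort.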






[jira] [Commented] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet

2019-12-22, Cesc (Jira)


[ https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002049#comment-17002049 ]

Cesc commented on SPARK-30316:
------------------------------

However, the rows of the two results are the same.

> data size boom after shuffle writing dataframe save as parquet
> --------------------------------------------------------------
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Affects Versions: 2.4.4
>Reporter: Cesc 
>Priority: Blocker
>
> When I read the same parquet file and then save it in two ways, with shuffle
> and without shuffle, I found that the sizes of the output parquet files are
> quite different. For example, given an original parquet file of 800 MB, if I
> save it without shuffle the size stays 800 MB, whereas if I repartition it and
> then save it in parquet format, the data size increases to 2.5 GB. The row
> counts, column counts, and contents of the two output files are all the same.
> I wonder:
> firstly, why does the data size increase after repartition/shuffle?
> secondly, if I need to shuffle the input dataframe, how can I save it as a
> parquet file efficiently to avoid the data size boom?






[jira] [Commented] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet

2019-12-21, Terry Kim (Jira)


[ https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001833#comment-17001833 ]

Terry Kim commented on SPARK-30316:
-----------------------------------

This is a plausible scenario: when you repartition/shuffle the data, the values
you are storing can be reordered so that the compression ratio becomes worse.
Parquet's dictionary and run-length encodings compress best when identical
values are stored next to each other, and a shuffle destroys exactly that
locality.
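
A toy illustration of that reordering effect in plain Scala (no Spark involved; the values are invented): run-length encode the same column of values once sorted and once shuffled.

{code:scala}
import scala.util.Random

// Collapse consecutive equal values into (value, runLength) pairs,
// roughly what Parquet's RLE does for repeated values.
def runLengthEncode(xs: Seq[Int]): List[(Int, Int)] =
  xs.foldLeft(List.empty[(Int, Int)]) {
    case ((v, n) :: tail, x) if v == x => (v, n + 1) :: tail
    case (acc, x)                      => (x, 1) :: acc
  }.reverse

val values   = Seq.fill(1000)(Seq(1, 2, 3)).flatten  // 3000 values, 3 distinct
val sorted   = values.sorted
val shuffled = Random.shuffle(values)

println(runLengthEncode(sorted).size)    // 3 runs: encodes in a few bytes
println(runLengthEncode(shuffled).size)  // ~2000 runs: barely compresses
{code}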

> data size boom after shuffle writing dataframe save as parquet
> --------------------------------------------------------------
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Affects Versions: 2.4.4
>Reporter: Cesc 
>Priority: Blocker
>
> When I read the same parquet file and then save it in two ways, with shuffle
> and without shuffle, I found that the sizes of the output parquet files are
> quite different. For example, given an original parquet file of 800 MB, if I
> save it without shuffle the size stays 800 MB, whereas if I repartition it and
> then save it in parquet format, the data size increases to 2.5 GB. The row
> counts, column counts, and contents of the two output files are all the same.
> I wonder:
> firstly, why does the data size increase after repartition/shuffle?
> secondly, if I need to shuffle the input dataframe, how can I save it as a
> parquet file efficiently to avoid the data size boom?


