[ https://issues.apache.org/jira/browse/SPARK-23771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16474326#comment-16474326 ]

Johannes Mayer commented on SPARK-23771:
----------------------------------------

I have tested it in Spark 2.2.0 and the issue still exists.

> Uneven Rowgroup size after repartition
> --------------------------------------
>
>                 Key: SPARK-23771
>                 URL: https://issues.apache.org/jira/browse/SPARK-23771
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, Shuffle, SQL
>    Affects Versions: 1.6.0
>         Environment: Cloudera CDH 5.13.1
>            Reporter: Johannes Mayer
>            Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I have a Hive table on AVRO files that I want to read and store as 
> partitioned Parquet files (one file per partition).
> What I do is:
> {code:java}
> // read the AVRO table and distribute by the partition column
> val data = sql("select * from avro_table distribute by part_col")
>  
> // write data as partitioned parquet files
> data.write.partitionBy("part_col").parquet("output/path/")
> {code}
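>  
> For reference, the same distribution can be expressed with the DataFrame 
> API. This is a minimal sketch, assuming Spark 2.x in a spark-shell session 
> where {{spark}} is the SparkSession; table and column names are the ones 
> from the example above:
> {code:java}
> import org.apache.spark.sql.functions.col
> 
> // shuffle all rows of one part_col value into the same task
> val data = spark.table("avro_table").repartition(col("part_col"))
> 
> // write one Parquet file per partition directory
> data.write.partitionBy("part_col").parquet("output/path/")
> {code}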
>  
> I get one file per partition as expected, but I often run into OutOfMemory 
> errors. Investigating the issue, I found that some row groups are very 
> big, and since all data of a row group is held in memory before it is flushed 
> to disk, I think this is what causes the OutOfMemory errors. Other row groups 
> are very small, containing almost no data. See the output of parquet-tools meta:
>  
> {code:java}
> row group 1: RC:5740100 TS:566954562 OFFSET:4 
> row group 2: RC:33769 TS:2904145 OFFSET:117971092 
> row group 3: RC:31822 TS:2772650 OFFSET:118905225 
> row group 4: RC:29854 TS:2704127 OFFSET:119793188 
> row group 5: RC:28050 TS:2356729 OFFSET:120660675 
> row group 6: RC:26507 TS:2111983 OFFSET:121406541 
> row group 7: RC:25143 TS:1967731 OFFSET:122069351 
> row group 8: RC:23876 TS:1991238 OFFSET:122682160 
> row group 9: RC:22584 TS:2069463 OFFSET:123303246 
> row group 10: RC:21225 TS:1955748 OFFSET:123960700 
> row group 11: RC:19960 TS:1931889 OFFSET:124575333 
> row group 12: RC:18806 TS:1725871 OFFSET:125132862 
> row group 13: RC:17719 TS:1653309 OFFSET:125668057 
> row group 14: RC:1617743 TS:157973949 OFFSET:134217728
> {code}
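>  
> (RC is the row count and TS the total byte size of each row group.) The same 
> numbers can also be read programmatically. A minimal sketch using the 
> parquet-hadoop footer API, assuming the org.apache.parquet namespace that 
> Spark 2.x bundles; the file path is only a placeholder:
> {code:java}
> import org.apache.hadoop.fs.Path
> import org.apache.parquet.hadoop.ParquetFileReader
> import scala.collection.JavaConverters._
> 
> // read the file footer and print per-row-group row counts and byte sizes
> val footer = ParquetFileReader.readFooter(
>   sc.hadoopConfiguration, new Path("output/path/part_col=x/part-00000.parquet"))
> footer.getBlocks.asScala.foreach { rg =>
>   println(s"rows=${rg.getRowCount} totalByteSize=${rg.getTotalByteSize}")
> }
> {code}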
>  
> One thing to notice is that this file was written by a Spark application 
> running on 13 executors. Is it possible that the local data ends up in the 
> big row group and the remote reads go into separate (small) row groups? A 
> shuffle is involved, because the data is read with a distribute by clause.
>  
> Is this a known bug? Is there a workaround to get even row group sizes? I 
> want to decrease the row group size using 
> sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024).
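>  
> In context, that workaround would look like the sketch below. The setting 
> must be applied before the write action runs; 64 MiB halves the 128 MiB 
> default, which matches the OFFSET of 134217728 seen above:
> {code:java}
> // cap Parquet row groups at 64 MiB instead of the 128 MiB default
> sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)
> 
> val data = sql("select * from avro_table distribute by part_col")
> data.write.partitionBy("part_col").parquet("output/path/")
> {code}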


