[ https://issues.apache.org/jira/browse/SPARK-23771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcelo Vanzin updated SPARK-23771:
-----------------------------------
    Priority: Major  (was: Critical)

> Uneven Rowgroup size after repartition
> --------------------------------------
>
>                 Key: SPARK-23771
>                 URL: https://issues.apache.org/jira/browse/SPARK-23771
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, Shuffle, SQL
>    Affects Versions: 1.6.0
>         Environment: Cloudera CDH 5.13.1
>            Reporter: Johannes Mayer
>            Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I have a Hive table on AVRO files that I want to read and store as partitioned Parquet files (one file per partition).
> What I do is:
>
> {code:java}
> // read the AVRO table and distribute by the partition column
> val data = sql("select * from avro_table distribute by part_col")
>
> // write data as partitioned parquet files
> data.write.partitionBy("part_col").parquet("output/path/")
> {code}
>
> I get one file per partition as expected, but I often run into OutOfMemory errors. Investigating the issue, I found that some row groups are very big, and since all data of a row group is held in memory before it is flushed to disk, I think this causes the OutOfMemory. Other row groups are very small, containing almost no data.
> See the output from parquet-tools meta:
>
> {code:java}
> row group 1:  RC:5740100 TS:566954562 OFFSET:4
> row group 2:  RC:33769   TS:2904145   OFFSET:117971092
> row group 3:  RC:31822   TS:2772650   OFFSET:118905225
> row group 4:  RC:29854   TS:2704127   OFFSET:119793188
> row group 5:  RC:28050   TS:2356729   OFFSET:120660675
> row group 6:  RC:26507   TS:2111983   OFFSET:121406541
> row group 7:  RC:25143   TS:1967731   OFFSET:122069351
> row group 8:  RC:23876   TS:1991238   OFFSET:122682160
> row group 9:  RC:22584   TS:2069463   OFFSET:123303246
> row group 10: RC:21225   TS:1955748   OFFSET:123960700
> row group 11: RC:19960   TS:1931889   OFFSET:124575333
> row group 12: RC:18806   TS:1725871   OFFSET:125132862
> row group 13: RC:17719   TS:1653309   OFFSET:125668057
> row group 14: RC:1617743 TS:157973949 OFFSET:134217728
> {code}
>
> One thing to notice is that this file was written by a Spark application running on 13 executors. Is it possible that the local data ends up in the big row group while the remote reads go into separate (small) row groups? The shuffle is involved because the data is read with a distribute by clause.
>
> Is this a known bug? Is there a workaround to get even row group sizes? I want to decrease the row group size using
> sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
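[Editorial note: the workaround the reporter mentions can be sketched as a spark-shell session in the Spark 1.6-era style the report itself uses. `sc`, `sqlContext`, the table name `avro_table`, the column `part_col`, and the output path are all taken from or modeled on the report; this is an illustrative sketch, not a fix confirmed by the Spark developers.]

```scala
// spark-shell sketch: lower parquet.block.size before writing, so each
// Parquet row group is capped at 64 MB instead of the 128 MB default and
// large row groups hold less data in executor memory before flushing.
// `sc` and `sqlContext` are the shell-provided SparkContext and SQLContext.
sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)

// Same read/write pattern as in the report: the "distribute by" shuffle
// routes each partition's rows to one task, yielding one file per partition.
val data = sqlContext.sql("select * from avro_table distribute by part_col")
data.write.partitionBy("part_col").parquet("output/path/")
```

Note that this only bounds the size of each row group; it does not by itself make the row groups within one file evenly sized.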