[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhe Dong updated SPARK-41386:
-----------------------------
    Description:
*Problem (REBALANCE(column)):*

SparkSession config:
{noformat}
config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m")
config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5"){noformat}
So we expected every file to be at least 20m * 0.5 = 10m in size.

In fact, however, we got some small files, as in the following listing:
{noformat}
-rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-00000-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-00001-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-00002-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-00003-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r-- 1 jp28948 staff 9.1 M 2022-12-07 13:13 .../part-00004-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
-rw-r--r-- 1 jp28948 staff 3.0 M 2022-12-07 13:13 .../part-00005-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
The 9.1 M and 3.0 M files are smaller than 10 M, so we have to handle these small files in another way.

> There are some small files when using rebalance(column)
> -------------------------------------------------------
>
>                 Key: SPARK-41386
>                 URL: https://issues.apache.org/jira/browse/SPARK-41386
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Zhe Dong
>            Priority: Minor
>
> *Problem (REBALANCE(column)):*
> SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m")
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5"){noformat}
> So we expected every file to be at least 20m * 0.5 = 10m in size.
> In fact, however, we got some small files, as in the following listing:
> {noformat}
> -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-00000-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-00001-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-00002-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-00003-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 9.1 M 2022-12-07 13:13 .../part-00004-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 3.0 M 2022-12-07 13:13 .../part-00005-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> The 9.1 M and 3.0 M files are smaller than 10 M, so we have to handle these small files in another way.
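
The arithmetic behind the report can be sketched in plain Python: multiply the advisory partition size by the small-partition factor, then compare each listed file size against the result. This is only an illustration of the expectation stated in the description, not of how Spark itself applies these settings. The helper name `parse_size_m` and the variable names are hypothetical, and the sketch assumes the "M" in the file listing and the "m" in the setting denote the same unit.

```python
def parse_size_m(text: str) -> float:
    """Parse a size such as '12.1 M' (listing) or '20m' (setting) into megabytes.

    Hypothetical helper for this sketch; only megabyte suffixes are handled.
    """
    cleaned = text.strip().lower().replace(" ", "")
    if not cleaned.endswith("m"):
        raise ValueError(f"only megabyte sizes are handled in this sketch: {text!r}")
    return float(cleaned[:-1])


# Values copied from the SparkSession config in the report.
ADVISORY_PARTITION_SIZE = "20m"  # spark.sql.adaptive.advisoryPartitionSizeInBytes
SMALL_PARTITION_FACTOR = 0.5     # spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor

# Expected lower bound on file size according to the report: 20m * 0.5 = 10m.
threshold_m = parse_size_m(ADVISORY_PARTITION_SIZE) * SMALL_PARTITION_FACTOR

# (file name prefix, size) pairs as they appear in the listing.
listing = [
    ("part-00000", "12.1 M"),
    ("part-00001", "12.1 M"),
    ("part-00002", "12.1 M"),
    ("part-00003", "12.1 M"),
    ("part-00004", "9.1 M"),
    ("part-00005", "3.0 M"),
]

# Files that fall below the expected lower bound.
small_files = [
    (name, size) for name, size in listing if parse_size_m(size) < threshold_m
]

print(f"expected minimum size: {threshold_m} M")
for name, size in small_files:
    print(f"{name}: {size} is below the expected minimum")
```

Run against the listing above, the check flags part-00004 and part-00005, the two files the report calls out.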