[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644224#comment-17644224 ]

Apache Spark commented on SPARK-41386:
--------------------------------------

User 'Juerin-Dong' has created a pull request for this issue:
https://github.com/apache/spark/pull/38965

> There are some small files when using rebalance(column)
> --------------------------------------------------------
>
>                 Key: SPARK-41386
>                 URL: https://issues.apache.org/jira/browse/SPARK-41386
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.1
>            Reporter: Zhe Dong
>            Priority: Minor
>
> *Problem (REBALANCE(column))*:
> SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m")
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5")
> {noformat}
> So we expect every output file to be at least 20m * 0.5 = 10m.
> In fact, we got some smaller files, as the following listing shows:
> {noformat}
> -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff 12.1 M 2022-12-07 13:13 .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff  9.1 M 2022-12-07 13:13 .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r-- 1 jp28948 staff  3.0 M 2022-12-07 13:13 .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> {noformat}
> 9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small files in another way.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
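The expectation in the issue description (advisory size 20m times factor 0.5 gives a 10m floor) can be checked against the reported listing with a few lines of plain Scala. This is only a sketch: `SmallFileCheck` and `belowFloor` are hypothetical names introduced here, not Spark APIs, and it assumes the sizes in the listing use the same unit as the configured advisory size. On-disk sizes of snappy-compressed parquet files may also differ from the shuffle partition sizes that the advisory setting refers to, so the comparison is indicative at best.

```scala
// Hypothetical helper, not part of Spark: flags files below the expected floor.
object SmallFileCheck {
  // Values taken from the SparkSession config in the issue description.
  val advisorySize: Double = 20.0          // advisoryPartitionSizeInBytes = "20m"
  val smallPartitionFactor: Double = 0.5   // rebalancePartitionsSmallPartitionFactor
  val expectedFloor: Double = advisorySize * smallPartitionFactor

  // Sizes of part-0 .. part-5 as reported in the listing (unit as listed: "M").
  val observed: Seq[Double] = Seq(12.1, 12.1, 12.1, 12.1, 9.1, 3.0)

  // Returns (part index, size) for every file smaller than the floor.
  def belowFloor(sizes: Seq[Double], floor: Double): Seq[(Int, Double)] =
    sizes.zipWithIndex.collect { case (size, idx) if size < floor => (idx, size) }

  def main(args: Array[String]): Unit = {
    println(s"expected floor: $expectedFloor M")
    belowFloor(observed, expectedFloor).foreach { case (idx, size) =>
      println(s"part-$idx is $size M, below the floor")
    }
  }
}
```

Running `SmallFileCheck.main(Array.empty)` lists the files that fall under the floor, which for the reported sizes are part-4 and part-5.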
[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644175#comment-17644175 ]

Zhe Dong commented on SPARK-41386:
----------------------------------

Hi. [~podongfeng] That was my mistake. I removed it. sorry for that.
[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644144#comment-17644144 ]

Ruifeng Zheng commented on SPARK-41386:
---------------------------------------

[~dongz] I think this ticket is irrelevant to Spark-Connect?

>                 Key: SPARK-41386
>    Affects Versions: 3.4.0
[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644141#comment-17644141 ]

Zhe Dong commented on SPARK-41386:
----------------------------------

{noformat}
// Return the shuffle unchanged only when every partition already lies within
// [advisorySize * smallPartitionFactor, advisorySize].
if (mapStats.isEmpty ||
    mapStats.get.bytesByPartitionId.forall { b =>
      b <= advisorySize && b >= advisorySize * smallPartitionFactor
    }) {
  return shuffle
}

if (bytes > targetSize) {
  ...
} else if (bytes < targetSize * smallPartitionFactor) {
  // Partition is below the small-partition threshold.
  CoalescedPartitionSpec(reduceIndex, reduceIndex + 1, bytes) :: Nil
} else {
  return shuffle // dummy
}
{noformat}

>                 Key: SPARK-41386
>    Affects Versions: 3.4.0
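The branches proposed in the comment above can be modeled in isolation. The following is a minimal sketch in plain Scala with no Spark dependency; `RebalanceBranches`, `PartitionAction`, `classify`, and `allWithinRange` are hypothetical names introduced for illustration and do not exist in Spark. It only labels each partition size according to the three branches and does not show how Spark would split or coalesce anything.

```scala
// Illustrative model of the proposed branches; not Spark code.
object RebalanceBranches {
  // Hypothetical labels for the three branches in the proposal.
  sealed trait PartitionAction
  case object Split extends PartitionAction        // bytes > targetSize
  case object TooSmall extends PartitionAction     // bytes < targetSize * factor
  case object WithinRange extends PartitionAction  // everything in between

  def classify(bytes: Long, targetSize: Long, smallPartitionFactor: Double): PartitionAction =
    if (bytes > targetSize) Split
    else if (bytes < targetSize * smallPartitionFactor) TooSmall
    else WithinRange

  // Mirrors the early-exit guard: true when no partition is oversized or undersized.
  def allWithinRange(sizes: Seq[Long], targetSize: Long, smallPartitionFactor: Double): Boolean =
    sizes.forall(b => classify(b, targetSize, smallPartitionFactor) == WithinRange)

  def main(args: Array[String]): Unit = {
    val mib = 1024L * 1024L
    val target = 20L * mib                            // "20m" from the issue config
    val sizes = Seq(25L * mib, 12L * mib, 3L * mib)   // made-up partition sizes
    sizes.foreach(b => println(s"${b / mib} MiB -> ${classify(b, target, 0.5)}"))
    println(s"all within range: ${allWithinRange(sizes, target, 0.5)}")
  }
}
```

Keeping the classification as a pure function makes the boundary cases (exactly `targetSize`, exactly `targetSize * smallPartitionFactor`) easy to check before touching the optimizer rule itself.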
[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)
[ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644131#comment-17644131 ]

Zhe Dong commented on SPARK-41386:
----------------------------------

We may change this part to avoid files that are smaller than the advisory partition size multiplied by "spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor":
[https://github.com/apache/spark/blob/d9c7908f348fa7771182dca49fa032f6d1b689be/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewInRebalancePartitions.scala#L75]

>                 Key: SPARK-41386
>    Affects Versions: 3.4.0