[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644224#comment-17644224
 ] 

Apache Spark commented on SPARK-41386:
--

User 'Juerin-Dong' has created a pull request for this issue:
https://github.com/apache/spark/pull/38965

> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column))*:
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m")
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5")
> {noformat}
> So we expect every output file to be at least 20m * 0.5 = 10m.
> In fact, we got some smaller files, as the following listing shows:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> {noformat}
> 9.1 M and 3.0 M are smaller than 10 M, so we have to handle these small files
> in another way.
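
The arithmetic in the description can be checked with a small standalone sketch. It is illustrative only and is not Spark code: the object and value names are hypothetical, and the sizes are copied by hand from the listing above. Note also that the two settings describe shuffle partition sizes, so comparing them with on-disk Parquet file sizes is only an approximation.

```scala
// Illustrative check, not Spark code: which of the listed files fall below
// advisoryPartitionSizeInBytes * rebalancePartitionsSmallPartitionFactor?
object SmallFileCheck {
  // spark.sql.adaptive.advisoryPartitionSizeInBytes = 20m
  val advisorySizeMb: Double = 20.0
  // spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor = 0.5
  val smallPartitionFactor: Double = 0.5
  // Minimum size the reporter expects for each output file, in MB.
  val minExpectedMb: Double = advisorySizeMb * smallPartitionFactor

  // File sizes in MB, copied from the listing (part-0 through part-5).
  val fileSizesMb: Seq[Double] = Seq(12.1, 12.1, 12.1, 12.1, 9.1, 3.0)

  // Indices of the files that are smaller than the expected minimum.
  def undersized: Seq[Int] =
    fileSizesMb.zipWithIndex.collect { case (mb, i) if mb < minExpectedMb => i }

  def main(args: Array[String]): Unit = {
    println(s"expected minimum: $minExpectedMb MB")
    undersized.foreach(i => println(s"part-$i is undersized: ${fileSizesMb(i)} MB"))
  }
}
```

Running it flags part-4 and part-5, the two files the description calls out.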



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644223#comment-17644223
 ] 

Apache Spark commented on SPARK-41386:
--

User 'Juerin-Dong' has created a pull request for this issue:
https://github.com/apache/spark/pull/38965







[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644175#comment-17644175
 ] 

Zhe Dong commented on SPARK-41386:
--

Hi [~podongfeng],

That was my mistake. I removed it. Sorry about that.







[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644144#comment-17644144
 ] 

Ruifeng Zheng commented on SPARK-41386:
---

[~dongz] I think this ticket is unrelated to Spark Connect?

> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Zhe Dong
>Priority: Minor
>






[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644141#comment-17644141
 ] 

Zhe Dong commented on SPARK-41386:
--

 
{noformat}
    // Skip the rule only when every partition already lies within
    // [advisorySize * smallPartitionFactor, advisorySize].
    if (mapStats.isEmpty ||
      mapStats.get.bytesByPartitionId.forall { bytes =>
        bytes <= advisorySize && bytes >= advisorySize * smallPartitionFactor
      }) {
      return shuffle
    }

    // Per-partition handling:
    if (bytes > targetSize) {
      ...
    } else if (bytes < targetSize * smallPartitionFactor) {
      CoalescedPartitionSpec(reduceIndex, reduceIndex + 1, bytes) :: Nil
    } else {
      return shuffle // dummy
    }
{noformat}
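
The three-way size check in the snippet above can be tried in isolation with the following sketch. It is a minimal, self-contained illustration under stated assumptions: `RebalanceSketch`, `Action`, `classify` and `canSkip` are hypothetical names, not part of Spark, and the sketch only classifies partitions. It does not build any partition specs and is not the code in OptimizeSkewInRebalancePartitions.

```scala
// Hypothetical, self-contained sketch of the size checks discussed above.
object RebalanceSketch {
  sealed trait Action
  case object Split extends Action    // bytes > targetSize
  case object TooSmall extends Action // bytes < targetSize * smallPartitionFactor
  case object Keep extends Action     // already within the desired range

  // Classify one partition by its size in bytes.
  def classify(bytes: Long, targetSize: Long, smallPartitionFactor: Double): Action =
    if (bytes > targetSize) Split
    else if (bytes < targetSize * smallPartitionFactor) TooSmall
    else Keep

  // True when every partition is within range, so nothing needs to change.
  def canSkip(bytesByPartitionId: Seq[Long], targetSize: Long,
      smallPartitionFactor: Double): Boolean =
    bytesByPartitionId.forall(b => classify(b, targetSize, smallPartitionFactor) == Keep)

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // Sizes loosely modelled on the listing in the issue description.
    val sizes = Seq(12 * mb, 12 * mb, 9 * mb, 3 * mb)
    sizes.foreach(b => println(s"${b / mb} MB -> ${classify(b, 20 * mb, 0.5)}"))
    println(s"can skip: ${canSkip(sizes, 20 * mb, 0.5)}")
  }
}
```

With a 20 MB target and a factor of 0.5, the 9 MB and 3 MB partitions are classified as too small, so an early return based only on the upper bound would leave them untouched.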
 

 







[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-06 Thread Zhe Dong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644131#comment-17644131
 ] 

Zhe Dong commented on SPARK-41386:
--

We could change this part to avoid files that are smaller than the threshold
implied by "spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor":

[https://github.com/apache/spark/blob/d9c7908f348fa7771182dca49fa032f6d1b689be/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewInRebalancePartitions.scala#L75]
 



