cloud-fan commented on a change in pull request #34634:
URL: https://github.com/apache/spark/pull/34634#discussion_r755239272



##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala
##########
@@ -316,21 +316,25 @@ object ShufflePartitionsUtil extends Logging {
    * start of a partition.
    */
   // Visible for testing
-  private[sql] def splitSizeListByTargetSize(sizes: Seq[Long], targetSize: 
Long): Array[Int] = {
+  private[sql] def splitSizeListByTargetSize(
+      sizes: Seq[Long],
+      targetSize: Long,
+      minPartitionSize: Long): Array[Int] = {
     val partitionStartIndices = ArrayBuffer[Int]()
     partitionStartIndices += 0
     var i = 0
     var currentPartitionSize = 0L
     var lastPartitionSize = -1L
+    val minSize =
+      Math.min(targetSize, Math.max(minPartitionSize, targetSize * 
SMALL_PARTITION_FACTOR))
 
     def tryMergePartitions() = {
       // When we are going to start a new partition, it's possible that the 
current partition or
       // the previous partition is very small and it's better to merge the 
current partition into
       // the previous partition.
       val shouldMergePartitions = lastPartitionSize > -1 &&
         ((currentPartitionSize + lastPartitionSize) < targetSize * 
MERGED_PARTITION_FACTOR ||

Review comment:
       I see a conflict here:
   1. we want to make each partition smaller than `targetSize * 
MERGED_PARTITION_FACTOR`
   2. we want to make each partition larger than `targetSize * 
SMALL_PARTITION_FACTOR`
   
   Seems like we prioritize 2 over 1, but in some cases 
`SMALL_PARTITION_FACTOR` needs to be smaller.
   
   How about we make `SMALL_PARTITION_FACTOR` a parameter of the function? Then 
we can pass different values from skew join rule and rebalance rule. e.g. The 
value can be `0.1` in the rebalance rule, or even making it configurable.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to