huaxingao commented on PR #34785:
URL: https://github.com/apache/spark/pull/34785#issuecomment-1132363307

   Thanks @aokolnychyi for the proposal. I agree that we should support both
strictly required and best-effort distributions. For a best-effort distribution,
if the user doesn't request a specific number of partitions, we will split skewed
partitions and coalesce small partitions. For a strictly required distribution,
if the user doesn't request a specific number of partitions, we will coalesce
small partitions but will NOT split skewed partitions, since splitting would
change the required distribution.
   
   In the `RequiresDistributionAndOrdering` interface, I will add:
   ```java
   // New API: a connector returns false to mark its distribution as best effort.
   default boolean distributionStrictlyRequired() { return true; }
   ```
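   On the connector side, opting into the best-effort mode is then just a matter of
overriding the new default method. Here is a minimal Scala sketch (the
`BestEffortWrite` class and the `bucket` column are hypothetical, and it assumes
the default method above has been added):
   ```scala
   import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
   import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortOrder}
   import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering

   // Hypothetical write that clusters output by a "bucket" column, treats the
   // distribution as best effort, and imposes no ordering.
   class BestEffortWrite extends RequiresDistributionAndOrdering {

     override def requiredDistribution(): Distribution =
       Distributions.clustered(Array[Expression](Expressions.column("bucket")))

     override def requiredOrdering(): Array[SortOrder] = Array.empty

     // Leave requiredNumPartitions() at its default of 0 and opt into the proposed
     // best-effort mode, so Spark would pick RebalancePartitions below.
     override def distributionStrictlyRequired(): Boolean = false
   }
   ```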
   Then in `DistributionAndOrderingUtils.prepareQuery`, I will change the code to
something like this:
   ```scala
         val queryWithDistribution = if (distribution.nonEmpty) {
           if (!write.distributionStrictlyRequired() && numPartitions == 0) {
             // best-effort distribution, no requested partition count:
             // rebalance so AQE can split skewed and coalesce small partitions
             RebalancePartitions(distribution, query)
           } else if (numPartitions > 0) {
             // a specific partition count was requested, so honor it exactly
             RepartitionByExpression(distribution, query, numPartitions)
           } else {
             // strictly required distribution, no requested partition count:
             // AQE may coalesce small partitions but must not split skewed ones
             RepartitionByExpression(distribution, query, None)
           }
           ...
   ```
   Basically, in the best-effort case, if the requested numPartitions is 0 we use
`RebalancePartitions`, so both `OptimizeSkewInRebalancePartitions` and
`CoalesceShufflePartitions` will be applied. In the strictly required distribution
case, if the requested numPartitions is 0 we use
`RepartitionByExpression(distribution, query, None)`, so
`CoalesceShufflePartitions` will be applied but skewed partitions will not be split.
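   Nothing changes on the user side; the connector's write path decides which node
is used. A quick end-to-end sketch (the `testcat.db.best_effort_table` name is
hypothetical and assumed to be backed by a write like the `BestEffortWrite` above,
with AQE enabled):
   ```scala
   // Adaptive execution must be enabled for OptimizeSkewInRebalancePartitions and
   // CoalesceShufflePartitions to kick in.
   spark.conf.set("spark.sql.adaptive.enabled", "true")

   spark.range(0, 1000000)
     .selectExpr("id % 8 AS bucket", "id AS value")
     .writeTo("testcat.db.best_effort_table") // hypothetical best-effort table
     .append()
   ```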
   
   Does this sound correct to everyone?