aokolnychyi commented on PR #34785:
URL: https://github.com/apache/spark/pull/34785#issuecomment-1132014116

   Thanks for the PR, @huaxingao. I think it is a great feature and it would be 
awesome to get it done.
   
   I spent some time thinking about this and have a few questions/proposals.
   
   If I understand correctly, we currently hard-code the number of shuffle 
partitions in `RepartitionByExpression`, which prohibits both coalescing and 
skew split optimizations.
   
   It seems reasonable to support cases when the requested distribution is 
best-effort but I also think there are valid cases when the distribution is 
required for correctness and it is actually the current API contract. What 
about extending `RequiredDistributionAndOrdering` to indicate the distribution 
is not strictly required? We can add some boolean method and default it to keep 
the existing behavior. If the distribution is required, we can still benefit 
from coalescing as I think `CoalesceShufflePartitions` and `AQEShuffleReadExec` 
would keep the original distribution in coalesce cases. That’s already a huge 
win. We can avoid too small files while keeping the requested distribution.
   
   I also agree about using `RebalancePartitions` when the distribution is not 
strictly required. What about extending `RebalancePartitions` to also support 
range partitioning? It currently supports only hash and round-robin. If we make 
that change, we will be able to remove unnecessary shuffles in the optimizer 
and keep the original distribution as long as there is no skew and we only 
coalesce. If there is a skew, an extra shuffle and changed distribution seems 
like a reasonable overhead.
   
   What does everybody else think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to