huaxingao commented on PR #34785: URL: https://github.com/apache/spark/pull/34785#issuecomment-1132363307
Thanks @aokolnychyi for the proposal. I agree that we should support both a strictly required distribution and a best-effort distribution.

- Best-effort distribution: if the user doesn't request a specific number of partitions, we will split skewed partitions and coalesce small partitions.
- Strictly required distribution: if the user doesn't request a specific number of partitions, we will coalesce small partitions but will NOT split skewed partitions, since splitting would change the required distribution.

In the `RequiresDistributionAndOrdering` interface, I will add:

```java
default boolean distributionStrictlyRequired() { return true; }
```

Then in `DistributionAndOrderingUtils.prepareQuery`, I will change the code to something like this:

```scala
val queryWithDistribution = if (distribution.nonEmpty) {
  if (!write.distributionStrictlyRequired() && numPartitions == 0) {
    RebalancePartitions(distribution, query)
  } else {
    if (numPartitions > 0) {
      RepartitionByExpression(distribution, query, numPartitions)
    } else {
      RepartitionByExpression(distribution, query, None)
    }
  }
  ...
```

Basically, in the best-effort case, if the requested numPartitions is 0, we will use `RebalancePartitions` so that both `OptimizeSkewInRebalancePartitions` and `CoalesceShufflePartitions` are applied. In the strictly required case, if the requested numPartitions is 0, we will use `RepartitionByExpression(distribution, query, None)` so that only `CoalesceShufflePartitions` is applied.

Does this sound correct to everyone?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
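For clarity, the branching above can be summarized as a small decision table. This is a hypothetical, simplified sketch (the `ShuffleKind` enum and `choose` method are illustrative names, not the actual Spark classes); it only models which shuffle operator is selected, treating `numPartitions == 0` as "no explicit request":

```java
// Hypothetical sketch of the shuffle-selection logic described above.
// REBALANCE        -> RebalancePartitions (skew split + coalesce)
// REPARTITION_EXACT-> RepartitionByExpression with a fixed partition count
// REPARTITION_AUTO -> RepartitionByExpression(.., None) (coalesce only)
enum ShuffleKind { REBALANCE, REPARTITION_EXACT, REPARTITION_AUTO }

class ShuffleChooser {
    // numPartitions == 0 means the user did not request a specific count
    static ShuffleKind choose(boolean strictlyRequired, int numPartitions) {
        if (!strictlyRequired && numPartitions == 0) {
            return ShuffleKind.REBALANCE;
        }
        if (numPartitions > 0) {
            return ShuffleKind.REPARTITION_EXACT;
        }
        return ShuffleKind.REPARTITION_AUTO;
    }
}
```

Note that when `numPartitions > 0`, both the strict and best-effort cases end up in the exact-repartition branch, since an explicit partition count rules out both skew splitting and coalescing.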
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org