cloud-fan commented on pull request #32816: URL: https://github.com/apache/spark/pull/32816#issuecomment-872439396

I think using the `CostEvaluator` to accept the extra shuffles introduced by skew-join handling is a good idea. However, the current framework is too simple: we give up the entire re-optimization if the cost becomes higher, whereas we should interact with the re-optimization process to find the plan with the lowest cost. For example, if skew-join handling adds extra shuffles, we should give up only the skew-join handling, not the entire re-optimization.

My idea is that re-optimization can produce two result plans: one with skew-join handling and one without. Then we compare their costs: if the plan with skew-join handling has a higher cost and force-apply is not enabled, we pick the other plan; otherwise, we pick the plan with skew-join handling. Then we don't need to change the `CostEvaluator`.
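The selection rule proposed above can be sketched as follows. This is a minimal, language-neutral illustration (in Python, though Spark itself is Scala); the `Plan` type, `cost` field, and `choose_plan` helper are hypothetical stand-ins, not the actual Spark AQE API.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """Hypothetical stand-in for a re-optimized physical plan and its cost."""
    description: str
    cost: int  # stand-in for a comparable Cost produced by a CostEvaluator


def choose_plan(with_skew_handling: Plan, without_skew_handling: Plan,
                force_apply: bool) -> Plan:
    # If skew-join handling made the plan more expensive and force-apply
    # is not enabled, fall back to the plan without skew-join handling.
    # Otherwise, keep the plan with skew-join handling.
    if with_skew_handling.cost > without_skew_handling.cost and not force_apply:
        return without_skew_handling
    return with_skew_handling
```

Note that this keeps the cost comparison outside the `CostEvaluator`: the evaluator only needs to cost each candidate plan, and the skew-join decision is a simple post-hoc choice between the two results.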
I think using the `CostEvaluator` to accept extra shuffles introduced by skew join handling is a good idea. However, the current framework is too simple: we just give up the entire re-optimization if the cost becomes higher, while we should try to interact with the re-optimization process to get the plan with the lowest cost. For example, if skew join handling adds extra shuffles, we should only give up skew join handling, instead of the entire re-optimization. My idea is, the re-optimization can produce 2 result plans: one with skew join handling and one without. Then we compare the cost: if the plan with skew join handling has a higher cost and force-apply is not enabled, we pick the other plan. Otherwise, we pick the plan with skew join handling. Then we don't need to change the `CostEvaluator`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org