[GitHub] [spark] cloud-fan commented on pull request #32816: [SPARK-33832][SQL] Support optimize skewed join even if introduce extra shuffle
cloud-fan commented on pull request #32816:
URL: https://github.com/apache/spark/pull/32816#issuecomment-918002643

thanks, merging to master!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
cloud-fan commented on pull request #32816:
URL: https://github.com/apache/spark/pull/32816#issuecomment-906445777

cc @maryannxue to take another look
cloud-fan commented on pull request #32816:
URL: https://github.com/apache/spark/pull/32816#issuecomment-872439396

I think using the `CostEvaluator` to accept extra shuffles introduced by skew-join handling is a good idea. However, the current framework is too simple: we give up the entire re-optimization if the cost becomes higher, whereas we should interact with the re-optimization process to find the plan with the lowest cost. For example, if skew-join handling adds extra shuffles, we should give up only the skew-join handling, not the entire re-optimization.

My idea is that the re-optimization can produce two result plans: one with skew-join handling and one without. Then we compare their costs: if the plan with skew-join handling has a higher cost and force-apply is not enabled, we pick the other plan; otherwise, we pick the plan with skew-join handling. Then we don't need to change the `CostEvaluator`.
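The two-plan selection rule described above can be sketched as follows. This is a minimal illustration of the decision logic only, not Spark's actual AQE implementation; the `Plan` class, the numeric costs, and the `force_apply` flag are all hypothetical stand-ins (the flag loosely corresponds to a force-apply configuration for skew-join handling, and the cost would really come from a `CostEvaluator`).

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """Hypothetical stand-in for a re-optimized query plan."""
    description: str
    cost: int  # stand-in for the value a CostEvaluator would produce


def choose_plan(with_skew_handling: Plan, without: Plan, force_apply: bool) -> Plan:
    """Pick between the two re-optimization results.

    If skew-join handling made the plan more expensive (e.g. it introduced
    extra shuffles) and force-apply is not enabled, fall back to the plan
    without skew-join handling; otherwise keep the skew-handled plan.
    """
    if with_skew_handling.cost > without.cost and not force_apply:
        return without
    return with_skew_handling


plain = Plan("no skew handling", cost=10)
skewed = Plan("skew handling with extra shuffle", cost=15)

# Without force-apply, the cheaper plain plan wins.
print(choose_plan(skewed, plain, force_apply=False).description)
# With force-apply, skew handling is kept despite the higher cost.
print(choose_plan(skewed, plain, force_apply=True).description)
```

Note that under this scheme the cost comparison happens entirely outside the cost model: both candidate plans are evaluated with the unchanged `CostEvaluator`, which is why no evaluator changes are needed.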