[GitHub] [spark] cloud-fan commented on pull request #32210: [SPARK-32634][SQL] Introduce sort-based fallback for shuffled hash join (non-code-gen path)

2021-04-26 Thread GitBox
cloud-fan commented on pull request #32210: URL: https://github.com/apache/spark/pull/32210#issuecomment-826239123 After more thinking, I'm wondering if this is the right direction to go. Apparently falling back to SMJ wastes the partially-built hash map. If one partition is a bit

[GitHub] [spark] cloud-fan commented on pull request #32210: [SPARK-32634][SQL] Introduce sort-based fallback for shuffled hash join (non-code-gen path)

2021-04-24 Thread GitBox
cloud-fan commented on pull request #32210: URL: https://github.com/apache/spark/pull/32210#issuecomment-826239123 After more thinking, I'm wondering if this is the right direction to go. Apparently falling back to SMJ wastes the partially-built hash map. If one partition is a bit

[GitHub] [spark] cloud-fan commented on pull request #32210: [SPARK-32634][SQL] Introduce sort-based fallback for shuffled hash join (non-code-gen path)

2021-04-21 Thread GitBox
cloud-fan commented on pull request #32210: URL: https://github.com/apache/spark/pull/32210#issuecomment-824536267 retest this please

[GitHub] [spark] cloud-fan commented on pull request #32210: [SPARK-32634][SQL] Introduce sort-based fallback for shuffled hash join (non-code-gen path)

2021-04-21 Thread GitBox
cloud-fan commented on pull request #32210: URL: https://github.com/apache/spark/pull/32210#issuecomment-823919561 > We enabled shuffled hash join by default with this feature. In our environment, roughly 25% of sort merge join queries are now running with shuffled hash join after

[GitHub] [spark] cloud-fan commented on pull request #32210: [SPARK-32634][SQL] Introduce sort-based fallback for shuffled hash join (non-code-gen path)

2021-04-20 Thread GitBox
cloud-fan commented on pull request #32210: URL: https://github.com/apache/spark/pull/32210#issuecomment-823357554 I'm a bit worried about this solution: 1. sorting the stream side at runtime may lead to a slow query plan because the sort is not whole-stage-codegen-ed. 2. unlike SMJ,