[GitHub] [spark] wangshisan commented on pull request #29266: [SPARK-32464][SQL] Support skew handling on join that has one side wi…
wangshisan commented on pull request #29266:
URL: https://github.com/apache/spark/pull/29266#issuecomment-668932395

> this is with AQE? if so can we please add that to description and it might be nice to describe approach taken to handle it in description as well.

Added.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangshisan commented on pull request #29266: [SPARK-32464][SQL] Support skew handling on join that has one side wi…
wangshisan commented on pull request #29266:
URL: https://github.com/apache/spark/pull/29266#issuecomment-668926319

> Yea I'm also wondering the approach here. The skew join handling needs to split the skew side, and repeat the other side. I don't think we can split the buckets of bucketed table, and I'm not sure how we are going to read buckets repeatedly from a bucketed table.

Yeah, that's right, we cannot split the bucketed table side. But we can duplicate the bucketed side by leveraging the RDD mechanism: the child RDD reads some parent partitions more than once. For instance, suppose RDD A has partitions (0, 1, 2, 3) and we need to duplicate the second partition (partition 1). We can create a new RDD, B for example, with partitions (0, 1, 2, 3, 4), and guarantee the following mapping:

- RDD B partition 0 <- RDD A partition 0
- RDD B partition 1 <- RDD A partition 1
- RDD B partition 2 <- RDD A partition 1
- RDD B partition 3 <- RDD A partition 2
- RDD B partition 4 <- RDD A partition 3

This is what the new class RecombinationedRDD is designed for.
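The partition mapping in the example above can be sketched in plain Python. This is only an illustration of the index arithmetic, not the code from the PR: the function names `build_partition_mapping` and `read_child_partition` are hypothetical, parent partitions are modelled as in-memory lists, and the actual RecombinationedRDD may be structured differently.

```python
from typing import Dict, Iterator, List, Sequence


def build_partition_mapping(num_parent_partitions: int,
                            copies: Dict[int, int]) -> List[int]:
    """Return a list whose i-th entry is the parent partition read by child partition i.

    `copies` maps a parent partition index to how many child partitions
    should read it; parents that are not listed are read exactly once.
    (Hypothetical helper, for illustration only.)
    """
    mapping: List[int] = []
    for parent in range(num_parent_partitions):
        count = copies.get(parent, 1)
        if count < 1:
            raise ValueError("every parent partition must be read at least once")
        mapping.extend([parent] * count)
    return mapping


def read_child_partition(parent_partitions: Sequence[Sequence[str]],
                         mapping: Sequence[int],
                         child_index: int) -> Iterator[str]:
    """Read a child partition by delegating to the parent partition it maps to."""
    return iter(parent_partitions[mapping[child_index]])


# RDD A has partitions (0, 1, 2, 3); partition 1 must be read twice.
mapping = build_partition_mapping(4, {1: 2})
print(mapping)  # [0, 1, 1, 2, 3]

# Toy data standing in for the rows of each parent partition.
rdd_a = [["a0"], ["a1"], ["a2"], ["a3"]]
print(list(read_child_partition(rdd_a, mapping, 2)))  # ['a1']
```

Each child partition depends on exactly one parent partition, so in this sketch the duplication stays a narrow dependency and needs no extra shuffle on the bucketed side.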
[GitHub] [spark] wangshisan commented on pull request #29266: [SPARK-32464][SQL] Support skew handling on join that has one side wi…
wangshisan commented on pull request #29266:
URL: https://github.com/apache/spark/pull/29266#issuecomment-667772074

@cloud-fan @JkSelf Could you have a look?