[GitHub] [spark] wangshisan commented on pull request #29266: [SPARK-32464][SQL] Support skew handling on join that has one side wi…

2020-08-04 Thread GitBox


wangshisan commented on pull request #29266:
URL: https://github.com/apache/spark/pull/29266#issuecomment-668932395


   > Is this with AQE? If so, can we please add that to the description? It 
might also be nice to describe the approach taken to handle it there.
   
   Added.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangshisan commented on pull request #29266: [SPARK-32464][SQL] Support skew handling on join that has one side wi…

2020-08-04 Thread GitBox


wangshisan commented on pull request #29266:
URL: https://github.com/apache/spark/pull/29266#issuecomment-668926319


   > Yeah, I'm also wondering about the approach here. Skew join handling needs 
to split the skewed side and repeat the other side. I don't think we can split 
the buckets of a bucketed table, and I'm not sure how we are going to read 
buckets repeatedly from a bucketed table.
   
   Yes, that's right: we cannot split the bucketed table side. But we can 
duplicate the bucketed side by leveraging the RDD mechanism, so that the child 
RDD reads some parent partitions more than once.
   For instance, suppose we have an RDD A with partitions (0, 1, 2, 3) and we 
need to duplicate the second partition (partition 1). We can create a new RDD, 
B for example, with partitions (0, 1, 2, 3, 4), and guarantee the following 
mapping relationship: 
   - RDD B partition 0 <- RDD A partition 0
   - RDD B partition 1 <- RDD A partition 1
   - RDD B partition 2 <- RDD A partition 1
   - RDD B partition 3 <- RDD A partition 2
   - RDD B partition 4 <- RDD A partition 3
   
   This is what the new class RecombinationedRDD is designed for.
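
The child-to-parent partition mapping described above can be sketched in plain Python. This is only an illustration of the index arithmetic, not the PR's implementation and not Spark code: the helper name `build_partition_mapping` and the `copies` parameter are hypothetical, introduced here for the example.

```python
from typing import Dict, List


def build_partition_mapping(num_parent_partitions: int,
                            copies: Dict[int, int]) -> List[int]:
    """Return a list whose entry i is the parent partition read by child partition i.

    `copies` maps a parent partition index to the number of times it should
    appear in the child; partitions not listed appear once.
    Hypothetical helper, for illustration only.
    """
    mapping: List[int] = []
    for parent in range(num_parent_partitions):
        n = copies.get(parent, 1)
        if n < 1:
            raise ValueError(f"partition {parent} must appear at least once")
        # Emit the same parent index n times, keeping parent order.
        mapping.extend([parent] * n)
    return mapping


# RDD A has partitions (0, 1, 2, 3); partition 1 must be read twice.
mapping = build_partition_mapping(4, {1: 2})
for child, parent in enumerate(mapping):
    print(f"RDD B partition {child} <- RDD A partition {parent}")
```

Each child partition depends on exactly one parent partition, so in this sketch the dependency stays narrow and no shuffle would be needed to build the child.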
   






[GitHub] [spark] wangshisan commented on pull request #29266: [SPARK-32464][SQL] Support skew handling on join that has one side wi…

2020-08-02 Thread GitBox


wangshisan commented on pull request #29266:
URL: https://github.com/apache/spark/pull/29266#issuecomment-667772074


   @cloud-fan @JkSelf  Could you have a look?


