[GitHub] [spark] wangshisan edited a comment on pull request #29266: [SPARK-32464][SQL] Support skew handling on join that has one side wi…

GitBox Tue, 04 Aug 2020 19:00:07 -0700


wangshisan edited a comment on pull request #29266:
URL: https://github.com/apache/spark/pull/29266#issuecomment-668926319



   > Yea I'm also wondering the approach here. The skew join handling needs to 
   > split the skew side, and repeat the other side. I don't think we can split the 
   > buckets of bucketed table, and I'm not sure how we are going to read buckets 
   > repeatedly from a bucketed table.
   
   Yeah, that's right: we cannot split the bucketed table side. But we can 
duplicate the bucketed side by leveraging the RDD dependency mechanism, letting 
several partitions of the child RDD read the same parent partition.
   For instance, suppose we have an RDD A with partitions (0, 1, 2, 3), and we 
need to duplicate the second partition (partition 1). We can create a new RDD, 
say B, with partitions (0, 1, 2, 3, 4), and guarantee the following dependency 
relationship: 
   - RDD B partition 0 <- RDD A partition 0
   - RDD B partition 1 <- RDD A partition 1
   - RDD B partition 2 <- RDD A partition 1
   - RDD B partition 3 <- RDD A partition 2
   - RDD B partition 4 <- RDD A partition 3
   
   This is what the new class RecombinationedRDD is designed for.
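
To make the mapping above concrete, here is a minimal plain-Python sketch of the index bookkeeping only. It is not Spark code and not the RecombinationedRDD implementation from this PR; the function name `duplicate_partitions` and its `copies` parameter are hypothetical, introduced purely for illustration.

```python
from typing import Dict, List


def duplicate_partitions(num_parent_partitions: int,
                         copies: Dict[int, int]) -> List[int]:
    """Return, for each child partition index, the parent partition it reads.

    `copies` (hypothetical parameter) maps a parent partition index to the
    number of times it should appear in the child; missing entries mean 1.
    """
    child_to_parent: List[int] = []
    for parent in range(num_parent_partitions):
        count = copies.get(parent, 1)
        if count < 1:
            raise ValueError("each parent partition must appear at least once")
        # Emit `count` consecutive child partitions that all read `parent`.
        child_to_parent.extend([parent] * count)
    return child_to_parent


# RDD A has partitions (0, 1, 2, 3); partition 1 is read twice.
mapping = duplicate_partitions(4, {1: 2})
for child, parent in enumerate(mapping):
    print(f"RDD B partition {child} <- RDD A partition {parent}")
```

In this model every child partition reads exactly one parent partition, so the duplication should be expressible through the RDD dependency relationship alone, without touching the bucketed data itself.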
   


