[GitHub] [spark] cloud-fan commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox
cloud-fan commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-673945250 @agrawaldevesh @maropu @viirya any more comments? The benchmark shows that the previous "store matched bit in value payload" approach and the current "bitset/hashset"

[GitHub] [spark] cloud-fan commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-13 Thread GitBox
cloud-fan commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-673347928 retest this please

[GitHub] [spark] cloud-fan commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-12 Thread GitBox
cloud-fan commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-672640531 let's compare the overheads of these 2 approaches. The current approach (put "matched bit" in the value payload): 1. needs to do a project over the build side rows to

[GitHub] [spark] cloud-fan commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-11 Thread GitBox
cloud-fan commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-672165234 yea sounds good!

[GitHub] [spark] cloud-fan commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-11 Thread GitBox
cloud-fan commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-671804290 @c21 yea this is a hard problem. We can probably add a different iterator implementation in `BytesToBytesMap`, which iterates the `longArray` first, gets the key

[GitHub] [spark] cloud-fan commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-10 Thread GitBox
cloud-fan commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-671719927 A few more thoughts: 1. For the `keyIsUnique` code path, we know it's one key, one value, so I think we can still use a bitset. 2. We don't need to get the value index. We can
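
The unique-key idea in point 1 can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the code from the PR: `UniqueKeyFullOuterJoin`, `fullOuterJoin`, and the in-memory `keyToIndex` map are hypothetical stand-ins for the build-side hashed relation, and the extra join condition is left out. Because each key has exactly one build row, one bit per key index is enough to remember which build rows were matched.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of a full outer hash join when build-side keys are unique.
object UniqueKeyFullOuterJoin {
  def fullOuterJoin[K, L, R](
      stream: Seq[(K, L)],
      build: IndexedSeq[(K, R)]): Seq[(Option[L], Option[R])] = {
    // Assumption: every build key appears once, so the row position doubles as the key index.
    val keyToIndex: Map[K, Int] = build.iterator.map(_._1).zipWithIndex.toMap
    require(keyToIndex.size == build.size, "build keys must be unique")

    // One bit per build row, set when a stream row matches it.
    val matched = new mutable.BitSet(build.size)
    val out = mutable.ArrayBuffer.empty[(Option[L], Option[R])]

    // Probe phase: emit matched pairs, or the stream row padded with None.
    for ((k, l) <- stream) {
      keyToIndex.get(k) match {
        case Some(i) =>
          matched += i
          out += ((Some(l), Some(build(i)._2)))
        case None =>
          out += ((Some(l), None))
      }
    }

    // Final phase: emit build rows whose bit was never set.
    for (i <- build.indices if !matched(i)) out += ((None, Some(build(i)._2)))
    out.toSeq
  }
}
```

For example, probing `Seq(1 -> "a", 3 -> "c")` against the build side `Vector(1 -> "x", 2 -> "y")` produces the matched pair for key 1, a left-only row for key 3, and a right-only row for key 2.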

[GitHub] [spark] cloud-fan commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-10 Thread GitBox
cloud-fan commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-671519042 ah good point about one key multi value. How about we use a standard hash set and use `(keyIndex, value_index)` as the key?
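
A minimal sketch of the hash-set idea for the one-key-multi-value case, assuming a simplified in-memory build side. The names `MultiValueFullOuterJoin`, `fullOuterJoin`, `keyIndexOf`, and `values` are hypothetical and do not correspond to Spark's `HashedRelation` API. A matched build row is identified by the pair (index of its key, position of the row among that key's values), and a standard hash set records those pairs.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of a full outer hash join with duplicate build-side keys.
object MultiValueFullOuterJoin {
  def fullOuterJoin[K, L, R](
      stream: Seq[(K, L)],
      build: Seq[(K, R)],
      condition: (L, R) => Boolean): Seq[(Option[L], Option[R])] = {
    // Assign each distinct build key a dense keyIndex, in first-seen order.
    val keys: IndexedSeq[K] = build.map(_._1).distinct.toIndexedSeq
    val keyIndexOf: Map[K, Int] = keys.zipWithIndex.toMap
    // values(keyIndex)(valueIndex) is one build row.
    val grouped = build.groupBy(_._1)
    val values: IndexedSeq[IndexedSeq[R]] = keys.map(k => grouped(k).map(_._2).toIndexedSeq)

    // Matched build rows, recorded as (keyIndex, valueIndex).
    val matched = mutable.HashSet.empty[(Int, Int)]
    val out = mutable.ArrayBuffer.empty[(Option[L], Option[R])]

    // Probe phase: a stream row may match several build rows under the same key.
    for ((k, l) <- stream) {
      var found = false
      for (ki <- keyIndexOf.get(k); (r, vi) <- values(ki).zipWithIndex if condition(l, r)) {
        matched += ((ki, vi))
        out += ((Some(l), Some(r)))
        found = true
      }
      if (!found) out += ((Some(l), None))
    }

    // Final phase: emit every build row that no stream row matched.
    for (ki <- keys.indices; (r, vi) <- values(ki).zipWithIndex if !matched((ki, vi)))
      out += ((None, Some(r)))
    out.toSeq
  }
}
```

The hash set costs more per lookup than a bitset, but it works without a dense numbering of individual build rows, which is the difficulty with multiple values per key.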

[GitHub] [spark] cloud-fan commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-10 Thread GitBox
cloud-fan commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-671505935 sorry, I may be missing something. I thought it would be ``` for (row <- inputs) { val match = hashedRelation.get(getKey(row)) if (match != null && joinCondition(row,

[GitHub] [spark] cloud-fan commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-10 Thread GitBox
cloud-fan commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-671497466 Yea let's use standard bitset. It's new code path anyway and we can improve later.

[GitHub] [spark] cloud-fan commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-10 Thread GitBox
cloud-fan commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-671484895 > I am curious if the approach of storing the 'matched rows' out of band was considered ? The join algorithm could be extended to keep say an auxiliary struct of matched keys