cloud-fan commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-673945250
@agrawaldevesh @maropu @viirya any more comments?
The benchmark shows that the previous "store matched bit in value payload"
approach and the current "bitset/hashset"
cloud-fan commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-673347928
retest this please
This is an automated message from the Apache Git Service.
To respond to the message, please
cloud-fan commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-672640531
Let's compare the overheads of these two approaches.
The current approach (put "matched bit" in the value payload):
1. needs to do a project over the build side rows to
cloud-fan commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-672165234
yea sounds good!
cloud-fan commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671804290
@c21 yea this is a hard problem.
We can probably add a different iterator implementation in
`BytesToBytesMap`, which iterates the `longArray` first, gets the key address,
cloud-fan commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671719927
A few more thoughts:
1. For the `keyIsUnique` code path, we know it's one key, one value, so I think
we can still use a bitset.
2. We don't need to get the value index. We can calc
cloud-fan commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671519042
Ah, good point about one key with multiple values. How about we use a standard
hash set and use `(keyIndex, valueIndex)` as the key?
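The hash-set idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the PR: `MatchedSetSketch`, `buildSide`, `markMatched`, and `unmatchedRows` are hypothetical names, and a plain `Vector` stands in for the real hash table. It shows a set of (key index, value index) pairs recording which build-side rows were matched, even when one key holds several values.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a build-side hash table in which one key
// can hold several values. The outer index plays the role of the key index.
object MatchedSetSketch {
  val buildSide: Vector[Vector[String]] = Vector(
    Vector("a", "b"), // key index 0 holds two values
    Vector("c")       // key index 1 holds one value
  )

  // Out-of-band record of matched rows, keyed by (key index, value index).
  val matched = mutable.HashSet.empty[(Int, Int)]

  def markMatched(keyIndex: Int, valueIndex: Int): Unit =
    matched += ((keyIndex, valueIndex))

  // Build-side rows that were never marked as matched.
  def unmatchedRows: Seq[String] =
    for {
      (values, keyIndex) <- buildSide.zipWithIndex
      (value, valueIndex) <- values.zipWithIndex
      if !matched.contains((keyIndex, valueIndex))
    } yield value
}
```

Calling `MatchedSetSketch.markMatched(0, 1)` and then reading `unmatchedRows` leaves the rows at (0, 0) and (1, 0). The trade-off of this design is one boxed tuple per matched row, in exchange for not touching the stored values.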
cloud-fan commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671505935
Sorry, I may be missing something. I thought it would be
```
for (row <- inputs) {
val match = hashedRelation.get(getKey(row))
if (match != null && joinCondition(row, m
cloud-fan commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671497466
Yea, let's use a standard bitset. It's a new code path anyway and we can
improve it later.
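For illustration, a bitset-based version of the matched-row bookkeeping might look like the sketch below. It assumes one bit per build-side row index and uses the Scala standard library `mutable.BitSet` as a stand-in; which bitset class the PR actually uses is not stated here, and `MatchedBitSetSketch` and `unmatched` are hypothetical names.

```scala
import scala.collection.mutable

object MatchedBitSetSketch {
  // Returns the build-side rows whose index was never set in `matched`.
  // Assumes each row is addressed by a single dense integer index.
  def unmatched[T](buildRows: IndexedSeq[T], matched: mutable.BitSet): IndexedSeq[T] =
    buildRows.indices
      .filterNot(i => matched.contains(i))
      .map(i => buildRows(i))
}
```

During probing, the join would set the bit for each build-side row that matches; afterwards `unmatched` yields the remaining rows.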
cloud-fan commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671484895
> I am curious if the approach of storing the 'matched rows' out of band was
considered ? The join algorithm could be extended to keep say an auxiliary
struct of matched keys