[GitHub] [spark] c21 commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

GitBox Mon, 10 Aug 2020 22:55:09 -0700


c21 commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671742514

@cloud-fan, @agrawaldevesh, @maropu and @viirya -

I took a closer look inside `BytesToBytesMap.java`, and found it would
probably be hard / hacky to get the key index when iterating over all values of the map.

After reading the stream side of the join, we need to iterate over all rows in
the build-side `BytesToBytesMap` to output the build-side rows that have no
match. To iterate over all rows in the map, `BytesToBytesMap` provides the method
[`iterator()`](https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java#L412).
This method walks through all data (key-value pairs) in each memory page
of `dataPages`. It only reads the data pages and does not interact with the
key index array `longArray` at all, so we cannot get the key index efficiently
here.

A workaround would be to call `lookup(key, ...)` on the map again for every
key-value pair returned by `MapIterator`, to get its key index. But that
requires one extra probe for every row on the build side, which seems clearly
inefficient.
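
To make the cost concrete, here is a minimal toy sketch in Java. It is NOT
Spark's `BytesToBytesMap`: the class `ToyAppendOnlyMap`, its methods, and the
`BitSet` of matched slots are all hypothetical names made up for illustration.
It only mimics the shape of the problem: records live in an append-only list
(standing in for `dataPages`), a separate slot array points at them (standing
in for `longArray`), and iterating over the records gives no slot index unless
we probe again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Toy model only; every name here is hypothetical and not part of Spark.
final class ToyAppendOnlyMap {
    // Stand-in for dataPages: append-only {key, value} records.
    private final List<long[]> records = new ArrayList<>();
    // Stand-in for longArray: slot -> position in records, -1 if empty.
    private final int[] slots;
    // Number of slots examined so far, to make the probing cost visible.
    private long probes = 0;

    ToyAppendOnlyMap(int capacity) {
        slots = new int[capacity];
        Arrays.fill(slots, -1);
    }

    // Linear probing: returns the slot holding `key`, or the first empty slot.
    private int probe(long key) {
        int slot = Math.floorMod(Long.hashCode(key), slots.length);
        while (slots[slot] != -1 && records.get(slots[slot])[0] != key) {
            probes++;
            slot = (slot + 1) % slots.length;
        }
        probes++;
        return slot;
    }

    void put(long key, long value) {
        int slot = probe(key);
        if (slots[slot] != -1) {
            records.get(slots[slot])[1] = value; // key exists: overwrite
            return;
        }
        // Keep one slot empty so that probing always terminates.
        if (records.size() >= slots.length - 1) {
            throw new IllegalStateException("toy map is full");
        }
        slots[slot] = records.size();
        records.add(new long[] {key, value});
    }

    // Returns the "key index" (slot) of `key`, or -1 if the key is absent.
    int lookup(long key) {
        int slot = probe(key);
        return slots[slot] == -1 ? -1 : slot;
    }

    // Iterates the records like a page scan. The scan itself never sees a
    // slot index, so each record needs one more lookup to check the bitset.
    List<Long> unmatchedKeys(BitSet matchedSlots) {
        List<Long> out = new ArrayList<>();
        for (long[] record : records) {
            int slot = lookup(record[0]); // the extra probe per build-side row
            if (!matchedSlots.get(slot)) {
                out.add(record[0]);
            }
        }
        return out;
    }

    long probeCount() {
        return probes;
    }

    public static void main(String[] args) {
        ToyAppendOnlyMap map = new ToyAppendOnlyMap(8);
        for (long k = 1; k <= 3; k++) {
            map.put(k, k * 10);
        }
        // "Stream side": key 2 finds a match, so mark its slot.
        BitSet matched = new BitSet();
        int slot = map.lookup(2L);
        if (slot >= 0) {
            matched.set(slot);
        }
        long before = map.probeCount();
        List<Long> unmatched = map.unmatchedKeys(matched);
        System.out.println(unmatched + ", extra probes: " + (map.probeCount() - before));
    }
}
```

In this toy, the final pass over the build side costs at least one probe per
record on top of the scan itself, which is the overhead described above.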

What do you think? Thanks.

