c21 commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671742514


   @cloud-fan, @agrawaldevesh, @maropu and @viirya -
   
   I took a closer look at `BytesToBytesMap.java` and found that it would 
probably be hard / hacky to get the key index while iterating over all values 
of the map. 
   
   After reading the stream side of the join, we need to iterate over all rows 
in the build side `BytesToBytesMap` to output the build side rows that have no 
match. To iterate over all rows in the map, `BytesToBytesMap` provides the method 
[`iterator()`](https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java#L412).
 This method iterates through all data (key-value pairs) in each memory page 
of `dataPages`. It only reads through the data pages and does not 
interact with the key index array `longArray` at all, so we cannot get the key 
index efficiently here.
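
   To make the problem concrete, here is a toy model of that layout. It is not 
Spark code: `ToyPagedMap`, `records`, `slots` and `scanKeys()` are hypothetical 
names, with `records` standing in for `dataPages` and `slots` for `longArray`. 
It assumes linear probing and distinct keys, purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy stand-in for an append-only record store plus a separate hash index. */
final class ToyPagedMap {
    // Stand-in for dataPages: {key, value} records in insertion order.
    final List<long[]> records = new ArrayList<>();
    // Stand-in for longArray: slot -> record number, or -1 when the slot is empty.
    final int[] slots;

    ToyPagedMap(int capacity) {
        slots = new int[capacity];
        Arrays.fill(slots, -1);
    }

    /** Appends a record and returns the slot (the "key index") it was assigned. */
    int put(long key, long value) {
        if (records.size() == slots.length) {
            throw new IllegalStateException("index full");
        }
        int slot = Math.floorMod(Long.hashCode(key), slots.length);
        while (slots[slot] != -1) {
            slot = (slot + 1) % slots.length; // linear probing (an assumption)
        }
        records.add(new long[] {key, value});
        slots[slot] = records.size() - 1;
        return slot;
    }

    /** Walks the record store only, like a page scan; it never reads slots. */
    List<Long> scanKeys() {
        List<Long> out = new ArrayList<>();
        for (long[] record : records) {
            out.add(record[0]);
        }
        return out;
    }
}
```

   In this model the scan yields keys in insertion order, while the slot each 
key occupies is known only to `slots`. Nothing in a record points back to its 
slot, which is the same gap we would hit with a page-only scan.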
   
   A workaround would be to call `lookup(key, ...)` on the map again for every 
key-value pair returned by `MapIterator`, to get its key index. But that 
requires an extra probe for every row on the build side, which seems clearly 
inefficient.
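
   A rough sketch of what that workaround costs, again on a toy structure 
rather than the real `BytesToBytesMap`: `ReprobeSketch`, `slotsForScan()` and 
the `probes` counter are hypothetical, and the sketch assumes linear probing, 
distinct keys, and an index with at least one empty slot.

```java
import java.util.Arrays;

/** Toy sketch of "look up every scanned row again" to recover its slot. */
final class ReprobeSketch {
    final long[] keys;  // keys in scan order, standing in for the data pages
    final int[] slots;  // slot -> position in keys, or -1 when empty
    long probes = 0;    // number of slots inspected by lookup()

    ReprobeSketch(long[] keysInScanOrder, int capacity) {
        if (keysInScanOrder.length >= capacity) {
            throw new IllegalArgumentException("need at least one empty slot");
        }
        keys = keysInScanOrder.clone();
        slots = new int[capacity];
        Arrays.fill(slots, -1);
        for (int i = 0; i < keys.length; i++) {
            int slot = home(keys[i]);
            while (slots[slot] != -1) {
                slot = (slot + 1) % capacity; // linear probing (an assumption)
            }
            slots[slot] = i;
        }
    }

    int home(long key) {
        return Math.floorMod(Long.hashCode(key), slots.length);
    }

    /** Probes the index for key and returns its slot, or -1 if absent. */
    int lookup(long key) {
        int slot = home(key);
        while (true) {
            probes++;
            int record = slots[slot];
            if (record == -1) {
                return -1;
            }
            if (keys[record] == key) {
                return slot;
            }
            slot = (slot + 1) % slots.length;
        }
    }

    /** Recovers the slot of every scanned row with one lookup per row. */
    int[] slotsForScan() {
        int[] out = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            out[i] = lookup(keys[i]);
        }
        return out;
    }
}
```

   With keys 7, 15 and 23 in an 8-slot index, all three hash to the same home 
slot, so recovering their slots takes 1 + 2 + 3 = 6 probes for 3 rows. The real 
map would differ in the details, but the per-row probe is the cost in question.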
   
   What do you think? Thanks.


