[GitHub] [spark] c21 commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

GitBox Tue, 11 Aug 2020 11:00:45 -0700


c21 commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-672142777



   Thanks @cloud-fan and @maropu for the feedback and discussion.
   Here is the new proposed change:
   
   * `BytesToBytesMap.java`:
   Add a new iterator implementation that iterates over the key index array 
`longArray` and outputs key-value pairs.
   `public MapIteratorWithKeyIndex iteratorWithKeyIndex()`
   
   * `HashedRelation.scala`:
   Add two new methods to `HashedRelation`:
     * Get the values for the specified key, along with the key index.
   `def getWithKeyIndex(key: InternalRow): (Int, Iterator[InternalRow])`
   
     * Get all values from the map, each with its key index.
   `def valuesWithKeyIndex(): Iterator[(Int, InternalRow)]`
   
     * I plan to implement both methods for `UnsafeHashedRelation` in this PR, 
and leave `LongHashedRelation` throwing `UnsupportedOperationException` for now.
   
   * `ShuffledHashJoinExec.scala`:
   In method `ShuffledHashJoinExec.fullOuterJoin`:
     * If keys are unique in the hash map:
   An `org.apache.spark.util.collection.BitSet` is used to store the key index 
(`Int`) of each matched build-side row.
   
     * If keys are non-unique in the hash map:
   The value index (`Int`) is computed from the order of rows returned by 
`HashedRelation.getWithKeyIndex`/`valuesWithKeyIndex` (they return rows in the 
same order as `BytesToBytesMap`).
   The key index (`Int`) and value index (`Int`) would be packed into one `Long` 
index (e.g. upper 4 bytes for the key, lower 4 bytes for the value), and a Java 
`HashSet` is used to store this <key index, value index> pair for each matched 
build-side row (I still need to verify whether `LongToUnsafeRowMap` can be used 
instead of `HashSet`, but I take it that using `HashSet` is acceptable here).
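
To make the packing in the non-unique case concrete, here is a minimal Java sketch. `PackedRowIndex` and its methods are hypothetical names used only for illustration and are not part of Spark. It assumes the layout mentioned above (key index in the upper 4 bytes, value index in the lower 4 bytes) and a plain `java.util.HashSet<Long>` as the matched-row set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not a Spark class): packs <key index, value index> into one long.
final class PackedRowIndex {
    private PackedRowIndex() {}

    // Key index goes in the upper 32 bits, value index in the lower 32 bits.
    // The mask prevents sign extension of the value index from touching the key bits.
    static long pack(int keyIndex, int valueIndex) {
        return ((long) keyIndex << 32) | (valueIndex & 0xFFFFFFFFL);
    }

    // Recover the key index from the upper 32 bits.
    static int keyIndex(long packed) {
        return (int) (packed >>> 32);
    }

    // Recover the value index from the lower 32 bits.
    static int valueIndex(long packed) {
        return (int) packed;
    }

    public static void main(String[] args) {
        Set<Long> matched = new HashSet<>();
        // Suppose the probe side matched the first and third rows stored under key index 7.
        matched.add(pack(7, 0));
        matched.add(pack(7, 2));

        // A later pass over the build side can test each row against the set.
        for (int valueIndex = 0; valueIndex < 3; valueIndex++) {
            boolean isMatched = matched.contains(pack(7, valueIndex));
            System.out.println("key=7 value=" + valueIndex + " matched=" + isMatched);
        }
    }
}
```

Note that a `HashSet<Long>` boxes every entry, which is part of why a more compact structure may be worth checking.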
   
   Does it sound good as a plan? Thanks.
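
For reference, the two-pass flow for the unique-key case could look roughly like the toy sketch below. Everything in it is an assumption made for illustration: `ToyFullOuterJoin` is a hypothetical class, rows are plain `int` keys, a linear scan stands in for the hash lookup that `getWithKeyIndex` would do, and `java.util.BitSet` stands in for `org.apache.spark.util.collection.BitSet`.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy illustration only (not Spark code): full outer join over int keys,
// where the build side has unique keys and a BitSet tracks matched key indexes.
final class ToyFullOuterJoin {
    private ToyFullOuterJoin() {}

    // Returns joined rows as "stream|build", with "null" for the missing side.
    static List<String> join(int[] streamKeys, int[] buildKeys) {
        BitSet matched = new BitSet(buildKeys.length);
        List<String> out = new ArrayList<>();

        // Pass 1: probe the build side with each stream row and mark matches.
        for (int streamKey : streamKeys) {
            int keyIndex = lookupKeyIndex(buildKeys, streamKey);
            if (keyIndex >= 0) {
                matched.set(keyIndex);
                out.add(streamKey + "|" + buildKeys[keyIndex]);
            } else {
                out.add(streamKey + "|null");
            }
        }

        // Pass 2: walk all build rows by key index and emit the unmatched ones.
        for (int keyIndex = 0; keyIndex < buildKeys.length; keyIndex++) {
            if (!matched.get(keyIndex)) {
                out.add("null|" + buildKeys[keyIndex]);
            }
        }
        return out;
    }

    // Linear scan standing in for a hash lookup; returns -1 when the key is absent.
    static int lookupKeyIndex(int[] buildKeys, int key) {
        for (int i = 0; i < buildKeys.length; i++) {
            if (buildKeys[i] == key) {
                return i;
            }
        }
        return -1;
    }
}
```

The first pass corresponds to probing with the stream side, and the second pass corresponds to iterating the build side with key indexes to emit rows that never matched.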


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
