c21 commented on pull request #32210:
URL: https://github.com/apache/spark/pull/32210#issuecomment-854356895


   Sorry for the late update. Went through 
https://cs-people.bu.edu/mathan/reading-groups/papers-classics/join.pdf and 
https://www.db-book.com/db4/slide-dir/ch13.ppt, and I have a new proposal for 
this feature as followed.
   
   At the high level, the feature is:
   
   1. Build the hash table (`HashedRelation`) on build side.
   2. If step 1 succeeds, do shuffled hash join with stream side.
   3. If step 1 fails (either due to OOM `SparkOutOfMemoryError`, or a 
configurable threshold, more details below)
     3.1. Add rest of rows from build side to a sorter 
(`UnsafeExternalRowSorter.insertRow`, e.g. the sorter called `buildSorter`).
     3.2. Process rows from streamed side. Probe hash table first to do 
shuffled hash join. If having matched rows, output and process next row. If no 
matched rows, add the row from stream side to another sorter (`streamedSorter`).
     3.3. After processing all rows from streamed side, clean up hash table, 
and sort `buildSorter` and `streamedSorter` and do sort merge join.
   
   In this way, the effort of building hash table is not wasted compared to 
this PR, and we can control how to do the fallback. Three ways of fallback 
policy I can think of  - (1).default `SparkOutOfMemoryError` in 
`BytesToBytesMap`, (2).introduce configurable threshold for key capacity in 
`BytesToBytesMap` (max number of keys, `BytesToBytesMap.capacity`), 
(3).introduce configurable threshold for used page size in `BytesToBytesMap` 
(max size of keys+values, `BytesToBytesMap.used`).
   
   This solution mimics the idea of `hybrid join` in 
https://cs-people.bu.edu/mathan/reading-groups/papers-classics/join.pdf, where 
it tried to build hash table, and spill rest of rows on disk, and join rest of 
rows on disk later.
   
   WDYT? @cloud-fan, @maropu and @sigmod, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to