c21 commented on pull request #32210: URL: https://github.com/apache/spark/pull/32210#issuecomment-854356895
Sorry for the late update. Went through https://cs-people.bu.edu/mathan/reading-groups/papers-classics/join.pdf and https://www.db-book.com/db4/slide-dir/ch13.ppt, and I have a new proposal for this feature as followed. At the high level, the feature is: 1. Build the hash table (`HashedRelation`) on build side. 2. If step 1 succeeds, do shuffled hash join with stream side. 3. If step 1 fails (either due to OOM `SparkOutOfMemoryError`, or a configurable threshold, more details below) 3.1. Add rest of rows from build side to a sorter (`UnsafeExternalRowSorter.insertRow`, e.g. the sorter called `buildSorter`). 3.2. Process rows from streamed side. Probe hash table first to do shuffled hash join. If having matched rows, output and process next row. If no matched rows, add the row from stream side to another sorter (`streamedSorter`). 3.3. After processing all rows from streamed side, clean up hash table, and sort `buildSorter` and `streamedSorter` and do sort merge join. In this way, the effort of building hash table is not wasted compared to this PR, and we can control how to do the fallback. Three ways of fallback policy I can think of - (1).default `SparkOutOfMemoryError` in `BytesToBytesMap`, (2).introduce configurable threshold for key capacity in `BytesToBytesMap` (max number of keys, `BytesToBytesMap.capacity`), (3).introduce configurable threshold for used page size in `BytesToBytesMap` (max size of keys+values, `BytesToBytesMap.used`). This solution mimics the idea of `hybrid join` in https://cs-people.bu.edu/mathan/reading-groups/papers-classics/join.pdf, where it tried to build hash table, and spill rest of rows on disk, and join rest of rows on disk later. WDYT? @cloud-fan, @maropu and @sigmod, thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org