Bruce Robbins created SPARK-37175:
-------------------------------------

             Summary: Performance improvement to hash joins with many duplicate 
keys
                 Key: SPARK-37175
                 URL: https://issues.apache.org/jira/browse/SPARK-37175
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Bruce Robbins
         Attachments: hash_rel_examples.txt

I noticed that HashedRelations with many duplicate keys perform significantly 
more slowly than HashedRelations with a similar number of entries but few or no 
duplicate keys.

A hypothesis:
 * Because of the order in which rows are appended to the map, rows for a given 
key are typically non-adjacent in memory, resulting in poor locality.
 * The map would perform better if all rows for a given key were next to each 
other in memory.

To test this hypothesis, I made a [somewhat brute-force change to 
HashedRelation|https://github.com/apache/spark/compare/master...bersprockets:hash_rel_play]
 that reorganizes the map so that all rows for a given key are adjacent in 
memory (a simplified sketch of the idea appears after the examples below). This 
yielded some performance improvements, at least in my contrived examples:

(Run on an Intel-based MacBook Pro with 4 cores/8 hyperthreads):

Example 1:
 Shuffled Hash Join, LongHashedRelation:
 Stream side: 300M rows
 Build side: 90M rows, but only 200K unique keys
 136G output rows
|Join strategy|Time (in seconds)|Notes|
|Shuffled hash join (no reorganization)|1092| |
|Shuffled hash join (with reorganization)|234|4.6 times faster than the regular SHJ|
|Sort merge join|164|This beats the SHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
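
For reference, the kind of contrived setup behind Example 1 can be sketched 
roughly as below. This is illustrative only (it is not the code in the attached 
hash_rel_examples.txt): the constants just mirror the 200K-unique-keys-vs-90M-rows 
shape, and the noop sink is used to materialize the join without writing output.

{code:scala}
// Illustrative sketch only: a build side with far fewer unique keys than rows,
// joined to a larger stream side, with a shuffled hash join hinted on the build side.
// Shrink the row counts for a quick local run.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hash-rel-duplicate-keys-sketch")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")   // keep broadcast joins out of the way
  .config("spark.sql.join.preferSortMergeJoin", "false")  // allow shuffled hash join to be picked
  .getOrCreate()
import spark.implicits._

// Stream side: 300M rows; build side: 90M rows but only 200K unique keys.
val streamSide = spark.range(0, 300000000L).select($"id" % 200000 as "key", $"id" as "a")
val buildSide  = spark.range(0, 90000000L).select($"id" % 200000 as "key", $"id" as "b")

// Hint the build side so the planner builds the HashedRelation on it.
val joined = streamSide.join(buildSide.hint("shuffle_hash"), "key")
joined.write.format("noop").mode("overwrite").save()      // materialize without writing rows anywhere
{code}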

Example 2:
 Broadcast Hash Join, LongHashedRelation:
 Stream side: 350M rows
 Build side: 9M rows, but only 18K unique keys
 175G output rows
|Join strategy|Time (in seconds)|Notes|
|Broadcast hash join (no reorganization)|872| |
|Broadcast hash join (with reorganization)|263|3 times faster than the regular BHJ|
|Sort merge join|174|This beats the BHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|

Example 3:
 Shuffled Hash Join, UnsafeHashedRelation:
 Stream side: 300M rows
 Build side: 90M rows, but only 200K unique keys
 135G output rows
|Join strategy|Time (in seconds)|Notes|
|Shuffled hash join (no reorganization)|3154| |
|Shuffled hash join (with reorganization)|533|5.9 times faster|
|Sort merge join|190|This beats the SHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|

Example 4:
 Broadcast Hash Join, UnsafeHashedRelation:
 Stream side: 70M rows
 Build side: 9M rows, but only 18K unique keys
 35G output rows
|Join strategy|Time (in seconds)|Notes|
|Broadcast hash join (no reorganization)|849| |
|Broadcast hash join (with reorganization)|130|6.5 times faster|
|Sort merge join|46|This beats the BHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
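
The linked branch does the reorganization inside HashedRelation's own off-heap 
row storage. Purely as a toy model of the idea (not the actual patch, and using 
ordinary on-heap Scala collections rather than row pages), the before/after 
shapes look something like this:

{code:scala}
import scala.collection.mutable

// Toy model only: "Row" stands in for an UnsafeRow, and the maps below stand in
// for HashedRelation's off-heap key index and row pages.
type Row = Array[Long]

// Append-order storage: rows arrive in input order, so the rows for a hot key
// end up scattered across the row array and probing walks non-adjacent memory.
def buildAppendOrder(input: Seq[(Long, Row)]): (mutable.LongMap[List[Int]], Array[Row]) = {
  val rows = input.map(_._2).toArray
  val index = mutable.LongMap.empty[List[Int]]
  for (((key, _), pos) <- input.zipWithIndex) {
    index(key) = pos :: index.getOrElse(key, Nil)   // row positions in arrival order
  }
  (index, rows)
}

// Reorganized storage: copy all rows for a given key next to each other, so a
// probe for that key reads one contiguous run of rows.
def reorganize(input: Seq[(Long, Row)]): (mutable.LongMap[(Int, Int)], Array[Row]) = {
  val rows = mutable.ArrayBuffer.empty[Row]
  val index = mutable.LongMap.empty[(Int, Int)]     // key -> (start offset, row count)
  for ((key, group) <- input.groupBy(_._1)) {
    index(key) = (rows.length, group.length)
    rows ++= group.map(_._2)
  }
  (index, rows.toArray)
}
{code}

The real change has to do this copying against the off-heap pages after the 
map is already built, which is why I call it brute force.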

Even the brute-force approach could be useful in production if it
 * is toggled by a feature flag
 * reorganizes only when the ratio of unique keys to rows drops below some 
threshold
 * falls back to the original map if building the new map results in a 
memory-related SparkException (a rough sketch of this shape follows the list).
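
Roughly, the guard-and-fallback shape I have in mind (the flag, threshold, and 
helper names below are made up for illustration; this is not how the linked 
branch is wired in):

{code:scala}
import org.apache.spark.SparkException

// Hypothetical shape only: the flag, threshold, and reorganize function are
// placeholders for whatever the real feature flag and rebuild step would be.
def maybeReorganize[R](
    original: R,                        // the HashedRelation built the usual way
    numUniqueKeys: Long,
    numRows: Long,
    enabled: Boolean,                   // would come from a feature flag / SQL conf
    maxKeyToRowRatio: Double = 0.1)(    // only bother when keys are heavily duplicated
    reorganize: R => R): R = {
  if (!enabled || numRows == 0 || numUniqueKeys.toDouble / numRows > maxKeyToRowRatio) {
    original
  } else {
    try {
      reorganize(original)
    } catch {
      // e.g. the rebuilt map does not fit in memory: keep using the original map
      case _: SparkException => original
    }
  }
}
{code}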

Another incidental lesson is that sort merge join seems to outperform both 
broadcast and shuffled hash joins when the build side has lots of duplicate 
keys. So maybe a long-term improvement would be to avoid hash joins (broadcast 
or shuffle) when there are many duplicate keys.
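
There is no planner support for that today as far as I know, but as an 
illustration of the idea, one can estimate the build side's key duplication by 
hand and steer toward a sort merge join with the existing merge hint (this 
reuses the streamSide/buildSide names from the earlier sketch, and the 0.01 
threshold is arbitrary):

{code:scala}
import org.apache.spark.sql.functions.{approx_count_distinct, count}

// Estimate unique keys vs. rows on the build side; if keys are heavily
// duplicated, prefer a sort merge join over a hash join via the merge hint.
val stats = buildSide.agg(
  count("key").as("rows"),
  approx_count_distinct("key").as("uniqueKeys")
).head()
val keysPerRow = stats.getLong(1).toDouble / stats.getLong(0)

val joined =
  if (keysPerRow < 0.01) {              // many duplicate keys: avoid the hash join
    streamSide.join(buildSide.hint("merge"), "key")
  } else {
    streamSide.join(buildSide, "key")
  }
{code}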


