[ https://issues.apache.org/jira/browse/SPARK-37175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-37175:
------------------------------------
    Assignee: (was: Apache Spark)

Performance improvement to hash joins with many duplicate keys
--------------------------------------------------------------

                 Key: SPARK-37175
                 URL: https://issues.apache.org/jira/browse/SPARK-37175
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Bruce Robbins
            Priority: Major
         Attachments: hash_rel_examples.txt

I noticed that HashedRelations with many duplicate keys perform significantly slower than HashedRelations with a similar number of entries but few or no duplicate keys.

A hypothesis:
* Because of the order in which rows are appended to the map, rows for a given key are typically non-adjacent in memory, resulting in poor locality.
* The map would perform better if all rows for a given key were next to each other in memory.

To test this hypothesis, I made a [somewhat brute-force change to HashedRelation|https://github.com/apache/spark/compare/master...bersprockets:hash_rel_play] to reorganize the map such that all rows for a given key are adjacent in memory.
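The reorganization idea can be illustrated outside of Spark. Below is a minimal Python sketch (this is not Spark's actual HashedRelation code; the function names and data layout are hypothetical): rows arrive interleaved across keys, so an append-order multimap leaves each key's rows scattered through the backing buffer, and a second pass rewrites the buffer so each key's rows occupy one contiguous slice.

```python
from collections import defaultdict

def build_append_order(rows):
    """Naive multimap: rows land in the buffer in arrival order,
    so duplicates of one key end up scattered (poor locality)."""
    buffer = []                   # backing storage, arrival order
    offsets = defaultdict(list)   # key -> positions in buffer
    for key, row in rows:
        offsets[key].append(len(buffer))
        buffer.append(row)
    return buffer, offsets

def reorganize(buffer, offsets):
    """Rewrite the buffer so each key's rows are adjacent; the map
    then only needs a (start, count) pair per key."""
    new_buffer = []
    ranges = {}
    for key, positions in offsets.items():
        start = len(new_buffer)
        for pos in positions:
            new_buffer.append(buffer[pos])
        ranges[key] = (start, len(positions))
    return new_buffer, ranges

def lookup(new_buffer, ranges, key):
    """Probe side: one contiguous scan per key instead of chasing
    scattered positions."""
    start, count = ranges.get(key, (0, 0))
    return new_buffer[start:start + count]

# Interleaved arrival order: key 1's rows are non-adjacent at first.
rows = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (1, "e")]
buf, offs = build_append_order(rows)
new_buf, rng = reorganize(buf, offs)
print(lookup(new_buf, rng, 1))  # ['a', 'c', 'e']
```

The same lookup result is available before and after the rewrite; the only difference is that the duplicates of each key sit in one contiguous run, which is where the cache-locality benefit comes from.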
This yielded some performance improvements, at least in my contrived examples (run on an Intel-based MacBook Pro with 4 cores/8 hyperthreads):

Example 1: Shuffled hash join, LongHashedRelation
* Stream side: 300M rows
* Build side: 90M rows, but only 200K unique keys
* 136G output rows

||Join strategy||Time (in seconds)||Notes||
|Shuffled hash join (no reorganization)|1092| |
|Shuffled hash join (with reorganization)|234|4.6 times faster than regular SHJ|
|Sort merge join|164|This beats the SHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|

Example 2: Broadcast hash join, LongHashedRelation
* Stream side: 350M rows
* Build side: 9M rows, but only 18K unique keys
* 175G output rows

||Join strategy||Time (in seconds)||Notes||
|Broadcast hash join (no reorganization)|872| |
|Broadcast hash join (with reorganization)|263|3 times faster than regular BHJ|
|Sort merge join|174|This beats the BHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|

Example 3: Shuffled hash join, UnsafeHashedRelation
* Stream side: 300M rows
* Build side: 90M rows, but only 200K unique keys
* 135G output rows

||Join strategy||Time (in seconds)||Notes||
|Shuffled hash join (no reorganization)|3154| |
|Shuffled hash join (with reorganization)|533|5.9 times faster|
|Sort merge join|190|This beats the SHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|

Example 4: Broadcast hash join, UnsafeHashedRelation
* Stream side: 70M rows
* Build side: 9M rows, but only 18K unique keys
* 35G output rows

||Join strategy||Time (in seconds)||Notes||
|Broadcast hash join (no reorganization)|849| |
|Broadcast hash join (with reorganization)|130|6.5 times faster|
|Sort merge join|46|This beats the BHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|

The code for these examples is attached here: [^hash_rel_examples.txt]

Even the brute-force approach could be useful in production if it:
* Is toggled by a feature flag
* Reorganizes only when the ratio of keys to rows drops below some threshold
* Falls back to using the original map if building the new map results in a memory-related SparkException

Another incidental lesson is that sort merge join seems to outperform broadcast hash join when the build side has lots of duplicate keys. So maybe a long-term improvement would be to avoid hash joins (broadcast or shuffled) when there are many duplicate keys.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)