[ https://issues.apache.org/jira/browse/SPARK-37175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-37175:
------------------------------------
    Assignee: (was: Apache Spark)

Performance improvement to hash joins with many duplicate keys
--------------------------------------------------------------

                 Key: SPARK-37175
                 URL: https://issues.apache.org/jira/browse/SPARK-37175
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Bruce Robbins
            Priority: Major
         Attachments: hash_rel_examples.txt

I noticed that HashedRelations with many duplicate keys perform significantly slower than HashedRelations with a similar number of entries but few or no duplicate keys.

A hypothesis:
* Because of the order in which rows are appended to the map, rows for a given key are typically non-adjacent in memory, resulting in poor locality.
* The map would perform better if all rows for a given key were next to each other in memory.

To test this hypothesis, I made a [somewhat brute-force change to HashedRelation|https://github.com/apache/spark/compare/master...bersprockets:hash_rel_play] to reorganize the map such that all rows for a given key are adjacent in memory.
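The reorganization idea can be illustrated outside of Spark. Below is a minimal Python sketch (this is not Spark's actual HashedRelation code; the function names and data layout are hypothetical): rows arrive interleaved across keys, so an append-order multimap leaves each key's rows scattered through the backing buffer, and a second pass rewrites the buffer so each key's rows occupy one contiguous slice.

```python
from collections import defaultdict

def build_append_order(rows):
    """Naive multimap: rows land in the buffer in arrival order,
    so duplicates of one key end up scattered (poor locality)."""
    buffer = []                   # backing storage, arrival order
    offsets = defaultdict(list)   # key -> positions in buffer
    for key, row in rows:
        offsets[key].append(len(buffer))
        buffer.append(row)
    return buffer, offsets

def reorganize(buffer, offsets):
    """Rewrite the buffer so each key's rows are adjacent; the map
    then only needs a (start, count) pair per key."""
    new_buffer = []
    ranges = {}
    for key, positions in offsets.items():
        start = len(new_buffer)
        for pos in positions:
            new_buffer.append(buffer[pos])
        ranges[key] = (start, len(positions))
    return new_buffer, ranges

def lookup(new_buffer, ranges, key):
    """Probe side: one contiguous scan per key instead of chasing
    scattered positions."""
    start, count = ranges.get(key, (0, 0))
    return new_buffer[start:start + count]

# Interleaved arrival order: key 1's rows are non-adjacent at first.
rows = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (1, "e")]
buf, offs = build_append_order(rows)
new_buf, rng = reorganize(buf, offs)
print(lookup(new_buf, rng, 1))  # ['a', 'c', 'e']
```

The same lookup result is available before and after the rewrite; the only difference is that the duplicates of each key sit in one contiguous run, which is where the cache-locality benefit comes from.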
This yielded some performance improvements, at least in my contrived examples (run on an Intel-based MacBook Pro with 4 cores/8 hyperthreads):

Example 1: Shuffled hash join, LongHashedRelation
* Stream side: 300M rows
* Build side: 90M rows, but only 200K unique keys
* 136G output rows

||Join strategy||Time (in seconds)||Notes||
|Shuffled hash join (no reorganization)|1092| |
|Shuffled hash join (with reorganization)|234|4.6 times faster than regular SHJ|
|Sort merge join|164|This beats the SHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|

Example 2: Broadcast hash join, LongHashedRelation
* Stream side: 350M rows
* Build side: 9M rows, but only 18K unique keys
* 175G output rows

||Join strategy||Time (in seconds)||Notes||
|Broadcast hash join (no reorganization)|872| |
|Broadcast hash join (with reorganization)|263|3 times faster than regular BHJ|
|Sort merge join|174|This beats the BHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|

Example 3: Shuffled hash join, UnsafeHashedRelation
* Stream side: 300M rows
* Build side: 90M rows, but only 200K unique keys
* 135G output rows

||Join strategy||Time (in seconds)||Notes||
|Shuffled hash join (no reorganization)|3154| |
|Shuffled hash join (with reorganization)|533|5.9 times faster|
|Sort merge join|190|This beats the SHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|

Example 4: Broadcast hash join, UnsafeHashedRelation
* Stream side: 70M rows
* Build side: 9M rows, but only 18K unique keys
* 35G output rows

||Join strategy||Time (in seconds)||Notes||
|Broadcast hash join (no reorganization)|849| |
|Broadcast hash join (with reorganization)|130|6.5 times faster|
|Sort merge join|46|This beats the BHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|

The code for these examples is attached here: [^hash_rel_examples.txt]

Even the brute-force approach could be useful in production if it:
* Is toggled by a feature flag
* Reorganizes only when the ratio of keys to rows drops below some threshold
* Falls back to using the original map if building the new map results in a memory-related SparkException

Another incidental lesson is that sort merge join seems to outperform broadcast hash join when the build side has lots of duplicate keys. So maybe a long-term improvement would be to avoid hash joins (broadcast or shuffled) when there are many duplicate keys.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)