[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677804#comment-16677804
 ] 

Teddy Choi commented on HIVE-20873:
-----------------------------------

In my case, TPC-H query 21 and TPC-DS query 16 seem to be related to it. TPC-H 
query 21 uses a map join, and TPC-DS query 16 uses a group by. Both of them use 
VectorHashKeyWrapperBatch, which uses VectorHashKeyWrapperSingleLong, which in 
turn uses HashCodeUtil.calculateLongHashCode.

There are other hash algorithms as well, but Murmur3 is already used in Hadoop 
and Hive; see org.apache.hive.common.util.Murmur3 and 
org.apache.hadoop.util.hash.MurmurHash. So I think it would be safe to use 
Murmur3 rather than spend time benchmarking other hash algorithms.
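
To illustrate why a Murmur-style mix helps, here is a minimal sketch. It is 
not the Hive code or the attached patch. The class HashSketch and the methods 
naiveHash, mixedHash and distinctHashes are hypothetical names made up for 
this example; only fmix64 follows the well-known Murmur3 64-bit finalizer.

```java
// Illustrative sketch only; not the Hive implementation or the attached patch.
final class HashSketch {

    // Murmur3 64-bit finalizer (fmix64): spreads every input bit over the output.
    static long fmix64(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    // Hypothetical shift/XOR hash of two longs: cheap, but collision-prone.
    static int naiveHash(long a, long b) {
        return (int) (a ^ (a >>> 32)) ^ (int) (b ^ (b >>> 32));
    }

    // Hypothetical Murmur-style hash: mix the first key, fold in the second,
    // then mix again and keep the low 32 bits.
    static int mixedHash(long a, long b) {
        long h = fmix64(a) * 31 + b;
        return (int) fmix64(h);
    }

    // Counts distinct hash codes over the key pairs (i, i) for 0 <= i < n.
    static int distinctHashes(int n, boolean useMixed) {
        java.util.Set<Integer> seen = new java.util.HashSet<>();
        for (long i = 0; i < n; i++) {
            seen.add(useMixed ? mixedHash(i, i) : naiveHash(i, i));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("naive: " + distinctHashes(1000, false));
        System.out.println("mixed: " + distinctHashes(1000, true));
    }
}
```

In this sketch every pair (i, i) gets the same naiveHash value, because the 
two halves cancel under XOR, and swapping the two keys never changes the 
result. The mixed variant depends on key order and should spread the same 
pairs over nearly as many hash codes as there are keys.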

> Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision
> ------------------------------------------------------------------------
>
>                 Key: HIVE-20873
>                 URL: https://issues.apache.org/jira/browse/HIVE-20873
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Teddy Choi
>            Assignee: Teddy Choi
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-20873.1.patch, HIVE-20873.2.patch
>
>
> VectorHashKeyWrapperTwoLong computes its hash code with only a few bit shift 
> and XOR operators. This keeps computation time short but causes more hash 
> collisions, so group by operations become very slow on large data sets. It 
> needs Murmur hash or a better hash function to reduce hash collisions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
