[ 
https://issues.apache.org/jira/browse/DRILL-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674746#comment-16674746
 ] 

Boaz Ben-Zvi commented on DRILL-6825:
-------------------------------------

I didn't explain the last point well enough; here is an example:
 INT: 0789
 BIGINT: 00000789

Suppose the hash function starts from the leftmost byte with seed zero, 
hashes each byte, left-shifts the result, and uses that as the seed for 
hashing the next byte.

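So for INT: hash(0,0) => hash(7,0) => hash(8,X) => hash(9,Y)
which would give the same result as for the BIGINT: hash(0,0) => hash(0,0) => 
hash(0,0) => ... => hash(7,0) => hash(8,X) => hash(9,Y)
(This relies on hash(0,0) returning 0, so that the leading zero bytes leave 
the seed unchanged.)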
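
To make the idea concrete, here is a minimal Java sketch of such a byte-wise 
hash. It is not Drill's hash function: the class name, the method names and 
the mixing step (XOR, multiply by an odd constant, left shift) are all made up 
for illustration. The only property it is meant to show is that step(0, 0) 
returns 0, so leading zero bytes never change the seed.

```java
import java.nio.ByteBuffer;

// Illustrative only: a toy byte-wise hash, NOT the function used by Drill.
final class WidthInsensitiveHash {

    // Hypothetical per-byte mixing step. It is chosen so that step(0, 0) == 0,
    // which means a zero byte hashed with a zero seed leaves the seed at zero.
    static int step(int b, int seed) {
        return ((seed ^ b) * 0x01000193) << 1;
    }

    // Hash the bytes from left (most significant) to right, starting with
    // seed zero and feeding each result in as the seed for the next byte.
    static int hashBytes(byte[] bigEndian) {
        int seed = 0;
        for (byte b : bigEndian) {
            seed = step(b & 0xFF, seed);
        }
        return seed;
    }

    // ByteBuffer is big-endian by default, so the leading bytes come first.
    static int hashInt(int v) {
        return hashBytes(ByteBuffer.allocate(Integer.BYTES).putInt(v).array());
    }

    static int hashLong(long v) {
        return hashBytes(ByteBuffer.allocate(Long.BYTES).putLong(v).array());
    }

    public static void main(String[] args) {
        // The 4-byte and 8-byte encodings of 789 differ only in leading zeros.
        System.out.println(hashInt(789) == hashLong(789L));
    }
}
```

Note that this sketch only agrees across widths for non-negative values: a 
negative INT is sign-extended with 0xFF bytes when widened to BIGINT, so the 
leading bytes are no longer zeros and would need separate handling.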

An example where this can be useful: Hash Join performs a probe with an INT 
key while reading the first input batch, then the next batch changes the key 
to a BIGINT. Today this would be a "schema change" error. But we would like to 
be able to handle such a "schema evolution" in the future, and if those BIGINT 
values hashed to the same hash-table entries as the matching INT values, this 
"schema evolution" would be much simpler.

> Applying different hash function according to data types and data size
> ----------------------------------------------------------------------
>
>                 Key: DRILL-6825
>                 URL: https://issues.apache.org/jira/browse/DRILL-6825
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Codegen
>            Reporter: weijie.tong
>            Priority: Major
>             Fix For: 1.16.0
>
>
> Different hash functions perform differently depending on the data type 
> and data size. We should choose the right one for each case rather than 
> always applying MurmurHash.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
