[ https://issues.apache.org/jira/browse/DRILL-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674746#comment-16674746 ]
Boaz Ben-Zvi commented on DRILL-6825:
-------------------------------------

I didn't explain the last point well enough; here is an example:

    INT:    0789
    BIGINT: 00000789

Suppose, for example, that the hash function starts from the leftmost byte with a seed of zero, hashes each byte, and left-shifts the result, which is then used as the seed for hashing the next byte. So for the INT:

    hash(0,0) => hash(7,0) => hash(8,X) => hash(9,Y)

which would give +*the same result*+ as for the BIGINT:

    hash(0,0) => hash(0,0) => hash(0,0) => ...... => hash(9,Y)

An example where this can be useful: Hash Join performs a probe with an INT key, reading the first input batch; then the next batch changes the key to a BIGINT. Today this would be a "schema change" error. But we would like to be able to handle such "schema evolution" in the future, and if those BIGINT values behaved the same way in the hash table, this "schema evolution" would be much simpler.

> Applying different hash function according to data types and data size
> ----------------------------------------------------------------------
>
>                 Key: DRILL-6825
>                 URL: https://issues.apache.org/jira/browse/DRILL-6825
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Codegen
>            Reporter: weijie.tong
>            Priority: Major
>             Fix For: 1.16.0
>
>
> Different hash functions have different performance according to different
> data types and data size. We should choose a right one to apply not just
> Murmurhash.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
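
The byte-chained hashing described in the comment above can be sketched roughly as follows. This is only an illustration, not Drill's actual hash code: the class name, the `step` mixing function, and its constant are all hypothetical, and a rotate is used in place of the plain left shift so that no bits are lost. The only property the sketch relies on is that `step(0, 0)` returns 0, so leading zero bytes leave the seed untouched.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; names and the mixing constant are assumptions.
class WidthIndependentHash {

    // Mix one byte into the running seed. Note that step(0, 0) == 0.
    static long step(int b, long seed) {
        long mixed = (seed ^ (b & 0xFF)) * 0x9E3779B97F4A7C15L;
        return Long.rotateLeft(mixed, 5);
    }

    // Hash bytes from the leftmost (most significant) byte, starting with seed 0.
    static long hashBytes(byte[] bigEndian) {
        long seed = 0;
        for (byte b : bigEndian) {
            seed = step(b, seed);
        }
        return seed;
    }

    // ByteBuffer is big-endian by default, so the leftmost byte comes first.
    static long hashInt(int v) {
        return hashBytes(ByteBuffer.allocate(4).putInt(v).array());
    }

    static long hashLong(long v) {
        return hashBytes(ByteBuffer.allocate(8).putLong(v).array());
    }

    public static void main(String[] args) {
        // Leading zero bytes keep the seed at 0, so both widths agree.
        System.out.println(hashInt(789) == hashLong(789L));
    }
}
```

In this sketch the equality holds only for non-negative values: a negative INT is sign-extended with 0xFF bytes when widened to BIGINT, so its byte sequence no longer begins with zeros.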