[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024913#comment-15024913 ]
Aman Sinha commented on DRILL-4119:
-----------------------------------

Our hash64 implementation looks similar to the original one but I haven't done enough analysis to say they are exactly the same. The only way to check is through testing. Here are 2 values and their corresponding hash from the original (note, for some reason the command line utility xxh64sum does not read multiple lines from a file, so I had to break up the values into separate files):

{noformat}
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat > sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample1.csv
1a883d005e0ce003b918d737ac697e7c
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample1.csv
1213a50f060e0659  sample1.csv
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample2.csv
e0658433041ce9aa  sample2.csv
{noformat}

These values don't match the value I am getting from Drill after doing the conversion of the long to hex (I used Long.toHexString() method in debugger to convert), so it is possible something may have gotten lost in translation.

> Skew in hash distribution for varchar (and possibly other) types of data
> ------------------------------------------------------------------------
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
> Issue Type: Bug
> Components: Functions - Drill
> Affects Versions: 1.3.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of
> length 32. It is easily reproducible by a group-by query:
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02 HashAgg(group=[{0}])
> 01-03 Project(SomeId=[$0])
> 01-04 HashToRandomExchange(dist0=[[$0]])
> 02-01 UnorderedMuxExchange
> 03-01 Project(SomeId=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02 HashAgg(group=[{0}])
> 03-03 Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type:
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
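
A note on the hex comparison described in the comment: when checking a Java long against xxh64sum output, two things are worth ruling out before concluding that the hash implementations differ. Long.toHexString() drops leading zeros, and a file created with "cat >" normally ends in a newline, so the utility may be hashing 33 bytes rather than the 32-character value. Whether either applies here is not confirmed by the thread. The sketch below is only illustrative: the class and helper names (HashHexCompare, toHex16, withTrailingNewline, matchesReference) are hypothetical and are not Drill or xxHash APIs, and the long in main is a made-up value chosen to have a leading zero digit.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for comparing a Java long hash with xxh64sum-style hex text.
class HashHexCompare {

    // Format a 64-bit hash as 16 zero-padded lowercase hex digits.
    // Long.toHexString() would omit leading zeros instead.
    static String toHex16(long hash) {
        return String.format("%016x", hash);
    }

    // Bytes of a value as a newline-terminated file would store them.
    // Assumption: the sample files end with '\n'; the thread does not confirm this.
    static byte[] withTrailingNewline(String value) {
        return (value + "\n").getBytes(StandardCharsets.UTF_8);
    }

    // Compare numerically by parsing the reference hex as an unsigned 64-bit value,
    // which avoids any dependence on padding or letter case.
    static boolean matchesReference(long hash, String referenceHex) {
        return Long.parseUnsignedLong(referenceHex, 16) == hash;
    }

    public static void main(String[] args) {
        long example = 0x0213a50f060e0659L; // made-up value with a leading zero digit
        System.out.println(Long.toHexString(example)); // 15 digits, leading zero dropped
        System.out.println(toHex16(example));          // 16 digits, zero-padded

        String id = "e4b4388e8865819126cb0e4dcaa7261d";
        System.out.println(id.getBytes(StandardCharsets.UTF_8).length); // bytes of the bare value
        System.out.println(withTrailingNewline(id).length);             // bytes including '\n'
    }
}
```

If the padded hex still differs from the reference after hashing exactly the same bytes, the remaining candidates are the seed and the algorithm itself, which only a test against the reference implementation can settle.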