[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024913#comment-15024913
 ] 

Aman Sinha commented on DRILL-4119:
-----------------------------------

Our hash64 implementation looks similar to the original one (xxHash), but I 
haven't done enough analysis to say they are exactly the same.  The only way to 
check is through testing.  Here are 2 values and their corresponding hashes 
from the original.  (For some reason the command-line utility xxh64sum does not 
read multiple lines from a file, so I had to split the values into separate 
files.) 
{noformat}
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat > sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample1.csv
1a883d005e0ce003b918d737ac697e7c
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample1.csv
1213a50f060e0659  sample1.csv
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample2.csv
e0658433041ce9aa  sample2.csv
{noformat}

These values don't match the values I am getting from Drill after converting 
the long to hex (I used the Long.toHexString() method in the debugger), so it 
is possible something got lost in translation. 
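
One thing worth ruling out in the comparison above: Long.toHexString() drops 
leading zeros, while xxh64sum prints a fixed 16 hex digits.  Below is a minimal 
sketch for converting in both directions so the two sides can be compared 
either as strings or as longs.  The class and method names are hypothetical 
and are not part of Drill.  This would not explain a complete mismatch on its 
own; if the sample files were created with cat as shown, they likely end in a 
newline that xxh64sum hashes along with the 32 characters, which is another 
thing to check. 

```java
// Hypothetical helper, not Drill code: normalizes 64-bit hashes for comparison.
final class HashHexCompare {

    // Format a 64-bit hash as exactly 16 lowercase hex digits (zero-padded),
    // matching the fixed-width digest that xxh64sum prints.
    public static String toFixedHex(long hash) {
        return String.format("%016x", hash);
    }

    // Parse a 16-digit hex digest into a Java long. Digests with the top bit
    // set become negative longs, which is how they appear in a debugger.
    public static long parseHex(String hex) {
        return Long.parseUnsignedLong(hex, 16);
    }

    public static void main(String[] args) {
        // Digests reported by xxh64sum for the two sample files.
        long expected1 = parseHex("1213a50f060e0659");
        long expected2 = parseHex("e0658433041ce9aa");

        // Signed values to compare against what the debugger shows.
        System.out.println(expected1 + " -> " + toFixedHex(expected1));
        System.out.println(expected2 + " -> " + toFixedHex(expected2));

        // Long.toHexString() omits leading zeros; the padded form does not.
        long small = 0xffL;
        System.out.println(Long.toHexString(small)); // ff
        System.out.println(toFixedHex(small));       // 00000000000000ff
    }
}
```

Comparing the parsed long against the value Drill returns avoids any 
string-formatting differences altogether. 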

> Skew in hash distribution for varchar (and possibly other) types of data
> ------------------------------------------------------------------------
>
>                 Key: DRILL-4119
>                 URL: https://issues.apache.org/jira/browse/DRILL-4119
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.3.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02          HashAgg(group=[{0}])
> 01-03            Project(SomeId=[$0])
> 01-04              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(SomeId=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02                    HashAgg(group=[{0}])
> 03-03                      Project(SomeId=[$0])
> {noformat}
> The string ids have the following form: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}


