[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032062#comment-15032062 ]
Aman Sinha commented on DRILL-4119:
-----------------------------------

[~mehant] would it make sense to open a separate JIRA for the underlying XXHash.hash64 implementation? I feel that for hash32, we would still want to avoid downcasting and instead use the mixing as proposed in this JIRA. If you agree, I can merge in my patch.

> Skew in hash distribution for varchar (and possibly other) types of data
> ------------------------------------------------------------------------
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
> Issue Type: Bug
> Components: Functions - Drill
> Affects Versions: 1.3.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of
> length 32. It is easily reproducible by a group-by query:
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02 HashAgg(group=[{0}])
> 01-03 Project(SomeId=[$0])
> 01-04 HashToRandomExchange(dist0=[[$0]])
> 02-01 UnorderedMuxExchange
> 03-01 Project(SomeId=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02 HashAgg(group=[{0}])
> 03-03 Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type:
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
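
To illustrate the downcasting concern raised in the comment, here is a minimal Java sketch. It is not the patch attached to this JIRA and not Drill's code: the class `HashFold`, its two methods, and the sample hash values are hypothetical, chosen only to show how narrowing a 64-bit hash with a plain `(int)` cast discards the high 32 bits, while an XOR-fold (the same fold `Long.hashCode` uses) mixes them into the result.

```java
// Hypothetical illustration only; not the DRILL-4119 patch.
final class HashFold {
    private HashFold() {}

    /** Plain downcast: keeps only the low 32 bits of the 64-bit hash. */
    static int truncate(long hash64) {
        return (int) hash64;
    }

    /** XOR-fold: mixes the high 32 bits into the low 32 bits before narrowing. */
    static int fold(long hash64) {
        return (int) (hash64 ^ (hash64 >>> 32));
    }

    public static void main(String[] args) {
        // Two made-up 64-bit hash values that differ only in their high bits.
        long a = 0x1111111100000000L;
        long b = 0x2222222200000000L;

        // The downcast maps both values to 0, so they collide.
        System.out.println(truncate(a) == truncate(b)); // prints "true"

        // The fold keeps the high-bit difference, so they stay distinct.
        System.out.println(fold(a) == fold(b)); // prints "false"
    }
}
```

Whether the skew reported here comes from the narrowing step, from the underlying 64-bit hash, or from both is exactly the question the comment raises; the sketch only shows why the narrowing step can matter on its own.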