[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15020638#comment-15020638 ]
Jacques Nadeau commented on DRILL-4119:
---------------------------------------

Interesting finding. As we've been stung by issues around hash functions before, it seems like we need to have a hash distribution test suite, especially when we make these kinds of changes. Each time we have an issue, then we can add that to the suite. I know one of the issues we had before was hashing null with another value (which we fixed with chaining). I can't remember what other issues we've had.

Your proposal seems reasonable.

> Skew in hash distribution for varchar (and possibly other) types of data
> ------------------------------------------------------------------------
>
>                 Key: DRILL-4119
>                 URL: https://issues.apache.org/jira/browse/DRILL-4119
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.3.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>
> We are seeing substantial skew for an Id column that contains varchar data of
> length 32. It is easily reproducible by a group-by query:
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02 HashAgg(group=[{0}])
> 01-03 Project(SomeId=[$0])
> 01-04 HashToRandomExchange(dist0=[[$0]])
> 02-01 UnorderedMuxExchange
> 03-01 Project(SomeId=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02 HashAgg(group=[{0}])
> 03-03 Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type:
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
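
For illustration, here is a minimal sketch of the kind of check a hash distribution test suite, as suggested in the comment above, might contain. It is not Drill code: the class and method names are hypothetical, and String::hashCode only stands in for whichever hash function is actually under test (the plan in the report uses hash64AsDouble). The sketch generates random 32-character hex ids like the one in the report, assigns them to buckets, and reports how full the fullest bucket is relative to an even share.

```java
import java.util.Random;
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of Apache Drill.
class HashDistributionCheck {

    /** Generates n random 32-character lowercase hex ids, similar to the reported key. */
    static String[] randomHexIds(int n, long seed) {
        Random rnd = new Random(seed);
        char[] hex = "0123456789abcdef".toCharArray();
        String[] ids = new String[n];
        for (int i = 0; i < n; i++) {
            char[] c = new char[32];
            for (int j = 0; j < c.length; j++) {
                c[j] = hex[rnd.nextInt(hex.length)];
            }
            ids[i] = new String(c);
        }
        return ids;
    }

    /** Counts how many keys the given hash function sends to each bucket. */
    static int[] bucketCounts(String[] keys, ToIntFunction<String> hash, int buckets) {
        int[] counts = new int[buckets];
        for (String key : keys) {
            // floorMod keeps the bucket index non-negative for negative hash values.
            counts[Math.floorMod(hash.applyAsInt(key), buckets)]++;
        }
        return counts;
    }

    /** Fullest bucket divided by the ideal even share; 1.0 means perfectly even. */
    static double skew(int[] counts) {
        long total = 0;
        int max = 0;
        for (int c : counts) {
            total += c;
            max = Math.max(max, c);
        }
        return max / ((double) total / counts.length);
    }

    public static void main(String[] args) {
        String[] ids = randomHexIds(100_000, 42L);
        // String::hashCode is a placeholder for the hash function being tested.
        int[] counts = bucketCounts(ids, String::hashCode, 16);
        System.out.printf("skew = %.3f%n", skew(counts));
    }
}
```

A suite built along these lines could assert that the skew stays below some chosen threshold for each data shape that has caused trouble before (for example keys combined with nulls, or fixed-length hex strings), adding a new case each time an issue is found.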