[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032062#comment-15032062 ] Aman Sinha commented on DRILL-4119: --- [~mehant] would it make sense to open a separate JIRA for the underlying XXHash.hash64 implementation ? I feel that for hash32, we would still want to avoid down casting and instead use the mixing as proposed in this JIRA. If you agree, I can merge in my patch. > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > Fix For: 1.4.0 > > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032229#comment-15032229 ] Mehant Baid commented on DRILL-4119: I think it makes sense to address that as a separate issue. Patch looks good otherwise. +1. > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > Fix For: 1.4.0 > > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032791#comment-15032791 ] ASF GitHub Bot commented on DRILL-4119: --- Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/279 > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > Fix For: 1.4.0 > > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024913#comment-15024913 ] Aman Sinha commented on DRILL-4119: --- Our hash64 implementation looks similar to the original one but I haven't done enough analysis to say they are exactly the same. The only way to check is through testing. Here are 2 values and their corresponding hash from the original (note, for some reason the command line utility xxh64sum does not read multiple lines from a file, so I had to break up the values into separate files): {noformat} Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat > sample2.csv e4b4388e8865819126cb0e4dcaa7261d Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample1.csv 1a883d005e0ce003b918d737ac697e7c Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample2.csv e4b4388e8865819126cb0e4dcaa7261d Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample1.csv 1213a50f060e0659 sample1.csv Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample2.csv e0658433041ce9aa sample2.csv {noformat} These values don't match the value I am getting from Drill after doing the conversion of the long to hex (I used Long.toHexString() method in debugger to convert), so it is possible something may have gotten lost in translation. > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > Fix For: 1.4.0 > > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024951#comment-15024951 ] Mehant Baid commented on DRILL-4119: If we are returning different values from the original implementation then I feel we should fix that issue? I can help out to identify the differences. > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > Fix For: 1.4.0 > > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025061#comment-15025061 ] Aman Sinha commented on DRILL-4119: --- Sure, if you want to try out the original version go for it... > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > Fix For: 1.4.0 > > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025071#comment-15025071 ] Zelaine Fong commented on DRILL-4119: - Per discussion at today's Drill hangout, Jacques mentioned that one of the differences resulting from the port is dealing with Java not supporting unsigneds. > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > Fix For: 1.4.0 > > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024866#comment-15024866 ] Zelaine Fong commented on DRILL-4119: - [~amansinha100] - when you say "the original XXHash's C implementation and based on an initial analysis that one produces different hash value than our implementation and does not seem to have the same 'even number' pattern", are you saying that our current hash64 implementation is different from the original one? If yes, does that mean something got lost in the translation from C to Java? > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > Fix For: 1.4.0 > > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024802#comment-15024802 ] Aman Sinha commented on DRILL-4119: --- I did some more testing with the sample data. Here are 3 hash values: - hash64 is the native hash64 computed by XXHash.hash64() - hash64_downcast is the same value downcast to int - newhash is the the new 32 bit hash value computed by the proposed fix (combining the first and last 4 bytes of hash64). {noformat} 0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0]) as hash64, castInt(hash64(columns[0])) as hash64_downcast, hash32(columns[0]) as newhash from dfs.`/Users/asinha/data/sample.csv`; +---+--+--+--+ |id |hash64| hash64_downcast | newhash| +---+--+--+--+ | 1a883d005e0ce003b918d737ac697e7c | 6695077304582944118 | 594687350| 2140898336 | | e4b4388e8865819126cb0e4dcaa7261d | 2614721709087477964 | -2136387380 | -1528820922 | | 639a06fb09c70cc397666d38a8134af5 | 3943910117127083836 | 359520060| 601244263| | ae03f853f40c307aa24894e414a6dfdc | 4320987148691340574 | 214334750| 925976565| | 2dd3fdace36431e3810437bee1c7e3f1 | 5657579594883017754 | -1719653350 | -687608144 | | 00abdb137380e6ea8cb3e67df40c30dd | 5039129256017100358 | 573406790| 1740892954 | | d65d4e30ec96a588e82847aca619e4a0 | 550451582126160076 | 716077260| 755884032| | 956f968866b3151ad472edfcafb579fa | 39366413145792912| 1336074640 | 1328101915 | | 75577f830d12c86fd1de94d45cfa0715 | 6480730101791620276 | -226984780 | -1417129724 | | 298aa703dbee9e5f303372fe7a764975 | 7844015280248941602 | -2013696990 | -350034316 | +---+--+--+--+ 10 rows selected (0.228 seconds) {noformat} A key observation is that all hash64 values are even numbers. This is not a good thing. I confirmed the behavior over a larger sample of 100 rows. However, this seems specific to strings that are 32 chars (or maybe longer, although a simple test for a 64 char string did not show the same pattern). I then modified the seed value to 1 (default is 0). This time I got better distribution for the hash64: {noformat} 0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0], 1) as hash64, castInt(hash64(columns[0], 1)) as hash64_downcast, hash32(columns[0], 1) as newhash from dfs.`/Users/asinha/data/sample.csv`; +---+--+--+-+ |id |hash64| hash64_downcast | newhash | +---+--+--+-+ | 1a883d005e0ce003b918d737ac697e7c | 3877569168361489241 | 1211204441 | 2113824708 | | e4b4388e8865819126cb0e4dcaa7261d | 510472474498931 | 567154547| 1826042916 | | 639a06fb09c70cc397666d38a8134af5 | 6160367672898924663 | 1827713143 | 965653941 | | ae03f853f40c307aa24894e414a6dfdc | 5573714012720216212 | 533608596| 1385691081 | | 2dd3fdace36431e3810437bee1c7e3f1 | 4742615352245986962 | 284141202| 1363050779 | | 00abdb137380e6ea8cb3e67df40c30dd | 5870154798330275502 | 185067182| 1517362206 | | d65d4e30ec96a588e82847aca619e4a0 | 5469776233948339425 | 828202209| 2058735712 | | 956f968866b3151ad472edfcafb579fa | 8671446365158603789 | -1675527155 | -462006645 | | 75577f830d12c86fd1de94d45cfa0715 | 3369914886384026207 | 238584415| 553440739 | | 298aa703dbee9e5f303372fe7a764975 | 3765901389360033496 | 1811181272 | 1605846404 | +---+--+--+-+ 10 rows selected (0.263 seconds) {noformat} I am thinking we should put in the proposed fix I sent earlier since it improves things. Separately, I think we need to investigate the quality of the XXHash.hash64 implementation. BTW, I also downloaded the original XXHash's C implementation and based on an initial analysis that one produces different hash value than our implementation and does not seem to have the same 'even number' pattern. > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022700#comment-15022700 ] Aman Sinha commented on DRILL-4119: --- As discussed in a prior comment, I have created DRILL-4122 for adding new unit tests for hashing. > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Mehant Baid > Fix For: 1.4.0 > > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022404#comment-15022404 ] Aman Sinha commented on DRILL-4119: --- Submitted a PR. [~mehant] could you pls review ? The main change is in XXHash.java. Rest are supporting changes. > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > Fix For: 1.4.0 > > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022713#comment-15022713 ] ASF GitHub Bot commented on DRILL-4119: --- Github user mehant commented on a diff in the pull request: https://github.com/apache/drill/pull/279#discussion_r45643009 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Hash32AsDouble.java --- @@ -0,0 +1,340 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.expr.fn.impl; + +import org.apache.drill.exec.expr.DrillSimpleFunc; +import org.apache.drill.exec.expr.annotations.FunctionTemplate; +import org.apache.drill.exec.expr.annotations.Output; +import org.apache.drill.exec.expr.annotations.Param; +import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope; +import org.apache.drill.exec.expr.holders.BigIntHolder; +import org.apache.drill.exec.expr.holders.Decimal18Holder; +import org.apache.drill.exec.expr.holders.Decimal28SparseHolder; +import org.apache.drill.exec.expr.holders.Decimal38SparseHolder; +import org.apache.drill.exec.expr.holders.Decimal9Holder; +import org.apache.drill.exec.expr.holders.Float4Holder; +import org.apache.drill.exec.expr.holders.Float8Holder; +import org.apache.drill.exec.expr.holders.IntHolder; +import org.apache.drill.exec.expr.holders.NullableBigIntHolder; +import org.apache.drill.exec.expr.holders.NullableDecimal18Holder; +import org.apache.drill.exec.expr.holders.NullableDecimal28SparseHolder; +import org.apache.drill.exec.expr.holders.NullableDecimal38SparseHolder; +import org.apache.drill.exec.expr.holders.NullableDecimal9Holder; +import org.apache.drill.exec.expr.holders.NullableFloat4Holder; +import org.apache.drill.exec.expr.holders.NullableFloat8Holder; +import org.apache.drill.exec.expr.holders.NullableIntHolder; + +/* + * Class contains hash function definitions for different data types. + * + * NOTE: These functions are used internally by Drill to perform hash distribution and in hash join. For + * numeric data types we would like to apply implicit casts in the join method however for this to work + * as expected we would need to hash the same value represented in different data types (int, bigint, float etc) + * to hash to the same node, this is why we cast all numeric values to double before performing the actual hash. + */ +public class Hash32AsDouble { + @FunctionTemplate(name = "hash32AsDouble", scope = FunctionScope.SIMPLE, nulls = FunctionTemplate.NullHandling.INTERNAL) + public static class NullableFloatHash implements DrillSimpleFunc { + +@Param +NullableFloat4Holder in; +@Output +IntHolder out; + +public void setup() { +} + +public void eval() { + if (in.isSet == 0) { +out.value = 0; + } else { +out.value = org.apache.drill.exec.expr.fn.impl.XXHash.hash32((double) in.value, 0); + } +} + } + + @FunctionTemplate(name = "hash32AsDouble", scope = FunctionScope.SIMPLE, nulls = FunctionTemplate.NullHandling.INTERNAL) + public static class FloatHash implements DrillSimpleFunc { + +@Param +Float4Holder in; +@Output +IntHolder out; + +public void setup() { +} + +public void eval() { + out.value = org.apache.drill.exec.expr.fn.impl.XXHash.hash32((double) in.value, 0); +} + } + + @FunctionTemplate(name = "hash32AsDouble", scope = FunctionScope.SIMPLE, nulls = FunctionTemplate.NullHandling.INTERNAL) + public static class NullableDoubleHash implements DrillSimpleFunc { + +@Param +NullableFloat8Holder in; +@Output +IntHolder out; + +public void setup() { +} + +public void eval() { + if (in.isSet == 0) { +out.value = 0; + } else
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022389#comment-15022389 ] ASF GitHub Bot commented on DRILL-4119: --- GitHub user amansinha100 opened a pull request: https://github.com/apache/drill/pull/279 DRILL-4119: Modify hash32 functions to combine the msb and lsb bytes … …of a 64-bit hash value (previously, we were casting to integer). - Use this new set of functions (for all data types) for creating the hash values needed for hash distribution, hash joins etc. - Rename HashFunctions to Hash32Functions to be consistent with the Hash64 counterpart. - Many data types did not have a hash32AsDouble equivalent...added these. - Add hash32 functions with seed. - Fix unit tests, add "hash" as a synonym for "hash32". You can merge this pull request into a Git repository by running: $ git pull https://github.com/amansinha100/incubator-drill hashfunc2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/279.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #279 commit 113d8adf70ffd359c747a5076da30e0c6a82f39b Author: Aman SinhaDate: 2015-11-21T18:23:52Z DRILL-4119: Modify hash32 functions to combine the msb and lsb bytes of a 64-bit hash value (previously, we were casting to integer). - Use this new set of functions (for all data types) for creating the hash values needed for hash distribution, hash joins etc. - Rename HashFunctions to Hash32Functions to be consistent with the Hash64 counterpart. - Many data types did not have a hash32AsDouble equivalent...added these. - Add hash32 functions with seed. - Fix unit tests, add "hash" as a synonym for "hash32". > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > Fix For: 1.4.0 > > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020627#comment-15020627 ] Aman Sinha commented on DRILL-4119: --- The problem comes from the cast to integer after computing the hash value: castInt(hash64AsDouble($0)). I verified that the hash64AsDouble produces good distribution for the hash value but the cast loses the precision. The hash-based operators all use a 32 bit hash value (for smaller memory footprint and related reasons), so we do need the integer value but should preserve as much as possible the underlying distribution. I am fixing this by ensuring that instead of casting to int, the underlying hash function itself computes a 32 bit hash value by first computing the 64 bit hash followed by XORing the most significant 4 bytes with the least significant 4 bytes. The current hash32 functions in XXHash.java (for example, see https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/XXHash.java#L198) are currently calling hash64 and then casting to int. I am proposing to change these to use the above mechanism of combining the msb and lsb bytes. The cpu cost should be relatively small. > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020672#comment-15020672 ] Jacques Nadeau commented on DRILL-4119: --- Sounds good. > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020638#comment-15020638 ] Jacques Nadeau commented on DRILL-4119: --- Interesting finding. As we've been stung by issues around hash functions before, it seems like we need to have a hash distribution test suite, especially when we make these kinds of changes. Each time we have an issue, then we can add that to the suite. I know one of the issues we had before was hashing null with another value (which we fixed with chaining). I can't remember what other issues we've had. Your proposal seems reasonable. > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020665#comment-15020665 ] Aman Sinha commented on DRILL-4119: --- Yes, it would be useful to have a suite for the hashing. The number of combinations is large: num_data_types x nullability x num_hash_function_types (32bit, 64bit, AsDouble variations). Plus, the nature of the data itself - we need real world data for testing the quality of the distribution. Let me see if I can at least have a minimal test suite with some sample of the above combinations. I may end up creating a separate JIRA. > Skew in hash distribution for varchar (and possibly other) types of data > > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.3.0 >Reporter: Aman Sinha >Assignee: Aman Sinha > > We are seeing substantial skew for an Id column that contains varchar data of > length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01UnorderedMuxExchange > 03-01 Project(SomeId=[$0], > E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)