[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-30 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032062#comment-15032062
 ] 

Aman Sinha commented on DRILL-4119:
---

[~mehant] would it make sense to open a separate JIRA for the underlying 
XXHash.hash64 implementation ?  I feel that for hash32, we would still want to 
avoid down casting and instead use the mixing as proposed in this JIRA.  If you 
agree, I can merge in my patch. 

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-30 Thread Mehant Baid (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032229#comment-15032229
 ] 

Mehant Baid commented on DRILL-4119:


I think it makes sense to address that as a separate issue. Patch looks good 
otherwise. +1. 

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032791#comment-15032791
 ] 

ASF GitHub Bot commented on DRILL-4119:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/279


> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024913#comment-15024913
 ] 

Aman Sinha commented on DRILL-4119:
---

Our hash64 implementation looks similar to the original one but I haven't done 
enough analysis to say they are exactly the same.  The only way to check is 
through testing.  Here are 2 values and their corresponding hash from the 
original (note, for some reason the command line utility xxh64sum does not read 
multiple lines from a file, so I had to break up the values into separate 
files): 
{noformat}
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat > sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample1.csv
1a883d005e0ce003b918d737ac697e7c
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample1.csv
1213a50f060e0659  sample1.csv
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample2.csv
e0658433041ce9aa  sample2.csv
{noformat}

These values don't match the value I am getting from Drill  after doing the 
conversion of the long to hex (I used Long.toHexString() method in debugger to 
convert), so it is possible something may have gotten lost in translation. 

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Mehant Baid (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024951#comment-15024951
 ] 

Mehant Baid commented on DRILL-4119:


If we are returning different values from the original implementation then I 
feel we should fix that issue? I can help out to identify the differences.

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025061#comment-15025061
 ] 

Aman Sinha commented on DRILL-4119:
---

Sure, if you want to try out the original version go for it...

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Zelaine Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025071#comment-15025071
 ] 

Zelaine Fong commented on DRILL-4119:
-

Per discussion at today's Drill hangout, Jacques mentioned that one of the 
differences resulting from the port is dealing with Java not supporting 
unsigneds.

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Zelaine Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024866#comment-15024866
 ] 

Zelaine Fong commented on DRILL-4119:
-

[~amansinha100] - when you say "the original XXHash's C implementation and 
based on an initial analysis that one produces different hash value than our 
implementation and does not seem to have the same 'even number' pattern", are 
you saying that our current hash64 implementation is different from the 
original one?  If yes, does that mean something got lost in the translation 
from C to Java?

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024802#comment-15024802
 ] 

Aman Sinha commented on DRILL-4119:
---

I did some more testing with the sample data.  Here are 3 hash values: 
 - hash64 is the native hash64 computed by XXHash.hash64()
 - hash64_downcast is the same value downcast to int
 - newhash is the the new 32 bit hash value computed by the proposed fix 
(combining the first and last 4 bytes of hash64).  

{noformat}
0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0]) as hash64, 
castInt(hash64(columns[0])) as hash64_downcast, hash32(columns[0]) as newhash 
from dfs.`/Users/asinha/data/sample.csv`;
+---+--+--+--+
|id |hash64| hash64_downcast  | 
  newhash|
+---+--+--+--+
| 1a883d005e0ce003b918d737ac697e7c  | 6695077304582944118  | 594687350| 
2140898336   |
| e4b4388e8865819126cb0e4dcaa7261d  | 2614721709087477964  | -2136387380  | 
-1528820922  |
| 639a06fb09c70cc397666d38a8134af5  | 3943910117127083836  | 359520060| 
601244263|
| ae03f853f40c307aa24894e414a6dfdc  | 4320987148691340574  | 214334750| 
925976565|
| 2dd3fdace36431e3810437bee1c7e3f1  | 5657579594883017754  | -1719653350  | 
-687608144   |
| 00abdb137380e6ea8cb3e67df40c30dd  | 5039129256017100358  | 573406790| 
1740892954   |
| d65d4e30ec96a588e82847aca619e4a0  | 550451582126160076   | 716077260| 
755884032|
| 956f968866b3151ad472edfcafb579fa  | 39366413145792912| 1336074640   | 
1328101915   |
| 75577f830d12c86fd1de94d45cfa0715  | 6480730101791620276  | -226984780   | 
-1417129724  |
| 298aa703dbee9e5f303372fe7a764975  | 7844015280248941602  | -2013696990  | 
-350034316   |
+---+--+--+--+
10 rows selected (0.228 seconds)
{noformat}

A key observation is that all hash64 values are even numbers.   This is not a 
good thing.  I confirmed the behavior over a larger sample of 100 rows.  
However, this seems specific to strings that are 32 chars (or maybe longer, 
although a simple test for a 64 char string did not show the same pattern).  

I then modified the seed value to 1 (default is 0).  This time I got better 
distribution for the hash64:
{noformat}
0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0], 1) as 
hash64, castInt(hash64(columns[0], 1)) as hash64_downcast, hash32(columns[0], 
1) as newhash from dfs.`/Users/asinha/data/sample.csv`;
+---+--+--+-+
|id |hash64| hash64_downcast  | 
  newhash   |
+---+--+--+-+
| 1a883d005e0ce003b918d737ac697e7c  | 3877569168361489241  | 1211204441   | 
2113824708  |
| e4b4388e8865819126cb0e4dcaa7261d  | 510472474498931  | 567154547| 
1826042916  |
| 639a06fb09c70cc397666d38a8134af5  | 6160367672898924663  | 1827713143   | 
965653941   |
| ae03f853f40c307aa24894e414a6dfdc  | 5573714012720216212  | 533608596| 
1385691081  |
| 2dd3fdace36431e3810437bee1c7e3f1  | 4742615352245986962  | 284141202| 
1363050779  |
| 00abdb137380e6ea8cb3e67df40c30dd  | 5870154798330275502  | 185067182| 
1517362206  |
| d65d4e30ec96a588e82847aca619e4a0  | 5469776233948339425  | 828202209| 
2058735712  |
| 956f968866b3151ad472edfcafb579fa  | 8671446365158603789  | -1675527155  | 
-462006645  |
| 75577f830d12c86fd1de94d45cfa0715  | 3369914886384026207  | 238584415| 
553440739   |
| 298aa703dbee9e5f303372fe7a764975  | 3765901389360033496  | 1811181272   | 
1605846404  |
+---+--+--+-+
10 rows selected (0.263 seconds)
{noformat}

I am thinking we should put in the proposed fix I sent earlier since it 
improves things.  Separately, I think we need to investigate the quality of the 
XXHash.hash64 implementation.   BTW, I also downloaded the original XXHash's C 
implementation and based on an initial analysis that one produces different 
hash value than our implementation and does not seem to have the same 'even 
number' pattern. 

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>

[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-23 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022700#comment-15022700
 ] 

Aman Sinha commented on DRILL-4119:
---

As discussed in a prior comment, I have created DRILL-4122 for adding new unit 
tests for hashing. 

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Mehant Baid
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-23 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022404#comment-15022404
 ] 

Aman Sinha commented on DRILL-4119:
---

Submitted a PR.  [~mehant] could you pls review ?  The main change is in 
XXHash.java.  Rest are supporting changes. 

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022713#comment-15022713
 ] 

ASF GitHub Bot commented on DRILL-4119:
---

Github user mehant commented on a diff in the pull request:

https://github.com/apache/drill/pull/279#discussion_r45643009
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Hash32AsDouble.java
 ---
@@ -0,0 +1,340 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.impl;
+
+import org.apache.drill.exec.expr.DrillSimpleFunc;
+import org.apache.drill.exec.expr.annotations.FunctionTemplate;
+import org.apache.drill.exec.expr.annotations.Output;
+import org.apache.drill.exec.expr.annotations.Param;
+import 
org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.Decimal18Holder;
+import org.apache.drill.exec.expr.holders.Decimal28SparseHolder;
+import org.apache.drill.exec.expr.holders.Decimal38SparseHolder;
+import org.apache.drill.exec.expr.holders.Decimal9Holder;
+import org.apache.drill.exec.expr.holders.Float4Holder;
+import org.apache.drill.exec.expr.holders.Float8Holder;
+import org.apache.drill.exec.expr.holders.IntHolder;
+import org.apache.drill.exec.expr.holders.NullableBigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableDecimal18Holder;
+import org.apache.drill.exec.expr.holders.NullableDecimal28SparseHolder;
+import org.apache.drill.exec.expr.holders.NullableDecimal38SparseHolder;
+import org.apache.drill.exec.expr.holders.NullableDecimal9Holder;
+import org.apache.drill.exec.expr.holders.NullableFloat4Holder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.NullableIntHolder;
+
+/*
+ * Class contains hash function definitions for different data types.
+ *
+ * NOTE: These functions are used internally by Drill to perform hash 
distribution and in hash join. For
+ * numeric data types we would like to apply implicit casts in the join 
method however for this to work
+ * as expected we would need to hash the same value represented in 
different data types (int, bigint, float etc)
+ * to hash to the same node, this is why we cast all numeric values to 
double before performing the actual hash.
+ */
+public class Hash32AsDouble {
+  @FunctionTemplate(name = "hash32AsDouble", scope = FunctionScope.SIMPLE, 
nulls = FunctionTemplate.NullHandling.INTERNAL)
+  public static class NullableFloatHash implements DrillSimpleFunc {
+
+@Param
+NullableFloat4Holder in;
+@Output
+IntHolder out;
+
+public void setup() {
+}
+
+public void eval() {
+  if (in.isSet == 0) {
+out.value = 0;
+  } else {
+out.value = 
org.apache.drill.exec.expr.fn.impl.XXHash.hash32((double) in.value, 0);
+  }
+}
+  }
+
+  @FunctionTemplate(name = "hash32AsDouble", scope = FunctionScope.SIMPLE, 
nulls = FunctionTemplate.NullHandling.INTERNAL)
+  public static class FloatHash implements DrillSimpleFunc {
+
+@Param
+Float4Holder in;
+@Output
+IntHolder out;
+
+public void setup() {
+}
+
+public void eval() {
+  out.value = 
org.apache.drill.exec.expr.fn.impl.XXHash.hash32((double) in.value, 0);
+}
+  }
+
+  @FunctionTemplate(name = "hash32AsDouble", scope = FunctionScope.SIMPLE, 
nulls = FunctionTemplate.NullHandling.INTERNAL)
+  public static class NullableDoubleHash implements DrillSimpleFunc {
+
+@Param
+NullableFloat8Holder in;
+@Output
+IntHolder out;
+
+public void setup() {
+}
+
+public void eval() {
+  if (in.isSet == 0) {
+out.value = 0;
+  } else 

[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022389#comment-15022389
 ] 

ASF GitHub Bot commented on DRILL-4119:
---

GitHub user amansinha100 opened a pull request:

https://github.com/apache/drill/pull/279

DRILL-4119: Modify hash32 functions to combine the msb and lsb bytes …

…of a 64-bit hash value (previously, we were casting to integer).

 - Use this new set of functions (for all data types) for creating the hash 
values needed for hash distribution, hash joins etc.
 - Rename HashFunctions to Hash32Functions to be consistent with the Hash64 
counterpart.
 - Many data types did not have a hash32AsDouble equivalent...added these.
 - Add hash32 functions with seed.
 - Fix unit tests, add "hash" as a synonym for "hash32".

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/amansinha100/incubator-drill hashfunc2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #279


commit 113d8adf70ffd359c747a5076da30e0c6a82f39b
Author: Aman Sinha 
Date:   2015-11-21T18:23:52Z

DRILL-4119: Modify hash32 functions to combine the msb and lsb bytes of a 
64-bit hash value (previously, we were casting to integer).
 - Use this new set of functions (for all data types) for creating the hash 
values needed for hash distribution, hash joins etc.
 - Rename HashFunctions to Hash32Functions to be consistent with the Hash64 
counterpart.
 - Many data types did not have a hash32AsDouble equivalent...added these.
 - Add hash32 functions with seed.
 - Fix unit tests, add "hash" as a synonym for "hash32".




> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-21 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020627#comment-15020627
 ] 

Aman Sinha commented on DRILL-4119:
---

The problem comes from the cast to integer after computing the hash value:  
castInt(hash64AsDouble($0)).   I verified that the hash64AsDouble produces good 
distribution for the hash value but the cast loses the precision.   The 
hash-based operators all use a 32 bit hash value (for smaller memory footprint 
and related reasons), so we do need the integer value but should preserve as 
much as possible the underlying distribution. 

I am fixing this by ensuring that instead of casting to int,  the underlying 
hash function itself computes a 32 bit hash value by first computing the 64 bit 
hash followed by XORing the most significant 4 bytes with the least significant 
4 bytes.   The current  hash32 functions in XXHash.java (for example, see 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/XXHash.java#L198)
 are currently calling hash64 and then casting to int.  I am proposing to 
change these to use the above mechanism of combining the msb and lsb bytes.  
The cpu cost should be relatively small.  

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-21 Thread Jacques Nadeau (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020672#comment-15020672
 ] 

Jacques Nadeau commented on DRILL-4119:
---

Sounds good.

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-21 Thread Jacques Nadeau (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020638#comment-15020638
 ] 

Jacques Nadeau commented on DRILL-4119:
---

Interesting finding. As we've been stung by issues around hash functions 
before, it seems like we need to have a hash distribution test suite, 
especially when we make these kinds of changes. Each time we have an issue, 
then we can add that to the suite. I know one of the issues we had before was 
hashing null with another value (which we fixed with chaining). I can't 
remember what other issues we've had.

Your proposal seems reasonable.

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-21 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020665#comment-15020665
 ] 

Aman Sinha commented on DRILL-4119:
---

Yes, it would be useful to have a suite for the hashing.  The number of 
combinations is large:  num_data_types x nullability x num_hash_function_types 
(32bit, 64bit, AsDouble variations).  Plus, the nature of the data itself - we 
need real world data for testing the quality of the distribution.  Let me see 
if I can at least have a minimal test suite with some sample of the above 
combinations.   I may end up creating a separate JIRA.

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)