[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024802#comment-15024802
 ] 

Aman Sinha commented on DRILL-4119:
-----------------------------------

I did some more testing with the sample data.  Here are 3 hash values: 
 - hash64 is the native hash64 computed by XXHash.hash64()
 - hash64_downcast is the same value downcast to int
 - newhash is the the new 32 bit hash value computed by the proposed fix 
(combining the first and last 4 bytes of hash64).  

{noformat}
0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0]) as hash64, 
castInt(hash64(columns[0])) as hash64_downcast, hash32(columns[0]) as newhash 
from dfs.`/Users/asinha/data/sample.csv`;
+-----------------------------------+----------------------+------------------+--------------+
|                id                 |        hash64        | hash64_downcast  | 
  newhash    |
+-----------------------------------+----------------------+------------------+--------------+
| 1a883d005e0ce003b918d737ac697e7c  | 6695077304582944118  | 594687350        | 
2140898336   |
| e4b4388e8865819126cb0e4dcaa7261d  | 2614721709087477964  | -2136387380      | 
-1528820922  |
| 639a06fb09c70cc397666d38a8134af5  | 3943910117127083836  | 359520060        | 
601244263    |
| ae03f853f40c307aa24894e414a6dfdc  | 4320987148691340574  | 214334750        | 
925976565    |
| 2dd3fdace36431e3810437bee1c7e3f1  | 5657579594883017754  | -1719653350      | 
-687608144   |
| 00abdb137380e6ea8cb3e67df40c30dd  | 5039129256017100358  | 573406790        | 
1740892954   |
| d65d4e30ec96a588e82847aca619e4a0  | 550451582126160076   | 716077260        | 
755884032    |
| 956f968866b3151ad472edfcafb579fa  | 39366413145792912    | 1336074640       | 
1328101915   |
| 75577f830d12c86fd1de94d45cfa0715  | 6480730101791620276  | -226984780       | 
-1417129724  |
| 298aa703dbee9e5f303372fe7a764975  | 7844015280248941602  | -2013696990      | 
-350034316   |
+-----------------------------------+----------------------+------------------+--------------+
10 rows selected (0.228 seconds)
{noformat}

A key observation is that all hash64 values are even numbers.   This is not a 
good thing.  I confirmed the behavior over a larger sample of 100 rows.  
However, this seems specific to strings that are 32 chars (or maybe longer, 
although a simple test for a 64 char string did not show the same pattern).  

I then modified the seed value to 1 (default is 0).  This time I got better 
distribution for the hash64:
{noformat}
0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0], 1) as 
hash64, castInt(hash64(columns[0], 1)) as hash64_downcast, hash32(columns[0], 
1) as newhash from dfs.`/Users/asinha/data/sample.csv`;
+-----------------------------------+----------------------+------------------+-------------+
|                id                 |        hash64        | hash64_downcast  | 
  newhash   |
+-----------------------------------+----------------------+------------------+-------------+
| 1a883d005e0ce003b918d737ac697e7c  | 3877569168361489241  | 1211204441       | 
2113824708  |
| e4b4388e8865819126cb0e4dcaa7261d  | 5555510472474498931  | 567154547        | 
1826042916  |
| 639a06fb09c70cc397666d38a8134af5  | 6160367672898924663  | 1827713143       | 
965653941   |
| ae03f853f40c307aa24894e414a6dfdc  | 5573714012720216212  | 533608596        | 
1385691081  |
| 2dd3fdace36431e3810437bee1c7e3f1  | 4742615352245986962  | 284141202        | 
1363050779  |
| 00abdb137380e6ea8cb3e67df40c30dd  | 5870154798330275502  | 185067182        | 
1517362206  |
| d65d4e30ec96a588e82847aca619e4a0  | 5469776233948339425  | 828202209        | 
2058735712  |
| 956f968866b3151ad472edfcafb579fa  | 8671446365158603789  | -1675527155      | 
-462006645  |
| 75577f830d12c86fd1de94d45cfa0715  | 3369914886384026207  | 238584415        | 
553440739   |
| 298aa703dbee9e5f303372fe7a764975  | 3765901389360033496  | 1811181272       | 
1605846404  |
+-----------------------------------+----------------------+------------------+-------------+
10 rows selected (0.263 seconds)
{noformat}

I am thinking we should put in the proposed fix I sent earlier since it 
improves things.  Separately, I think we need to investigate the quality of the 
XXHash.hash64 implementation.   BTW, I also downloaded the original XXHash's C 
implementation and based on an initial analysis that one produces different 
hash value than our implementation and does not seem to have the same 'even 
number' pattern. 

> Skew in hash distribution for varchar (and possibly other) types of data
> ------------------------------------------------------------------------
>
>                 Key: DRILL-4119
>                 URL: https://issues.apache.org/jira/browse/DRILL-4119
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.3.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02          HashAgg(group=[{0}])
> 01-03            Project(SomeId=[$0])
> 01-04              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02                    HashAgg(group=[{0}])
> 03-03                      Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to