[ 
https://issues.apache.org/jira/browse/DRILL-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178472#comment-16178472
 ] 

ASF GitHub Bot commented on DRILL-5816:
---------------------------------------

GitHub user sohami opened a pull request:

    https://github.com/apache/drill/pull/959

    DRILL-5816: Hash function produces skewed results on String values wi…

    …th same leading prefix
    
    Note: Changing hash32 computation to use Murmur3.hash32 instead of the
    int-cast version of Murmur3.hash64

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sohami/drill DRILL-5816

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/959.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #959
    
----
commit 4b9dd5be778307138e5fc60041232c66c6671d75
Author: Sorabh Hamirwasia <[email protected]>
Date:   2017-09-15T22:07:50Z

    DRILL-5816: Hash function produces skewed results on String values with
    same leading prefix
    Note: Changing hash32 computation to use Murmur3.hash32 instead of the
    int-cast version of Murmur3.hash64

----


> Hash function produces skewed results on String values with same leading 
> prefix
> -------------------------------------------------------------------------------
>
>                 Key: DRILL-5816
>                 URL: https://issues.apache.org/jira/browse/DRILL-5816
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Sorabh Hamirwasia
>            Assignee: Sorabh Hamirwasia
>             Fix For: 1.12.0
>
>
> Reported by [~amansinha100]
> Hashing of string values (for the hash exchange) could produce substantial 
> skew for certain types of strings that have the same leading prefix.
> Here's the sample data: (note all strings begin with 'mscId=' followed by 
> numeric values)
> 0: jdbc:drill:drillbit=10.10.103.111> select a from dfs.tmp.vv3 limit 20;
> +---------------------+
> |          a          |
> +---------------------+
> | mscId=100139170495  |
> | mscId=100103806655  |
> | mscId=100229137840  |
> | mscId=100362859440  |
> | mscId=100032583600  |
> | mscId=100125021360  |
> | mscId=100243775920  |
> | mscId=100152820405  |
> | mscId=100084724405  |
> | mscId=100297398970  |
> | mscId=100059560890  |
> | mscId=100106108090  |
> | mscId=100032092090  |
> | mscId=100029460410  |
> | mscId=100110390995  |
> | mscId=100019105235  |
> | mscId=100354644435  |
> | mscId=100288523475  |
> | mscId=100214507475  |
> | mscId=100296418515  |
> +---------------------+
> 20 rows selected (0.33 seconds)
> Here are the hash values produced by the hash function that Drill uses for
> the HashToRandomExchange (note that they are all even numbers):
> 0: jdbc:drill:drillbit=10.10.103.111> select hash32AsDouble(a, 1301011) from 
> dfs.tmp.vv3 limit 20;
> +--------------+
> |    EXPR$0    |
> +--------------+
> | 1180062632   |
> | -1322734784  |
> | 2096701320   |
> | 2075007536   |
> | -1970336592  |
> | 1614574192   |
> | 1592743936   |
> | -1053691072  |
> | -689805200   |
> | 1893061072   |
> | 1660328376   |
> | 1852126136   |
> | 1927731344   |
> | 616840056    |
> | -1997249184  |
> | 1588717872   |
> | 193019624    |
> | 880839008    |
> | 1879415496   |
> | 1726850216   |
> +--------------+
> 20 rows selected (0.311 seconds)
> Taking these hash values mod 56 produces only 1 distinct value, which shows the skew:
> 0: jdbc:drill:drillbit=10.10.103.111> select distinct mod(hash32AsDouble(a, 
> 1301011), 56) from dfs.tmp.vv3 limit 20;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> 1 row selected (1.041 seconds)
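
The skew in the quoted report can be reproduced offline from the 20 hash values it lists. The sketch below is illustrative only and is not part of the pull request: the names SAMPLE_HASHES, NUM_BUCKETS and bucket_counts are made up for this example, and treating the modulus 56 as a bucket count is an assumption based on the mod query in the issue.

```python
from collections import Counter

# Hash values copied from the hash32AsDouble output quoted in the issue.
SAMPLE_HASHES = [
    1180062632, -1322734784, 2096701320, 2075007536, -1970336592,
    1614574192, 1592743936, -1053691072, -689805200, 1893061072,
    1660328376, 1852126136, 1927731344, 616840056, -1997249184,
    1588717872, 193019624, 880839008, 1879415496, 1726850216,
]

# Assumption: 56 is the modulus from the issue's query, used here as the
# number of buckets the hash values are spread over.
NUM_BUCKETS = 56


def bucket_counts(hashes, num_buckets):
    """Count how many hash values fall into each bucket under hash % num_buckets."""
    return Counter(h % num_buckets for h in hashes)


counts = bucket_counts(SAMPLE_HASHES, NUM_BUCKETS)
print(f"{len(counts)} of {NUM_BUCKETS} buckets used: {dict(counts)}")
# prints "1 of 56 buckets used: {0: 20}"
```

Every sampled value is a multiple of 56, so all 20 rows fall into bucket 0 and the other 55 buckets stay empty. Python's % differs from SQL and Java for negative operands, but the result is 0 either way for exact multiples.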



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
