[
https://issues.apache.org/jira/browse/DRILL-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178468#comment-16178468
]
Sorabh Hamirwasia commented on DRILL-5816:
------------------------------------------
I ran the test with above data and getting hash64 (MSB and LSB of 128 bit hash
code )instead of int casted hash64 (as done today). In this case distribution
looks better. So looks like 64bit has code doesn't have that bad of an issue.
But if you compare the LSB and MSB 64bit then LSB still has better
distribution. So that does point that MSB bits are less randomly generated at
least for this case as compared to LSB bits.
But given hash32 is showing better distribution without much performance
difference. I still think we should use hash32 implementation for getting
hash32 codes.
Running
org.apache.drill.exec.expr.fn.impl.TestMurmur3HashFunction#testHash64WithDrillBufIssueData
=========================================================================
Results for test: testHash64WithDrillBufIssueData
Distribution for Murmur32: 18
Distribution for Murmur64_MSB: 10
Distribution for Murmur64_LSB: 19
Stats for Murmur32 collision count: Min: 1, Median: 1, Max: 2
Stats for Murmur64_MSB collision count: Min: 1, Median: 1, Max: 4
Stats for Murmur64_LSB collision count: Min: 1, Median: 1, Max: 2
Time for Murmur32 - Avg Time: 150 and Max Time: 1000
Time for Murmur64_MSB - Avg Time: 200 and Max Time: 2000
Time for Murmur64_LSB - Avg Time: 100 and Max Time: 1000
Original distribution with int casted output of hash64:
Running
org.apache.drill.exec.expr.fn.impl.TestMurmur3HashFunction#testHash32WithDrillBufIssueData
— Used the issue data shared in JIRA
=========================================================================
Results for test: testHash32WithDrillBufIssueData
Distribution for Murmur32: 18
Distribution for Murmur64_MSB: 1
Distribution for Murmur64_LSB: 19
Stats for Murmur32 collision count: Min: 1, Median: 1, Max: 2
Stats for Murmur64_MSB collision count: Min: 20, Median: 20, Max: 20
Stats for Murmur64_LSB collision count: Min: 1, Median: 1, Max: 2
Time for Murmur32 - Avg Time: 300 and Max Time: 2000
Time for Murmur64_MSB - Avg Time: 250 and Max Time: 3000
Time for Murmur64_LSB - Avg Time: 50 and Max Time: 1000
=======================================
I ran few tests for the HashFunction to compute 32 bit HashCode. Today we use
Hash128 internally and get higher 64 bits and then int cast it for Hash32 hash
codes. There is an alternate implementation to directly compute the Hash32
codes as well. So my test compare Hash32 / int casted MSB's out of Hash128 and
int casted LSB's out of Hash128. I don't see any issue with Hash32 and it's
performance is also similar or better in some cases compared to Hash64 to get
32bit hash codes. Also based on [this|http://yonik.com/murmurhash3-for-java/]
blog it recommends that for 32bit hash code we should use Murmur3 Hash32
implementation and for others we should use 128bit implementation.
_It looks there were vulnerabilities in MurmurHash2 which was resolved in
MurmurHash3 implementations:
Murmur3 has better performance than MurmurHash2, no repetition flaw, comes in
32/64/128-bit versions for both x86 and x64 platforms, and the 128-bit x64
version is blazing fast - over 5 gigabytes per second on my 3 gigahertz Core 2._
Test Setup 1 with 100K data points:
Test Data Size: 100000, Prefix: mscId=100, Modulo: 56, MaxWordSize: 18
Data Set Sizes:
Random Words: 100000, — Unique Random Strings of size 18
RandomNumbers: 100000 — Unique Random Long between 9999 and MAX_LONG_VALUE
RandomWordsWithPrefix: 100000 — Unique Random String with prefix mscId=100 and
of size 18
Issue Like Data without and with prefix and of size 18:
Fixed Length Words: 100000, FixedLengthWord With FixedPrefix: 99942
=================================================================================
Running
org.apache.drill.exec.expr.fn.impl.TestMurmur3HashFunction#testHash32WithDrillBufRandomLong
=========================================================================
Results for test: testHash32WithDrillBufRandomLong
Distribution for Murmur32: 111
Distribution for Murmur64_MSB: 111
Distribution for Murmur64_LSB: 111
[Sorabh] - Above Distribution show that how many unique values were found when
modulo (in this case using 56) was performed on generated hash codes. So we can
see that here all the hashcodes were distributed over 111 values which is (56*2
- 1). So we can think of it as 111 buckets in hashtable were filled. Murmur32
is computing hash32 codes using hash32 function directly, Murmur64_MSB is
computing hash32 codes by taking MSB 64bits of 128bit hash code and then int
casting it, Murmur64_LSB is computing hash32 codes by taking LSB 64bits of
128bit hash code and then int casting it.
Stats for Murmur32 collision count: Min: 822, Median: 892, Max: 1813
Stats for Murmur64_MSB collision count: Min: 811, Median: 894, Max: 1796
Stats for Murmur64_LSB collision count: Min: 822, Median: 892, Max: 1784
[Sorabh] - Above Stats show that how many collision values were found for each
bucket. Min is minimum count of collision among all the buckets and median/max
is for same stats.
Time for Murmur32 - Avg Time: 43 and Max Time: 47000
Time for Murmur64_MSB - Avg Time: 40 and Max Time: 19000
Time for Murmur64_LSB - Avg Time: 41 and Max Time: 41000
[Sorabh] - This shows the performance of each hash functions for computing hash
on all the data. Unit is nanoseconds here.
Running
org.apache.drill.exec.expr.fn.impl.TestMurmur3HashFunction#testHash32WithDrillBufFixedPrefixString
=========================================================================
Results for test: testHash32WithDrillBufFixedPrefixString
Distribution for Murmur32: 111
Distribution for Murmur64_MSB: 111
Distribution for Murmur64_LSB: 111
Stats for Murmur32 collision count: Min: 813, Median: 894, Max: 1737
Stats for Murmur64_MSB collision count: Min: 808, Median: 890, Max: 1856
Stats for Murmur64_LSB collision count: Min: 790, Median: 894, Max: 1774
Time for Murmur32 - Avg Time: 92 and Max Time: 9000
Time for Murmur64_MSB - Avg Time: 93 and Max Time: 16000
Time for Murmur64_LSB - Avg Time: 99 and Max Time: 231000
Running
org.apache.drill.exec.expr.fn.impl.TestMurmur3HashFunction#testHash32WithDrillBufRandomString
=========================================================================
Results for test: testHash32WithDrillBufRandomString
Distribution for Murmur32: 111
Distribution for Murmur64_MSB: 111
Distribution for Murmur64_LSB: 111
Stats for Murmur32 collision count: Min: 838, Median: 893, Max: 1798
Stats for Murmur64_MSB collision count: Min: 816, Median: 891, Max: 1878
Stats for Murmur64_LSB collision count: Min: 824, Median: 890, Max: 1799
Time for Murmur32 - Avg Time: 61 and Max Time: 49000
Time for Murmur64_MSB - Avg Time: 57 and Max Time: 49000
Time for Murmur64_LSB - Avg Time: 55 and Max Time: 14000
Running
org.apache.drill.exec.expr.fn.impl.TestMurmur3HashFunction#testHash32WithDrillBufIssueData
— Used the issue data shared in JIRA
=========================================================================
Results for test: testHash32WithDrillBufIssueData
Distribution for Murmur32: 18
Distribution for Murmur64_MSB: 1
Distribution for Murmur64_LSB: 19
Stats for Murmur32 collision count: Min: 1, Median: 1, Max: 2
Stats for Murmur64_MSB collision count: Min: 20, Median: 20, Max: 20
Stats for Murmur64_LSB collision count: Min: 1, Median: 1, Max: 2
Time for Murmur32 - Avg Time: 150 and Max Time: 1000
Time for Murmur64_MSB - Avg Time: 150 and Max Time: 3000
Time for Murmur64_LSB - Avg Time: 250 and Max Time: 1000
Running
org.apache.drill.exec.expr.fn.impl.TestMurmur3HashFunction#testHash32WithDrillBufFixedLengthStringFixedPrefix
=========================================================================
Results for test: testHash32WithDrillBufFixedLengthStringFixedPrefix
Distribution for Murmur32: 111
Distribution for Murmur64_MSB: 111
Distribution for Murmur64_LSB: 111
Stats for Murmur32 collision count: Min: 826, Median: 894, Max: 1740
Stats for Murmur64_MSB collision count: Min: 809, Median: 894, Max: 1795
Stats for Murmur64_LSB collision count: Min: 828, Median: 895, Max: 1756
Time for Murmur32 - Avg Time: 61 and Max Time: 119000
Time for Murmur64_MSB - Avg Time: 51 and Max Time: 3000
Time for Murmur64_LSB - Avg Time: 57 and Max Time: 68000
Running
org.apache.drill.exec.expr.fn.impl.TestMurmur3HashFunction#testHash32WithDrillBufFixedLengthString
=========================================================================
Results for test: testHash32WithDrillBufFixedLengthString
Distribution for Murmur32: 111
Distribution for Murmur64_MSB: 111
Distribution for Murmur64_LSB: 111
Stats for Murmur32 collision count: Min: 813, Median: 897, Max: 1723
Stats for Murmur64_MSB collision count: Min: 814, Median: 893, Max: 1745
Stats for Murmur64_LSB collision count: Min: 803, Median: 894, Max: 1764
Time for Murmur32 - Avg Time: 63 and Max Time: 152000
Time for Murmur64_MSB - Avg Time: 53 and Max Time: 11000
Time for Murmur64_LSB - Avg Time: 104 and Max Time: 1306000
> Hash function produces skewed results on String values with same leading
> prefix
> -------------------------------------------------------------------------------
>
> Key: DRILL-5816
> URL: https://issues.apache.org/jira/browse/DRILL-5816
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Sorabh Hamirwasia
> Assignee: Sorabh Hamirwasia
> Fix For: 1.12.0
>
>
> Reported by [~amansinha100]
> Hashing of string values (for the hash exchange) could produce substantial
> skew for certain types of strings that have the same leading prefix.
> Here's the sample data: (note all strings begin with 'mscId=' followed by
> numeric values)
> 0: jdbc:drill:drillbit=10.10.103.111> select a from dfs.tmp.vv3 limit 20;
> +---------------------+
> | a |
> +---------------------+
> | mscId=100139170495 |
> | mscId=100103806655 |
> | mscId=100229137840 |
> | mscId=100362859440 |
> | mscId=100032583600 |
> | mscId=100125021360 |
> | mscId=100243775920 |
> | mscId=100152820405 |
> | mscId=100084724405 |
> | mscId=100297398970 |
> | mscId=100059560890 |
> | mscId=100106108090 |
> | mscId=100032092090 |
> | mscId=100029460410 |
> | mscId=100110390995 |
> | mscId=100019105235 |
> | mscId=100354644435 |
> | mscId=100288523475 |
> | mscId=100214507475 |
> | mscId=100296418515 |
> +---------------------+
> 20 rows selected (0.33 seconds)
> Here's the hash values using the hash function that Drill uses for the
> HashToRandomExchange (note that they are all even numbers):
> 0: jdbc:drill:drillbit=10.10.103.111> select hash32AsDouble(a, 1301011) from
> dfs.tmp.vv3 limit 20;
> +--------------+
> | EXPR$0 |
> +--------------+
> | 1180062632 |
> | -1322734784 |
> | 2096701320 |
> | 2075007536 |
> | -1970336592 |
> | 1614574192 |
> | 1592743936 |
> | -1053691072 |
> | -689805200 |
> | 1893061072 |
> | 1660328376 |
> | 1852126136 |
> | 1927731344 |
> | 616840056 |
> | -1997249184 |
> | 1588717872 |
> | 193019624 |
> | 880839008 |
> | 1879415496 |
> | 1726850216 |
> +--------------+
> 20 rows selected (0.311 seconds)
> Doing a mod 56 only produces 1 distinct value, which indicates the skew:
> 0: jdbc:drill:drillbit=10.10.103.111> select distinct mod(hash32AsDouble(a,
> 1301011), 56) from dfs.tmp.vv3 limit 20;
> +---------+
> | EXPR$0 |
> +---------+
> | 0 |
> +---------+
> 1 row selected (1.041 seconds)
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)