[ https://issues.apache.org/jira/browse/DRILL-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15084589#comment-15084589 ]
Jacques Nadeau commented on DRILL-4237:
---------------------------------------

Just to clarify, I'm in support of moving back to murmur, but I think we'll need to re-implement the functions, as I believe the old functions had a number of other implementation issues.

> Skew in hash distribution
> -------------------------
>
>                 Key: DRILL-4237
>                 URL: https://issues.apache.org/jira/browse/DRILL-4237
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.4.0
>            Reporter: Aman Sinha
>
> Apparently, the fix in DRILL-4119 did not fully resolve the data skew issue.
> It worked fine on the smaller sample of the data set, but on another sample of
> the same data set it still produces skewed values - see below the hash
> values, which are all odd numbers.
> {noformat}
> 0: jdbc:drill:zk=local> select columns[0], hash32(columns[0]) from `test.csv` limit 10;
> +-----------------------------------+--------------+
> | EXPR$0                            | EXPR$1       |
> +-----------------------------------+--------------+
> | f71aaddec3316ae18d43cb1467e88a41  | 1506011089   |
> | 3f3a13bb45618542b5ac9d9536704d3a  | 1105719049   |
> | 6935afd0c693c67bba482cedb7a2919b  | -18137557    |
> | ca2a938d6d7e57bda40501578f98c2a8  | -1372666789  |
> | fab7f08402c8836563b0a5c94dbf0aec  | -1930778239  |
> | 9eb4620dcb68a84d17209da279236431  | -970026001   |
> | 16eed4a4e801b98550b4ff504242961e  | 356133757    |
> | a46f7935fea578ce61d8dd45bfbc2b3d  | -94010449    |
> | 7fdf5344536080c15deb2b5a2975a2b7  | -141361507   |
> | b82560a06e2e51b461c9fe134a8211bd  | -375376717   |
> +-----------------------------------+--------------+
> {noformat}
> This indicates an underlying issue with the XXHash64 Java implementation,
> which is Drill's implementation of the C version. One of the key differences,
> as pointed out by [~jnadeau], was the use of unsigned int64 in the C version
> compared to the Java version, which uses (signed) long. I created an XXHash
> version using com.google.common.primitives.UnsignedLong. However,
> UnsignedLong does not have the bit-wise operations that are needed for XXHash,
> such as rotateLeft(), XOR, etc. One could write wrappers for these, but at
> this point the question is: should we think of an alternative hash function?
> The alternative approach could be the murmur hash for numeric data types that
> we were using earlier and the Mahout version of the hash function for string
> types
> (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/HashHelper.java#L28).
> As a test, I reverted to this function and was getting good hash
> distribution for the test data.
> I could not find any performance comparisons of our perf tests (TPC-H or DS)
> with the original and newer (XXHash) hash functions. If performance is
> comparable, should we revert to the original function?
> As an aside, I would like to remove the hash64 versions of the functions
> since these are not used anywhere.
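
To illustrate why all-odd hash values amount to skew, the sketch below takes the ten hash32 values reported in the issue and assigns them to buckets with a plain modulo. The modulo-based bucketing, the bucket count of 4, and the class and method names are assumptions made for illustration only; this is not Drill's actual partitioning code.

```java
import java.util.Arrays;

// Illustrative sketch only: names and the bucketing scheme are hypothetical,
// not taken from Drill.
class HashSkewCheck {
    // The ten hash32 values reported in the issue description.
    static final int[] REPORTED = {
        1506011089, 1105719049, -18137557, -1372666789, -1930778239,
        -970026001, 356133757, -94010449, -141361507, -375376717
    };

    // Returns true if every hash value has its lowest bit set.
    static boolean allOdd(int[] hashes) {
        for (int h : hashes) {
            if ((h & 1) == 0) {
                return false;
            }
        }
        return true;
    }

    // Counts how many hashes fall into each of n buckets.
    // Math.floorMod keeps the bucket index non-negative for negative hashes.
    static int[] bucketCounts(int[] hashes, int n) {
        int[] counts = new int[n];
        for (int h : hashes) {
            counts[Math.floorMod(h, n)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(allOdd(REPORTED));                            // prints "true"
        System.out.println(Arrays.toString(bucketCounts(REPORTED, 4)));  // prints "[0, 5, 0, 5]"
    }
}
```

Under these assumptions, an odd hash taken modulo an even bucket count is always odd, so the even-numbered buckets never receive any rows and at most half of the buckets are used.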