On Fri, Aug 25, 2023 at 6:34 PM Thomas Dullien
<thomas.dull...@elastic.co.invalid> wrote:
> apologies if the chart is incorrect.

The chart isn't necessarily incorrect, but it probably isn't the
most relevant statistic here. "Lies, damn lies, and statistics" ;-)
The average length of unique English words is not the same as the average
word length in an English corpus.
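
To make the distinction concrete, here is a small sketch on a made-up toy corpus (the sentence is invented for illustration, not taken from the chart's data). Short, frequent words pull the corpus-weighted average below the average over distinct words:

```python
from collections import Counter

# Toy corpus, made up for illustration: short function words repeat often.
corpus = ("the cat sat on the mat and the dog sat on the "
          "extraordinarily comfortable sofa").split()

counts = Counter(corpus)

# Average over distinct words: every word counts once.
avg_unique = sum(len(w) for w in counts) / len(counts)

# Average over the corpus: every word is weighted by how often it occurs.
avg_corpus = sum(len(w) * n for w, n in counts.items()) / sum(counts.values())

print(f"unique words: {avg_unique:.2f}, corpus: {avg_corpus:.2f}")
```

A hash function used for indexing sees the second distribution, not the first.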

> Now, the test vectors in that pastebin do not match either the output of
> pre-change Lucene's murmur3, nor the output of the Python mmh3 package.
> That said, the pre-change Lucene and the mmh3 package agree, just not with
> the published list.

Interesting. FWIW, I created my own test vectors by running the original
murmur code and hashing the hash of a string (to make sure some high bits
were set) and also testing different offsets to make sure chunking didn't
change the hash values:
https://github.com/yonik/java_util/blob/master/test/util/hash/TestMurmurHash3.java
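
Roughly, the idea looks like the sketch below. This is not the code from that test file: it is a pure-Python MurmurHash3 x86 32-bit written out for illustration, and `hash_at_offsets`, the padding bytes, and the sample string are my own assumptions. It hashes a string, uses the resulting hash bytes as the next input (so bytes with the high bit set are likely to appear), and checks that the result does not depend on where the data sits in the buffer:

```python
def rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data, offset, length, seed=0):
    """MurmurHash3 x86 32-bit over data[offset:offset+length]."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    end = offset + (length & ~3)  # end of the last full 4-byte block

    # Body: process 4-byte little-endian blocks.
    for i in range(offset, end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: the remaining 0-3 bytes, also little-endian.
    tail = data[end:offset + length]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization mix.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def hash_at_offsets(payload, max_offset=4, seed=0):
    """Hash the same payload placed at different offsets in a padded buffer."""
    results = []
    for off in range(max_offset):
        buf = b"\xAA" * off + payload + b"\x55" * 3  # junk before and after
        results.append(murmur3_32(buf, off, len(payload), seed))
    return results


# Hash a string, then use the hash bytes themselves as part of the next input.
first = murmur3_32(b"hello world", 0, 11)
payload = first.to_bytes(4, "little") + b"hello world"

# Every prefix length exercises a different tail size (0-3 leftover bytes).
for n in range(len(payload) + 1):
    assert len(set(hash_at_offsets(payload[:n]))) == 1
```

Python bytes are unsigned, so this sketch cannot reproduce a sign-extension bug; in a Java port, the high-bit inputs are where such a bug would show up.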

Anyway, we shouldn't let improvements be lost in the noise... if only some
benchmarks show improvement (with the others unaffected), then it seems
like it would be a good change. Occasionally this requires adding a new
benchmark that has different bottlenecks (and hence can highlight the
changes).

-Yonik
