On Fri, Aug 25, 2023 at 6:34 PM Thomas Dullien <thomas.dull...@elastic.co.invalid> wrote:
> apologies if the chart is incorrect.

The chart isn't necessarily incorrect, but it probably isn't the most relevant statistic here. "Lies, damn lies, and statistics" ;-)
The average length of unique English words is not the same as the average word length in an English corpus.

> Now, the test vectors in that pastebin do not match either the output of pre-change Lucene's murmur3, nor the output of the Python mmh3 package. That said, the pre-change Lucene and the mmh3 package agree, just not with the published list.

Interesting.
FWIW, I created my own test vectors by running the original murmur code and hashing the hash of a string (to make sure some high bits were set) and also testing different offsets to make sure chunking didn't change the hash values:
https://github.com/yonik/java_util/blob/master/test/util/hash/TestMurmurHash3.java

Anyway, we shouldn't let improvements be lost in the noise... if only some benchmarks show improvement (with others being indifferent), then it seems like it would be a good change. Occasionally this requires adding a new benchmark that has different bottlenecks (and hence can highlight the changes.)

-Yonik
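
P.S. To make the unique-words vs. corpus distinction concrete, here is a minimal Python sketch. The toy sentence and the function names are made up for illustration; they are not taken from the chart or from any Lucene benchmark.

```python
def avg_length_unique(words):
    """Mean length over distinct words: each word counts once."""
    unique = set(words)
    return sum(len(w) for w in unique) / len(unique)


def avg_length_corpus(words):
    """Mean length over all occurrences: frequent words weigh more."""
    return sum(len(w) for w in words) / len(words)


# Hypothetical toy corpus (an assumption for illustration, not real data).
corpus = (
    "the cat saw the dog and the bird near the extraordinary fountain"
).split()

if __name__ == "__main__":
    print("unique-word average:", avg_length_unique(corpus))
    print("corpus average:     ", avg_length_corpus(corpus))
```

Short words like "the" repeat often, so they pull the corpus average down, while each one counts only once in the unique-word average. For hashing benchmarks, the corpus-weighted figure is probably closer to what the hash function actually sees.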