Hey, ok, I've done some digging: Unfortunately, MurmurHash3 does not publish official test vectors, see the following URLs: https://github.com/aappleby/smhasher/issues/6 https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958 There is a link to a pastebin entry in the first issue, which leads to https://pastebin.com/kkggV9Vx
Now, the test vectors in that pastebin do not match either the output of pre-change Lucene's murmur3, nor the output of the Python mmh3 package. That said, the pre-change Lucene and the mmh3 package agree, just not with the published list. There *are* test vectors in the source code for the mmh3 python package, which I could use, or cook up a set of bespoke ones, or both (I share the concern about 8-byte boundaries and signedness). https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75 Cheers, Thomas On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <rcm...@gmail.com> wrote: > i dont think we need a ton of random strings. But if you want to > optimize for strings of length 8, at a minimum there should be very > simple tests ensuring correctness for some boundary conditions (e.g. > string of length exactly 8). i would also strongly recommend testing > non-ascii since java is a language with signed integer types so it may > be susceptible to bugs where the input bytes have the "sign bit" set. > > IMO this could be 2 simple unit tests. > > usually at least with these kinds of algorithms you can also find > published "test vectors" that intend to seek out the corner cases. if > these exist for murmurhash, we should fold them in too. > > On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien > <thomas.dull...@elastic.co> wrote: > > > > Hey, > > > > I offered to run a large number of random-string-hashes to ensure that > the output is the same pre- and post-change. I can add an arbitrary number > of such tests to TestStringHelper.java, just specify the number you wish. > > > > If your worry is that my change breaches the inlining bytecode limit: > Did you check whether the old version was inlineable or not? The new > version is 263 bytecode instructions, the old version was 110. The default > inlining limit appears to be 35 bytecode instructions on cursory checking > (I may be wrong on this, though), so I don't think it was ever inlineable > in default configs. > > > > On your statement "we haven't seen performance gains" -- the starting > point of this thread was a friendly request to please point me to > instructions for running a broad range of Lucene indexing benchmarks, so I > can gather data for further discussion; from my perspective, we haven't > even gathered any data, so obviously we haven't seen any gains. > > > > Cheers, > > Thomas > > > > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <rcm...@gmail.com> wrote: > >> > >> There is literally one string, all-ascii. This won't fail if all the > >> shifts and masks are wrong. > >> > >> About the inlining, i'm not talking about cpu stuff, i'm talking about > >> java. There are limits to the size of methods that get inlined (e.g. > >> -XX:MaxInlineSize). If we make this method enormous like this, it may > >> have performance consequences. > >> > >> We still haven't seen any performance gain from this. Elasticsearch > >> putting huge unique IDs into indexed terms doesnt count. > >> > >> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien > >> <thomas.dull...@elastic.co> wrote: > >> > > >> > Hey, > >> > > >> > so there are unit tests in TestStringHelper.java that test strings of > length greater than 8, and my change passes them. Could you explain what > you want tested? > >> > > >> > Cheers, > >> > Thomas > >> > > >> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <rcm...@gmail.com> wrote: > >> >> > >> >> sure, but "if length > 8 return 1" might pass these same tests too, > >> >> yet cause a ton of hash collisions. > >> >> > >> >> I just think if you want to optimize for super-long strings, there > >> >> should be a unit test. > >> >> > >> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien > >> >> <thomas.dull...@elastic.co> wrote: > >> >> > > >> >> > Hey, > >> >> > > >> >> > I am pretty confident about correctness. The change passes both > Lucene and ES regression tests and my careful reading of the code is pretty > certain that the output is the same. If you want me to randomly test the > result for a few hundred million random strings, I'm happy to do that, too, > if you have other suggestions for correctness testing, let me know. > >> >> > > >> >> > The change does increase the method size and may impact inlining - > but so does literally any code change, particularly in a JIT'ed environment > where placement of code (and hence things like instruction cache conflicts) > depend on the precise history of execution. The way I understand it, one > deals with this by benchmarking and measuring. > >> >> > > >> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable > improvement in indexing speed - this is why I was asking about a broad > range of Lucene benchmarks; to verify that this is indeed the case for > Lucene-only, too. > >> >> > > >> >> > Let me know what data you'd like to see to decide whether this > patch is a good idea, and if there is consensus among the Lucene committers > that those are reasonable criteria, I'll work on producing that data. > >> >> > > >> >> > Cheers, > >> >> > Thomas > >> >> > > >> >> > > >> >> > > >> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <rcm...@gmail.com> > wrote: > >> >> >> > >> >> >> well there is some cost, as it must add additional checks to see > if > >> >> >> its longer than 8. in your patch, additional loops. it increases > the > >> >> >> method size and may impact inlining and other things. also we > can't > >> >> >> forget about correctness, if the hash function does the wrong > thing it > >> >> >> could slow everything to a crawl. > >> >> >> > >> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien > >> >> >> <thomas.dull...@elastic.co> wrote: > >> >> >> > > >> >> >> > Ah, I see what you mean. > >> >> >> > > >> >> >> > You are correct -- the change will not speed up a 5-byte word, > but it *will* speed up all 8+-byte words, at no cost to the shorter words. > >> >> >> > > >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <rcm...@gmail.com> > wrote: > >> >> >> >> > >> >> >> >> if a word is of length 5, processing 8 bytes at a time isn't > going to > >> >> >> >> speed anything up. there aren't 8 bytes to process. > >> >> >> >> > >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien > >> >> >> >> <thomas.dull...@elastic.co.invalid> wrote: > >> >> >> >> > > >> >> >> >> > Is average word length <= 4 realistic though? I mean, even > the english wiki corpus has ~5, which would require two calls to the lucene > layer instead of one; e.g. multiple layers of virtual dispatch that are > unnecessary? > >> >> >> >> > > >> >> >> >> > You're not going to pay any cycles for reading 8 bytes > instead of 4 bytes, so the cost of doing so will be the same - while > speeding up in cases where 4 isn't quite enough? > >> >> >> >> > > >> >> >> >> > Cheers, > >> >> >> >> > Thomas > >> >> >> >> > > >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir < > rcm...@gmail.com> wrote: > >> >> >> >> >> > >> >> >> >> >> i think from my perspective it has nothing to do with cpus > being > >> >> >> >> >> 32-bit or 64-bit and more to do with the average length of > terms in > >> >> >> >> >> most languages being smaller than 8. for the languages with > longer > >> >> >> >> >> word length, its usually because of complex morphology that > most users > >> >> >> >> >> would stem away. so doing 4 bytes at a time seems optimal > IMO. > >> >> >> >> >> languages from nature don't care about your cpu. > >> >> >> >> >> > >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless > >> >> >> >> >> <luc...@mikemccandless.com> wrote: > >> >> >> >> >> > > >> >> >> >> >> > For a truly "pure" indexing test I usually use a single > thread for indexing, and SerialMergeScheduler (using that single thread to > also do single-threaded merging). It makes the indexing take forever lol > but it produces "comparable" results. > >> >> >> >> >> > > >> >> >> >> >> > But ... this sounds like a great change anyway? Do we > really need to gate it on benchmark results? Do we think there could be a > downside e.g. slower indexing on (the dwindling) 32 bit CPUs? > >> >> >> >> >> > > >> >> >> >> >> > Mike McCandless > >> >> >> >> >> > > >> >> >> >> >> > http://blog.mikemccandless.com > >> >> >> >> >> > > >> >> >> >> >> > > >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir < > rcm...@gmail.com> wrote: > >> >> >> >> >> >> > >> >> >> >> >> >> I think the results of the benchmark will depend on the > properties of > >> >> >> >> >> >> the indexed terms. For english wikipedia (luceneutil) > the average word > >> >> >> >> >> >> length is around 5 bytes so this optimization may not do > much. > >> >> >> >> >> >> > >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai < > zhai7...@gmail.com> wrote: > >> >> >> >> >> >> > > >> >> >> >> >> >> > I did a quick run with your patch, but since I turned > on the CMS as well as TieredMergePolicy I'm not sure how fair the > comparison is. Here's the result: > >> >> >> >> >> >> > Candidate: > >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total 33332620 > docs > >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec) > >> >> >> >> >> >> > Indexer: finished (961877 msec) > >> >> >> >> >> >> > Baseline: > >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total 33332620 > docs > >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec) > >> >> >> >> >> >> > Indexer: finished (964528 msec) > >> >> >> >> >> >> > > >> >> >> >> >> >> > For more accurate comparison I guess it's better to > use LogxxMergePolicy and turn off CMS? If you want to run it yourself you > can find the lines I quoted from the log file. > >> >> >> >> >> >> > > >> >> >> >> >> >> > Patrick > >> >> >> >> >> >> > > >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien < > thomas.dull...@elastic.co.invalid> wrote: > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> Hey all, > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> I've been experimenting with fixing some low-hanging > performance fruit in the ElasticSearch codebase, and came across the fact > that the MurmurHash implementation that is used by ByteRef.hashCode() is > reading 4 bytes per loop iteration (which is likely an artifact from 32-bit > architectures, which are ever-less-important). I made a small fix to change > the implementation to read 8 bytes per loop iteration; I expected a very > small impact (2-3% CPU or so over an indexing run in ElasticSearch), but > got a pretty nontrivial throughput improvement over a few indexing > benchmarks. > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and succeeded > in running the example from https://github.com/mikemccand/luceneutil - > but I couldn't figure out how to run indexing benchmarks and how to > interpret the results. > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> Could someone help me in running the benchmarks for > the attached patch? > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> Cheers, > >> >> >> >> >> >> >> Thomas > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> > --------------------------------------------------------------------- > >> >> >> >> >> >> >> To unsubscribe, e-mail: > dev-unsubscr...@lucene.apache.org > >> >> >> >> >> >> >> For additional commands, e-mail: > dev-h...@lucene.apache.org > >> >> >> >> >> >> > >> >> >> >> >> >> > --------------------------------------------------------------------- > >> >> >> >> >> >> To unsubscribe, e-mail: > dev-unsubscr...@lucene.apache.org > >> >> >> >> >> >> For additional commands, e-mail: > dev-h...@lucene.apache.org > >> >> >> >> >> >> > >> >> >> >> >> > >> >> >> >> >> > --------------------------------------------------------------------- > >> >> >> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > >> >> >> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org > >> >> >> >> >> >