I would recommend some non-English tests. Non-Latin scripts (CJK, Arabic, Hebrew) will produce longer byte strings because of UTF-8 encoding. German has long compound words.
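A quick illustration of that point, with sample terms chosen here rather than taken from the thread: for non-Latin scripts the UTF-8 byte string is two to three times longer than the character count, and every multi-byte sequence contains bytes with the sign bit set (negative as Java bytes), which is exactly the kind of input that exercises the hash's byte masking.

import java.nio.charset.StandardCharsets;

public class Utf8TermLengths {
  public static void main(String[] args) {
    String[] terms = {
      "search",                               // ASCII: 1 byte per char
      "Kraftfahrzeughaftpflichtversicherung", // long German compound, still ASCII
      "検索エンジン",                          // Japanese: 3 bytes per char in UTF-8
      "البحث",                                // Arabic: 2 bytes per letter
      "חיפוש"                                 // Hebrew: 2 bytes per letter
    };
    for (String term : terms) {
      byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
      boolean hasNegativeByte = false;
      for (byte b : utf8) {
        hasNegativeByte |= b < 0; // UTF-8 lead/continuation bytes have the sign bit set
      }
      System.out.printf("%-40s %2d chars, %2d UTF-8 bytes, sign-bit bytes present: %b%n",
          term, term.length(), utf8.length, hasNegativeByte);
    }
  }
}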
wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 25, 2023, at 10:57 AM, Thomas Dullien > <thomas.dull...@elastic.co.INVALID> wrote: > > Hey all, > > ok, attached is a second patch that adds some unit tests; I am happy to add > more. > > This brings me back to my original question: I'd like to run some pretty > thorough benchmarking on Lucene, both for this change and for possible other > future changes, largely focused on indexing performance. What are good > command lines to do so? What are good corpora? > > Cheers, > Thomas > > On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien <thomas.dull...@elastic.co > <mailto:thomas.dull...@elastic.co>> wrote: >> Hey, >> >> ok, I've done some digging: Unfortunately, MurmurHash3 does not publish >> official test vectors, see the following URLs: >> https://github.com/aappleby/smhasher/issues/6 >> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958 >> There is a link to a pastebin entry in the first issue, which leads to >> https://pastebin.com/kkggV9Vx >> >> Now, the test vectors in that pastebin do not match either the output of >> pre-change Lucene's murmur3, nor the output of the Python mmh3 package. That >> said, the pre-change Lucene and the mmh3 package agree, just not with the >> published list. >> >> There *are* test vectors in the source code for the mmh3 python package, >> which I could use, or cook up a set of bespoke ones, or both (I share the >> concern about 8-byte boundaries and signedness). >> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75 >> >> Cheers, >> Thomas >> >> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <rcm...@gmail.com >> <mailto:rcm...@gmail.com>> wrote: >>> i dont think we need a ton of random strings. But if you want to >>> optimize for strings of length 8, at a minimum there should be very >>> simple tests ensuring correctness for some boundary conditions (e.g. >>> string of length exactly 8). i would also strongly recommend testing >>> non-ascii since java is a language with signed integer types so it may >>> be susceptible to bugs where the input bytes have the "sign bit" set. >>> >>> IMO this could be 2 simple unit tests. >>> >>> usually at least with these kinds of algorithms you can also find >>> published "test vectors" that intend to seek out the corner cases. if >>> these exist for murmurhash, we should fold them in too. >>> >>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien >>> <thomas.dull...@elastic.co <mailto:thomas.dull...@elastic.co>> wrote: >>> > >>> > Hey, >>> > >>> > I offered to run a large number of random-string-hashes to ensure that >>> > the output is the same pre- and post-change. I can add an arbitrary >>> > number of such tests to TestStringHelper.java, just specify the number >>> > you wish. >>> > >>> > If your worry is that my change breaches the inlining bytecode limit: Did >>> > you check whether the old version was inlineable or not? The new version >>> > is 263 bytecode instructions, the old version was 110. The default >>> > inlining limit appears to be 35 bytecode instructions on cursory checking >>> > (I may be wrong on this, though), so I don't think it was ever inlineable >>> > in default configs. 
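One caveat on the inlining numbers: HotSpot's 35-bytecode -XX:MaxInlineSize threshold applies to call sites that are not hot; frequently executed methods are considered for inlining up to -XX:FreqInlineSize (325 bytecodes by default for C2), so both the 110-bytecode and the 263-bytecode versions can in principle still be inlined on a hot path. Rather than counting bytecodes, the JIT's actual decisions can be printed for whichever benchmark ends up being run (the main class below is only a placeholder):

  java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining <benchmark main class>

Each compiled call site is then listed with the reason it was or was not inlined, e.g. "inline (hot)" or "too large".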
>>> > >>> > On your statement "we haven't seen performance gains" -- the starting >>> > point of this thread was a friendly request to please point me to >>> > instructions for running a broad range of Lucene indexing benchmarks, so >>> > I can gather data for further discussion; from my perspective, we haven't >>> > even gathered any data, so obviously we haven't seen any gains. >>> > >>> > Cheers, >>> > Thomas >>> > >>> > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <rcm...@gmail.com >>> > <mailto:rcm...@gmail.com>> wrote: >>> >> >>> >> There is literally one string, all-ascii. This won't fail if all the >>> >> shifts and masks are wrong. >>> >> >>> >> About the inlining, i'm not talking about cpu stuff, i'm talking about >>> >> java. There are limits to the size of methods that get inlined (e.g. >>> >> -XX:MaxInlineSize). If we make this method enormous like this, it may >>> >> have performance consequences. >>> >> >>> >> We still haven't seen any performance gain from this. Elasticsearch >>> >> putting huge unique IDs into indexed terms doesnt count. >>> >> >>> >> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien >>> >> <thomas.dull...@elastic.co <mailto:thomas.dull...@elastic.co>> wrote: >>> >> > >>> >> > Hey, >>> >> > >>> >> > so there are unit tests in TestStringHelper.java that test strings of >>> >> > length greater than 8, and my change passes them. Could you explain >>> >> > what you want tested? >>> >> > >>> >> > Cheers, >>> >> > Thomas >>> >> > >>> >> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <rcm...@gmail.com >>> >> > <mailto:rcm...@gmail.com>> wrote: >>> >> >> >>> >> >> sure, but "if length > 8 return 1" might pass these same tests too, >>> >> >> yet cause a ton of hash collisions. >>> >> >> >>> >> >> I just think if you want to optimize for super-long strings, there >>> >> >> should be a unit test. >>> >> >> >>> >> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien >>> >> >> <thomas.dull...@elastic.co <mailto:thomas.dull...@elastic.co>> wrote: >>> >> >> > >>> >> >> > Hey, >>> >> >> > >>> >> >> > I am pretty confident about correctness. The change passes both >>> >> >> > Lucene and ES regression tests and my careful reading of the code >>> >> >> > is pretty certain that the output is the same. If you want me to >>> >> >> > randomly test the result for a few hundred million random strings, >>> >> >> > I'm happy to do that, too, if you have other suggestions for >>> >> >> > correctness testing, let me know. >>> >> >> > >>> >> >> > The change does increase the method size and may impact inlining - >>> >> >> > but so does literally any code change, particularly in a JIT'ed >>> >> >> > environment where placement of code (and hence things like >>> >> >> > instruction cache conflicts) depend on the precise history of >>> >> >> > execution. The way I understand it, one deals with this by >>> >> >> > benchmarking and measuring. >>> >> >> > >>> >> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable >>> >> >> > improvement in indexing speed - this is why I was asking about a >>> >> >> > broad range of Lucene benchmarks; to verify that this is indeed the >>> >> >> > case for Lucene-only, too. >>> >> >> > >>> >> >> > Let me know what data you'd like to see to decide whether this >>> >> >> > patch is a good idea, and if there is consensus among the Lucene >>> >> >> > committers that those are reasonable criteria, I'll work on >>> >> >> > producing that data. 
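On the correctness side, here is a sketch of the kind of standalone check being discussed (this is not the unit-test patch attached to the thread): it compares org.apache.lucene.util.StringHelper.murmurhash3_x86_32 against an independent, plain 4-byte-block reference implementation of MurmurHash3 x86_32, covering every length from 0 to 17 bytes (so the length-exactly-8 boundary), inputs whose bytes all have the sign bit set, multi-byte UTF-8, and a batch of random byte strings:

import java.nio.charset.StandardCharsets;
import java.util.Random;
import org.apache.lucene.util.StringHelper;

public class MurmurBoundaryCheck {

  // Reference MurmurHash3 x86_32 (Austin Appleby's algorithm), 4 bytes per block,
  // written directly from the published algorithm rather than copied from Lucene.
  static int referenceMurmur3(byte[] data, int offset, int len, int seed) {
    final int c1 = 0xcc9e2d51, c2 = 0x1b873593;
    int h1 = seed;
    final int roundedEnd = offset + (len & ~3); // end of the last full 4-byte block
    for (int i = offset; i < roundedEnd; i += 4) {
      int k1 = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
          | ((data[i + 2] & 0xff) << 16) | (data[i + 3] << 24);
      k1 *= c1; k1 = Integer.rotateLeft(k1, 15); k1 *= c2;
      h1 ^= k1; h1 = Integer.rotateLeft(h1, 13); h1 = h1 * 5 + 0xe6546b64;
    }
    int k1 = 0; // 0-3 byte tail, intentional switch fall-through
    switch (len & 3) {
      case 3: k1 ^= (data[roundedEnd + 2] & 0xff) << 16;
      case 2: k1 ^= (data[roundedEnd + 1] & 0xff) << 8;
      case 1: k1 ^= data[roundedEnd] & 0xff;
        k1 *= c1; k1 = Integer.rotateLeft(k1, 15); k1 *= c2;
        h1 ^= k1;
    }
    h1 ^= len; // finalization
    h1 ^= h1 >>> 16; h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13; h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }

  static void check(byte[] data, int seed) {
    int expected = referenceMurmur3(data, 0, data.length, seed);
    int actual = StringHelper.murmurhash3_x86_32(data, 0, data.length, seed);
    if (expected != actual) {
      throw new AssertionError("mismatch at length " + data.length + ", seed " + seed);
    }
  }

  public static void main(String[] args) {
    // Lengths 0..17 cover "just under", "exactly" and "just over" the 8-byte boundary.
    for (int len = 0; len <= 17; len++) {
      byte[] ascii = new byte[len];
      byte[] highBit = new byte[len];
      for (int i = 0; i < len; i++) {
        ascii[i] = (byte) ('a' + i);
        highBit[i] = (byte) (0x80 | i); // sign bit set in every byte
      }
      for (int seed : new int[] {0, 42, -1}) {
        check(ascii, seed);
        check(highBit, seed);
      }
    }
    // Multi-byte UTF-8 input and a batch of random byte strings.
    check("検索エンジン".getBytes(StandardCharsets.UTF_8), 0);
    Random random = new Random(12345);
    for (int iter = 0; iter < 100_000; iter++) {
      byte[] bytes = new byte[random.nextInt(64)];
      random.nextBytes(bytes);
      check(bytes, random.nextInt());
    }
    System.out.println("all comparisons against the reference implementation matched");
  }
}

Because the reference is independent, a wrong shift or mask in an optimized version shows up as a mismatch rather than passing by construction.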
>>> >> >> > >>> >> >> > Cheers, >>> >> >> > Thomas >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <rcm...@gmail.com >>> >> >> > <mailto:rcm...@gmail.com>> wrote: >>> >> >> >> >>> >> >> >> well there is some cost, as it must add additional checks to see if >>> >> >> >> its longer than 8. in your patch, additional loops. it increases >>> >> >> >> the >>> >> >> >> method size and may impact inlining and other things. also we can't >>> >> >> >> forget about correctness, if the hash function does the wrong >>> >> >> >> thing it >>> >> >> >> could slow everything to a crawl. >>> >> >> >> >>> >> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien >>> >> >> >> <thomas.dull...@elastic.co <mailto:thomas.dull...@elastic.co>> >>> >> >> >> wrote: >>> >> >> >> > >>> >> >> >> > Ah, I see what you mean. >>> >> >> >> > >>> >> >> >> > You are correct -- the change will not speed up a 5-byte word, >>> >> >> >> > but it *will* speed up all 8+-byte words, at no cost to the >>> >> >> >> > shorter words. >>> >> >> >> > >>> >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <rcm...@gmail.com >>> >> >> >> > <mailto:rcm...@gmail.com>> wrote: >>> >> >> >> >> >>> >> >> >> >> if a word is of length 5, processing 8 bytes at a time isn't >>> >> >> >> >> going to >>> >> >> >> >> speed anything up. there aren't 8 bytes to process. >>> >> >> >> >> >>> >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien >>> >> >> >> >> <thomas.dull...@elastic.co >>> >> >> >> >> <mailto:thomas.dull...@elastic.co>.invalid> wrote: >>> >> >> >> >> > >>> >> >> >> >> > Is average word length <= 4 realistic though? I mean, even >>> >> >> >> >> > the english wiki corpus has ~5, which would require two calls >>> >> >> >> >> > to the lucene layer instead of one; e.g. multiple layers of >>> >> >> >> >> > virtual dispatch that are unnecessary? >>> >> >> >> >> > >>> >> >> >> >> > You're not going to pay any cycles for reading 8 bytes >>> >> >> >> >> > instead of 4 bytes, so the cost of doing so will be the same >>> >> >> >> >> > - while speeding up in cases where 4 isn't quite enough? >>> >> >> >> >> > >>> >> >> >> >> > Cheers, >>> >> >> >> >> > Thomas >>> >> >> >> >> > >>> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <rcm...@gmail.com >>> >> >> >> >> > <mailto:rcm...@gmail.com>> wrote: >>> >> >> >> >> >> >>> >> >> >> >> >> i think from my perspective it has nothing to do with cpus >>> >> >> >> >> >> being >>> >> >> >> >> >> 32-bit or 64-bit and more to do with the average length of >>> >> >> >> >> >> terms in >>> >> >> >> >> >> most languages being smaller than 8. for the languages with >>> >> >> >> >> >> longer >>> >> >> >> >> >> word length, its usually because of complex morphology that >>> >> >> >> >> >> most users >>> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems optimal >>> >> >> >> >> >> IMO. >>> >> >> >> >> >> languages from nature don't care about your cpu. >>> >> >> >> >> >> >>> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless >>> >> >> >> >> >> <luc...@mikemccandless.com >>> >> >> >> >> >> <mailto:luc...@mikemccandless.com>> wrote: >>> >> >> >> >> >> > >>> >> >> >> >> >> > For a truly "pure" indexing test I usually use a single >>> >> >> >> >> >> > thread for indexing, and SerialMergeScheduler (using that >>> >> >> >> >> >> > single thread to also do single-threaded merging). It >>> >> >> >> >> >> > makes the indexing take forever lol but it produces >>> >> >> >> >> >> > "comparable" results. 
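For reference, a minimal sketch of that "pure indexing" setup outside of luceneutil (index path, corpus path, analyzer and RAM buffer size are placeholders): one indexing thread, SerialMergeScheduler so merges run on that same thread, and a log merge policy instead of the default TieredMergePolicy so baseline and candidate runs merge the same way.

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogDocMergePolicy;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.store.FSDirectory;

public class PureIndexingBench {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer())
        .setMergeScheduler(new SerialMergeScheduler()) // merges run on the indexing thread
        .setMergePolicy(new LogDocMergePolicy())       // deterministic instead of tiered
        .setRAMBufferSizeMB(512);
    long start = System.nanoTime();
    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/bench-index"));
        IndexWriter writer = new IndexWriter(dir, iwc);
        BufferedReader reader = Files.newBufferedReader(Paths.get("/tmp/corpus-lines.txt"))) {
      String line;
      while ((line = reader.readLine()) != null) { // one document per corpus line
        Document doc = new Document();
        doc.add(new TextField("body", line, Field.Store.NO));
        writer.addDocument(doc);
      }
      writer.forceMerge(1); // optional: fold the final merge cost into the measurement
    }
    System.out.println("indexing took " + (System.nanoTime() - start) / 1_000_000 + " msec");
  }
}

Wall-clock numbers from a run like this are directly comparable between a baseline and a patched lucene-core jar, at the price of much longer runs, as noted above.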
>>> >> >> >> >> >> > >>> >> >> >> >> >> > But ... this sounds like a great change anyway? Do we >>> >> >> >> >> >> > really need to gate it on benchmark results? Do we think >>> >> >> >> >> >> > there could be a downside e.g. slower indexing on (the >>> >> >> >> >> >> > dwindling) 32 bit CPUs? >>> >> >> >> >> >> > >>> >> >> >> >> >> > Mike McCandless >>> >> >> >> >> >> > >>> >> >> >> >> >> > http://blog.mikemccandless.com >>> >> >> >> >> >> > <http://blog.mikemccandless.com/> >>> >> >> >> >> >> > >>> >> >> >> >> >> > >>> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir >>> >> >> >> >> >> > <rcm...@gmail.com <mailto:rcm...@gmail.com>> wrote: >>> >> >> >> >> >> >> >>> >> >> >> >> >> >> I think the results of the benchmark will depend on the >>> >> >> >> >> >> >> properties of >>> >> >> >> >> >> >> the indexed terms. For english wikipedia (luceneutil) the >>> >> >> >> >> >> >> average word >>> >> >> >> >> >> >> length is around 5 bytes so this optimization may not do >>> >> >> >> >> >> >> much. >>> >> >> >> >> >> >> >>> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai >>> >> >> >> >> >> >> <zhai7...@gmail.com <mailto:zhai7...@gmail.com>> wrote: >>> >> >> >> >> >> >> > >>> >> >> >> >> >> >> > I did a quick run with your patch, but since I turned >>> >> >> >> >> >> >> > on the CMS as well as TieredMergePolicy I'm not sure >>> >> >> >> >> >> >> > how fair the comparison is. Here's the result: >>> >> >> >> >> >> >> > Candidate: >>> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total 33332620 >>> >> >> >> >> >> >> > docs >>> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec) >>> >> >> >> >> >> >> > Indexer: finished (961877 msec) >>> >> >> >> >> >> >> > Baseline: >>> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total 33332620 >>> >> >> >> >> >> >> > docs >>> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec) >>> >> >> >> >> >> >> > Indexer: finished (964528 msec) >>> >> >> >> >> >> >> > >>> >> >> >> >> >> >> > For more accurate comparison I guess it's better to use >>> >> >> >> >> >> >> > LogxxMergePolicy and turn off CMS? If you want to run >>> >> >> >> >> >> >> > it yourself you can find the lines I quoted from the >>> >> >> >> >> >> >> > log file. >>> >> >> >> >> >> >> > >>> >> >> >> >> >> >> > Patrick >>> >> >> >> >> >> >> > >>> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien >>> >> >> >> >> >> >> > <thomas.dull...@elastic.co >>> >> >> >> >> >> >> > <mailto:thomas.dull...@elastic.co>.invalid> wrote: >>> >> >> >> >> >> >> >> >>> >> >> >> >> >> >> >> Hey all, >>> >> >> >> >> >> >> >> >>> >> >> >> >> >> >> >> I've been experimenting with fixing some low-hanging >>> >> >> >> >> >> >> >> performance fruit in the ElasticSearch codebase, and >>> >> >> >> >> >> >> >> came across the fact that the MurmurHash >>> >> >> >> >> >> >> >> implementation that is used by ByteRef.hashCode() is >>> >> >> >> >> >> >> >> reading 4 bytes per loop iteration (which is likely an >>> >> >> >> >> >> >> >> artifact from 32-bit architectures, which are >>> >> >> >> >> >> >> >> ever-less-important). I made a small fix to change the >>> >> >> >> >> >> >> >> implementation to read 8 bytes per loop iteration; I >>> >> >> >> >> >> >> >> expected a very small impact (2-3% CPU or so over an >>> >> >> >> >> >> >> >> indexing run in ElasticSearch), but got a pretty >>> >> >> >> >> >> >> >> nontrivial throughput improvement over a few indexing >>> >> >> >> >> >> >> >> benchmarks. 
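The attached patch itself is not reproduced in the thread; the idea, sketched below with made-up class and method names, is to keep MurmurHash3's 4-byte block structure exactly as-is but fetch two blocks per iteration with a single little-endian 8-byte read. The low 32 bits of that long are the first block and the high 32 bits the second, so the mixing order, and therefore the hash, is unchanged; the main method checks that the two body loops agree on random inputs.

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.util.Random;

public class WideReadSketch {
  private static final VarHandle INT_LE =
      MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
  private static final VarHandle LONG_LE =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  // One round of the MurmurHash3 x86_32 block mix.
  static int mix(int h1, int k1) {
    k1 *= 0xcc9e2d51;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;
    h1 ^= k1;
    h1 = Integer.rotateLeft(h1, 13);
    return h1 * 5 + 0xe6546b64;
  }

  // Baseline body loop: one 4-byte block per iteration (the 1-3 byte tail and the
  // finalization of the full hash are unaffected by this change and omitted here).
  static int body4(byte[] data, int numBlocks, int h1) {
    for (int i = 0; i < numBlocks * 4; i += 4) {
      h1 = mix(h1, (int) INT_LE.get(data, i));
    }
    return h1;
  }

  // Widened body loop: one 8-byte read per iteration, still two 4-byte blocks in order.
  static int body8(byte[] data, int numBlocks, int h1) {
    final int end = numBlocks * 4;
    int i = 0;
    for (; i + 8 <= end; i += 8) {
      long v = (long) LONG_LE.get(data, i);
      h1 = mix(h1, (int) v);          // bytes i .. i+3 (low half)
      h1 = mix(h1, (int) (v >>> 32)); // bytes i+4 .. i+7 (high half)
    }
    if (i < end) {                    // odd trailing 4-byte block
      h1 = mix(h1, (int) INT_LE.get(data, i));
    }
    return h1;
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    for (int iter = 0; iter < 100_000; iter++) {
      int numBlocks = random.nextInt(16);
      byte[] data = new byte[numBlocks * 4];
      random.nextBytes(data);
      if (body4(data, numBlocks, 0) != body8(data, numBlocks, 0)) {
        throw new AssertionError("body loops disagree for " + numBlocks + " blocks");
      }
    }
    System.out.println("4-byte and 8-byte body loops agree");
  }
}

Whether the wider read actually pays off then comes down to the term-length distribution of the corpus, which is exactly the average-word-length question raised earlier in the thread.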
>>> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and succeeded in running the example from https://github.com/mikemccand/luceneutil - but I couldn't figure out how to run indexing benchmarks and how to interpret the results. >>> >> >> >> >> >> >> >> Could someone help me in running the benchmarks for the attached patch? >>> >> >> >> >> >> >> >> Cheers, >>> >> >> >> >> >> >> >> Thomas > <murmur-tests.patch> > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org