Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Yonik Seeley
On Fri, Aug 25, 2023 at 6:34 PM Thomas Dullien
 wrote:
> apologies if the chart is incorrect.

The chart isn't necessarily incorrect, but it probably isn't the
most relevant statistic here. "Lies, damn lies, and statistics" ;-)
The average length of unique English words is not the same as the average
word length in an English corpus.

> Now, the test vectors in that pastebin do not match either the output of
pre-change Lucene's murmur3, nor the output of the Python mmh3 package.
That said, the pre-change Lucene and the mmh3 package agree, just not with
the published list.

Interesting. FWIW, I created my own test vectors by running the original
murmur code and hashing the hash of a string (to make sure some high bits
were set) and also testing different offsets to make sure chunking didn't
change the hash values:
https://github.com/yonik/java_util/blob/master/test/util/hash/TestMurmurHash3.java

Anyway, we shouldn't let improvements be lost in the noise... if only some
benchmarks show improvement (with others being indifferent), then it seems
like it would be a good change.  Occasionally this requires adding a new
benchmark that has different bottlenecks (and hence can highlight the
changes.)

-Yonik


Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Marcus Eagan
Thomas,

Also, is it possible to open this patch as a pull request in GitHub?

I guess it does not matter for a lot of the people here. It would make it
easier for more people to collaborate in that medium given the shift to
GitHub recently.

- Marcus



On Fri, Aug 25, 2023 at 7:03 PM Marcus Eagan  wrote:

> Hi Thomas,
>
> Thank you for the hard work thus far. I'm excited to see if the community
> can benefit from the work. The best way to use the lucene bench is to run
> the baseline and candidate branches as described here
> 
> .
>
> I can help you with it and even submit an update to the benchmark repo as
> needed if we find that we can improve some of the steps there to make it
> easier for onlookers. Have you already tried setting up lucene_util?
>
> - Marcus
>
>
>
> On Fri, Aug 25, 2023 at 6:34 PM Thomas Dullien
>  wrote:
>
>> Hey all,
>>
>> apologies if the chart is incorrect. Anyhow, I think the more important
>> questions are:
>>
>> 1) Which benchmarks does the Lucene community have that y'all would like
>> to see an improvement on before accepting this (or any other future)
>> performance patches?
>>
>> I'm guessing the reason why the patch improves http_log performance is
>> because that benchmark indexes many IP addresses, which tend to be 9-15
>> bytes in length. That does not strike me as an atypical workload.
>>
>> I've also done some quick experiments to estimate the average UTF-8 word
>> size of languages in non-ASCII scripts (for example Hindi), and it seems to
>> me the average word size is 2-3 times larger than english because most
>> indic characters will encode to 2-3 bytes. The following excerpt from Hindi
>> Wikipedia is 2242 bytes, but just 146 words
>>  begin hindi-example.txt 
>> १० वीं शताब्दी के बाद घुमन्तु मुस्लिम वंशों ने जातियता तथा धर्म द्वारा
>> संघठित तेज घोड़ों से युक्त बड़ी सेना के द्वारा उत्तर-पश्चिमी मैदानों पर बार
>> बार आकर्मण किया, अंततः १२०६ [[दिल्ली सल्तनत|इस्लामीक दिल्ली सल्तनत]] की
>> स्थापना हुई।{{sfn|Ludden|2002|p = 68}} उन्हें उतर भारत  को
>>  धिक नियंत्रित करना था तथा दक्षिण भारत पर आकर्मण करना था। भारतीय कुलीन
>> वर्ग के विघटनकारी सल्तनत ने बड़े पैमाने पर गैर-मुस्लिमों को स्वयं के
>> रीतिरिवाजों पर छोड़ दिया।{{sfn|Asher|Talbot|2008|p =
>> 47}}{{sfn|Metcalf|Metcalf|2006|p = 6}} १३ वीं शताब्दी में [[मंगोल साम्राज्
>> य|मंगोलों]] द्वारा किये के विनाशकारी आकर्मण से भारत की रक्षा की। सल्तनत
>> के पतन के कारण स्वशासित [[विजयनगर साम्राज्य|विजयनगर साम्राज्य]] का मार्ग
>> प्रशस्त हुआ।{{sfn|Asher|Talbot|2008|p = 53}} एक मजबूत [[शैव|शैव परंपरा]] और
>> सल्तनत की सैन्य तकनीक पर निर्माण करते हुए साम्रा
>> ज्य ने भारत के विशाल भाग पर शासन किया और इसके बाद लंबे समय तक दक्षिण
>> भारतीय समाज को प्रभावित किया।{{sfn|Metcalf|Metcalf|2006|p =
>> 12}}{{sfn|Asher|Talbot|2008|p = 53}}
>>  end hindi-example.txt 
>>
>> cat hindi-example.txt | wc -w
>> 146
>> 2242 divided by 146 yields a word length of ~15 bytes, so I'd be
>> surprised if average word length of Hindi wikipedia was below 12 bytes.
>>
>> Do y'all wish for me to create another benchmark for indexing indic
>> scripts and large corpora of IPv4 and IPv6 addresses (both things that seem
>> not very niche), and if the patch shows improvement there, will y'all
>> accept it?
>>
>> 2) It'd be great if there was some public documentation about "what we
>> want to see from a performance improvement before we accept it". I
>> personally find the discussion around this patch to be bewilderingly long,
>> and I am happy to help work on such a guideline with others -- IMO having a
>> clear set of hurdles is preferable to the back-and-forth we've had so far?
>>
>> Cheers,
>> Thomas
>>
>>
>>
>>
>>
>> On Fri, Aug 25, 2023 at 3:38 PM Robert Muir  wrote:
>>
>>> chart is wrong, average word length for english is like 5.
>>>
>>> On Fri, Aug 25, 2023 at 9:35 AM Thomas Dullien
>>>  wrote:
>>> >
>>> > Hey all,
>>> >
>>> > another data point: There's a diagram with the relevant distributions
>>> of word lengths in various languages here:
>>> >
>>> >
>>> https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/
>>> >
>>> > While English is close to the 8-byte limit, average word length in
>>> German is 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes.
>>> I'll gather some averages over the various Wikipedia indices.
>>> >
>>> > Cheers,
>>> > Thomas
>>> >
>>> > On Thu, Aug 24, 2023 at 2:09 PM Thomas Dullien <
>>> thomas.dull...@elastic.co> wrote:
>>> >>
>>> >> Hey there,
>>> >>
>>> >> reviving this thread. To clarify: In order to show this patch is
>>> worth doing, I should index a bunch of natural-language documents
>>> (whichever language that is) and show that the patch brings a performance
>>> benefit?
>>> >>
>>> >> (Just clarifying, because at least inside ElasticSearch for the logs
>>> use-case, it turns out that it does provide a performance benefit -- but I
>>> want to make sure I 

Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Marcus Eagan
Hi Thomas,

Thank you for the hard work thus far. I'm excited to see if the community
can benefit from the work. The best way to use the lucene bench is to run
the baseline and candidate branches as described here

.

I can help you with it and even submit an update to the benchmark repo as
needed if we find that we can improve some of the steps there to make it
easier for onlookers. Have you already tried setting up lucene_util?

- Marcus



On Fri, Aug 25, 2023 at 6:34 PM Thomas Dullien
 wrote:

> Hey all,
>
> apologies if the chart is incorrect. Anyhow, I think the more important
> questions are:
>
> 1) Which benchmarks does the Lucene community have that y'all would like
> to see an improvement on before accepting this (or any other future)
> performance patches?
>
> I'm guessing the reason why the patch improves http_log performance is
> because that benchmark indexes many IP addresses, which tend to be 9-15
> bytes in length. That does not strike me as an atypical workload.
>
> I've also done some quick experiments to estimate the average UTF-8 word
> size of languages in non-ASCII scripts (for example Hindi), and it seems to
> me the average word size is 2-3 times larger than english because most
> indic characters will encode to 2-3 bytes. The following excerpt from Hindi
> Wikipedia is 2242 bytes, but just 146 words
>  begin hindi-example.txt 
> १० वीं शताब्दी के बाद घुमन्तु मुस्लिम वंशों ने जातियता तथा धर्म द्वारा
> संघठित तेज घोड़ों से युक्त बड़ी सेना के द्वारा उत्तर-पश्चिमी मैदानों पर बार
> बार आकर्मण किया, अंततः १२०६ [[दिल्ली सल्तनत|इस्लामीक दिल्ली सल्तनत]] की
> स्थापना हुई।{{sfn|Ludden|2002|p = 68}} उन्हें उतर भारत  को
>  धिक नियंत्रित करना था तथा दक्षिण भारत पर आकर्मण करना था। भारतीय कुलीन
> वर्ग के विघटनकारी सल्तनत ने बड़े पैमाने पर गैर-मुस्लिमों को स्वयं के
> रीतिरिवाजों पर छोड़ दिया।{{sfn|Asher|Talbot|2008|p =
> 47}}{{sfn|Metcalf|Metcalf|2006|p = 6}} १३ वीं शताब्दी में [[मंगोल साम्राज्
> य|मंगोलों]] द्वारा किये के विनाशकारी आकर्मण से भारत की रक्षा की। सल्तनत के
> पतन के कारण स्वशासित [[विजयनगर साम्राज्य|विजयनगर साम्राज्य]] का मार्ग
> प्रशस्त हुआ।{{sfn|Asher|Talbot|2008|p = 53}} एक मजबूत [[शैव|शैव परंपरा]] और
> सल्तनत की सैन्य तकनीक पर निर्माण करते हुए साम्रा
> ज्य ने भारत के विशाल भाग पर शासन किया और इसके बाद लंबे समय तक दक्षिण
> भारतीय समाज को प्रभावित किया।{{sfn|Metcalf|Metcalf|2006|p =
> 12}}{{sfn|Asher|Talbot|2008|p = 53}}
>  end hindi-example.txt 
>
> cat hindi-example.txt | wc -w
> 146
> 2242 divided by 146 yields a word length of ~15 bytes, so I'd be surprised
> if average word length of Hindi wikipedia was below 12 bytes.
>
> Do y'all wish for me to create another benchmark for indexing indic
> scripts and large corpora of IPv4 and IPv6 addresses (both things that seem
> not very niche), and if the patch shows improvement there, will y'all
> accept it?
>
> 2) It'd be great if there was some public documentation about "what we
> want to see from a performance improvement before we accept it". I
> personally find the discussion around this patch to be bewilderingly long,
> and I am happy to help work on such a guideline with others -- IMO having a
> clear set of hurdles is preferable to the back-and-forth we've had so far?
>
> Cheers,
> Thomas
>
>
>
>
>
> On Fri, Aug 25, 2023 at 3:38 PM Robert Muir  wrote:
>
>> chart is wrong, average word length for english is like 5.
>>
>> On Fri, Aug 25, 2023 at 9:35 AM Thomas Dullien
>>  wrote:
>> >
>> > Hey all,
>> >
>> > another data point: There's a diagram with the relevant distributions
>> of word lengths in various languages here:
>> >
>> >
>> https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/
>> >
>> > While English is close to the 8-byte limit, average word length in
>> German is 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes.
>> I'll gather some averages over the various Wikipedia indices.
>> >
>> > Cheers,
>> > Thomas
>> >
>> > On Thu, Aug 24, 2023 at 2:09 PM Thomas Dullien <
>> thomas.dull...@elastic.co> wrote:
>> >>
>> >> Hey there,
>> >>
>> >> reviving this thread. To clarify: In order to show this patch is worth
>> doing, I should index a bunch of natural-language documents (whichever
>> language that is) and show that the patch brings a performance benefit?
>> >>
>> >> (Just clarifying, because at least inside ElasticSearch for the logs
>> use-case, it turns out that it does provide a performance benefit -- but I
>> want to make sure I understand what the Lucene community wishes to see as
>> "evidence" this is worth pursuing :-)
>> >>
>> >> Cheers,
>> >> Thomas
>> >>
>> >> On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood <
>> wun...@wunderwood.org> wrote:
>> >>>
>> >>> I would recommend some non-English tests. Non-Latin scripts (CJK,
>> Arabic, Hebrew) will have longer byte strings because of UTF8. German has
>> large compound words.
>> >>>
>> >>> wunder
>> >>> 

Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Thomas Dullien
Hey all,

apologies if the chart is incorrect. Anyhow, I think the more important
questions are:

1) Which benchmarks does the Lucene community have that y'all would like to
see an improvement on before accepting this (or any other future)
performance patches?

I'm guessing the reason why the patch improves http_log performance is
because that benchmark indexes many IP addresses, which tend to be 9-15
bytes in length. That does not strike me as an atypical workload.

I've also done some quick experiments to estimate the average UTF-8 word
size of languages in non-ASCII scripts (for example Hindi), and it seems to
me the average word size is 2-3 times larger than english because most
indic characters will encode to 2-3 bytes. The following excerpt from Hindi
Wikipedia is 2242 bytes, but just 146 words
 begin hindi-example.txt 
१० वीं शताब्दी के बाद घुमन्तु मुस्लिम वंशों ने जातियता तथा धर्म द्वारा
संघठित तेज घोड़ों से युक्त बड़ी सेना के द्वारा उत्तर-पश्चिमी मैदानों पर बार
बार आकर्मण किया, अंततः १२०६ [[दिल्ली सल्तनत|इस्लामीक दिल्ली सल्तनत]] की
स्थापना हुई।{{sfn|Ludden|2002|p = 68}} उन्हें उतर भारत  को
 धिक नियंत्रित करना था तथा दक्षिण भारत पर आकर्मण करना था। भारतीय कुलीन वर्ग
के विघटनकारी सल्तनत ने बड़े पैमाने पर गैर-मुस्लिमों को स्वयं के रीतिरिवाजों
पर छोड़ दिया।{{sfn|Asher|Talbot|2008|p = 47}}{{sfn|Metcalf|Metcalf|2006|p =
6}} १३ वीं शताब्दी में [[मंगोल साम्राज्
य|मंगोलों]] द्वारा किये के विनाशकारी आकर्मण से भारत की रक्षा की। सल्तनत के
पतन के कारण स्वशासित [[विजयनगर साम्राज्य|विजयनगर साम्राज्य]] का मार्ग
प्रशस्त हुआ।{{sfn|Asher|Talbot|2008|p = 53}} एक मजबूत [[शैव|शैव परंपरा]] और
सल्तनत की सैन्य तकनीक पर निर्माण करते हुए साम्रा
ज्य ने भारत के विशाल भाग पर शासन किया और इसके बाद लंबे समय तक दक्षिण भारतीय
समाज को प्रभावित किया।{{sfn|Metcalf|Metcalf|2006|p =
12}}{{sfn|Asher|Talbot|2008|p = 53}}
 end hindi-example.txt 

cat hindi-example.txt | wc -w
146
2242 divided by 146 yields a word length of ~15 bytes, so I'd be surprised
if average word length of Hindi wikipedia was below 12 bytes.

Do y'all wish for me to create another benchmark for indexing indic scripts
and large corpora of IPv4 and IPv6 addresses (both things that seem not
very niche), and if the patch shows improvement there, will y'all accept it?

2) It'd be great if there was some public documentation about "what we want
to see from a performance improvement before we accept it". I personally
find the discussion around this patch to be bewilderingly long, and I am
happy to help work on such a guideline with others -- IMO having a clear
set of hurdles is preferable to the back-and-forth we've had so far?

Cheers,
Thomas





On Fri, Aug 25, 2023 at 3:38 PM Robert Muir  wrote:

> chart is wrong, average word length for english is like 5.
>
> On Fri, Aug 25, 2023 at 9:35 AM Thomas Dullien
>  wrote:
> >
> > Hey all,
> >
> > another data point: There's a diagram with the relevant distributions of
> word lengths in various languages here:
> >
> >
> https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/
> >
> > While English is close to the 8-byte limit, average word length in
> German is 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes.
> I'll gather some averages over the various Wikipedia indices.
> >
> > Cheers,
> > Thomas
> >
> > On Thu, Aug 24, 2023 at 2:09 PM Thomas Dullien <
> thomas.dull...@elastic.co> wrote:
> >>
> >> Hey there,
> >>
> >> reviving this thread. To clarify: In order to show this patch is worth
> doing, I should index a bunch of natural-language documents (whichever
> language that is) and show that the patch brings a performance benefit?
> >>
> >> (Just clarifying, because at least inside ElasticSearch for the logs
> use-case, it turns out that it does provide a performance benefit -- but I
> want to make sure I understand what the Lucene community wishes to see as
> "evidence" this is worth pursuing :-)
> >>
> >> Cheers,
> >> Thomas
> >>
> >> On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood 
> wrote:
> >>>
> >>> I would recommend some non-English tests. Non-Latin scripts (CJK,
> Arabic, Hebrew) will have longer byte strings because of UTF8. German has
> large compound words.
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>> On Apr 25, 2023, at 10:57 AM, Thomas Dullien <
> thomas.dull...@elastic.co.INVALID> wrote:
> >>>
> >>> Hey all,
> >>>
> >>> ok, attached is a second patch that adds some unit tests; I am happy
> to add more.
> >>>
> >>> This brings me back to my original question: I'd like to run some
> pretty thorough benchmarking on Lucene, both for this change and for
> possible other future changes, largely focused on indexing performance.
> What are good command lines to do so? What are good corpora?
> >>>
> >>> Cheers,
> >>> Thomas
> >>>
> >>> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien <
> thomas.dull...@elastic.co> wrote:
> 
>  Hey,
> 
> 

Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Robert Muir
chart is wrong, average word length for english is like 5.

On Fri, Aug 25, 2023 at 9:35 AM Thomas Dullien
 wrote:
>
> Hey all,
>
> another data point: There's a diagram with the relevant distributions of word 
> lengths in various languages here:
>
> https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/
>
> While English is close to the 8-byte limit, average word length in German is 
> 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes. I'll gather 
> some averages over the various Wikipedia indices.
>
> Cheers,
> Thomas
>
> On Thu, Aug 24, 2023 at 2:09 PM Thomas Dullien  
> wrote:
>>
>> Hey there,
>>
>> reviving this thread. To clarify: In order to show this patch is worth 
>> doing, I should index a bunch of natural-language documents (whichever 
>> language that is) and show that the patch brings a performance benefit?
>>
>> (Just clarifying, because at least inside ElasticSearch for the logs 
>> use-case, it turns out that it does provide a performance benefit -- but I 
>> want to make sure I understand what the Lucene community wishes to see as 
>> "evidence" this is worth pursuing :-)
>>
>> Cheers,
>> Thomas
>>
>> On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood  
>> wrote:
>>>
>>> I would recommend some non-English tests. Non-Latin scripts (CJK, Arabic, 
>>> Hebrew) will have longer byte strings because of UTF8. German has large 
>>> compound words.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>> On Apr 25, 2023, at 10:57 AM, Thomas Dullien 
>>>  wrote:
>>>
>>> Hey all,
>>>
>>> ok, attached is a second patch that adds some unit tests; I am happy to add 
>>> more.
>>>
>>> This brings me back to my original question: I'd like to run some pretty 
>>> thorough benchmarking on Lucene, both for this change and for possible 
>>> other future changes, largely focused on indexing performance. What are 
>>> good command lines to do so? What are good corpora?
>>>
>>> Cheers,
>>> Thomas
>>>
>>> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien  
>>> wrote:

 Hey,

 ok, I've done some digging: Unfortunately, MurmurHash3 does not publish 
 official test vectors, see the following URLs:
 https://github.com/aappleby/smhasher/issues/6
 https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
 There is a link to a pastebin entry in the first issue, which leads to 
 https://pastebin.com/kkggV9Vx

 Now, the test vectors in that pastebin do not match either the output of 
 pre-change Lucene's murmur3, nor the output of the Python mmh3 package. 
 That said, the pre-change Lucene and the mmh3 package agree, just not with 
 the published list.

 There *are* test vectors in the source code for the mmh3 python package, 
 which I could use, or cook up a set of bespoke ones, or both (I share the 
 concern about 8-byte boundaries and signedness).
 https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75

 Cheers,
 Thomas

 On Tue, Apr 25, 2023 at 5:15 PM Robert Muir  wrote:
>
> i dont think we need a ton of random strings. But if you want to
> optimize for strings of length 8, at a minimum there should be very
> simple tests ensuring correctness for some boundary conditions (e.g.
> string of length exactly 8). i would also strongly recommend testing
> non-ascii since java is a language with signed integer types so it may
> be susceptible to bugs where the input bytes have the "sign bit" set.
>
> IMO this could be 2 simple unit tests.
>
> usually at least with these kinds of algorithms you can also find
> published "test vectors" that intend to seek out the corner cases. if
> these exist for murmurhash, we should fold them in too.
>
> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
>  wrote:
> >
> > Hey,
> >
> > I offered to run a large number of random-string-hashes to ensure that 
> > the output is the same pre- and post-change. I can add an arbitrary 
> > number of such tests to TestStringHelper.java, just specify the number 
> > you wish.
> >
> > If your worry is that my change breaches the inlining bytecode limit: 
> > Did you check whether the old version was inlineable or not? The new 
> > version is 263 bytecode instructions, the old version was 110. The 
> > default inlining limit appears to be 35 bytecode instructions on 
> > cursory checking (I may be wrong on this, though), so I don't think it 
> > was ever inlineable in default configs.
> >
> > On your statement "we haven't seen performance gains" -- the starting 
> > point of this thread was a friendly request to please point me to 
> > instructions for running a broad range of Lucene indexing benchmarks, 
> > so I can gather data for 

Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Thomas Dullien
Hey all,

another data point: There's a diagram with the relevant distributions of
word lengths in various languages here:

https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/

While English is close to the 8-byte limit, average word length in German
is 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes. I'll
gather some averages over the various Wikipedia indices.

Cheers,
Thomas

On Thu, Aug 24, 2023 at 2:09 PM Thomas Dullien 
wrote:

> Hey there,
>
> reviving this thread. To clarify: In order to show this patch is worth
> doing, I should index a bunch of natural-language documents (whichever
> language that is) and show that the patch brings a performance benefit?
>
> (Just clarifying, because at least inside ElasticSearch for the logs
> use-case, it turns out that it does provide a performance benefit -- but I
> want to make sure I understand what the Lucene community wishes to see as
> "evidence" this is worth pursuing :-)
>
> Cheers,
> Thomas
>
> On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood 
> wrote:
>
>> I would recommend some non-English tests. Non-Latin scripts (CJK, Arabic,
>> Hebrew) will have longer byte strings because of UTF8. German has large
>> compound words.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>> On Apr 25, 2023, at 10:57 AM, Thomas Dullien
>>  wrote:
>>
>> Hey all,
>>
>> ok, attached is a second patch that adds some unit tests; I am happy to
>> add more.
>>
>> This brings me back to my original question: I'd like to run some pretty
>> thorough benchmarking on Lucene, both for this change and for possible
>> other future changes, largely focused on indexing performance. What are
>> good command lines to do so? What are good corpora?
>>
>> Cheers,
>> Thomas
>>
>> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien 
>> wrote:
>>
>>> Hey,
>>>
>>> ok, I've done some digging: Unfortunately, MurmurHash3 does not publish
>>> official test vectors, see the following URLs:
>>> https://github.com/aappleby/smhasher/issues/6
>>>
>>> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
>>> There is a link to a pastebin entry in the first issue, which leads to
>>> https://pastebin.com/kkggV9Vx
>>>
>>> Now, the test vectors in that pastebin do not match either the output of
>>> pre-change Lucene's murmur3, nor the output of the Python mmh3 package.
>>> That said, the pre-change Lucene and the mmh3 package agree, just not with
>>> the published list.
>>>
>>> There *are* test vectors in the source code for the mmh3 python package,
>>> which I could use, or cook up a set of bespoke ones, or both (I share the
>>> concern about 8-byte boundaries and signedness).
>>>
>>> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75
>>>
>>> Cheers,
>>> Thomas
>>>
>>> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir  wrote:
>>>
 i dont think we need a ton of random strings. But if you want to
 optimize for strings of length 8, at a minimum there should be very
 simple tests ensuring correctness for some boundary conditions (e.g.
 string of length exactly 8). i would also strongly recommend testing
 non-ascii since java is a language with signed integer types so it may
 be susceptible to bugs where the input bytes have the "sign bit" set.

 IMO this could be 2 simple unit tests.

 usually at least with these kinds of algorithms you can also find
 published "test vectors" that intend to seek out the corner cases. if
 these exist for murmurhash, we should fold them in too.

 On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
  wrote:
 >
 > Hey,
 >
 > I offered to run a large number of random-string-hashes to ensure
 that the output is the same pre- and post-change. I can add an arbitrary
 number of such tests to TestStringHelper.java, just specify the number you
 wish.
 >
 > If your worry is that my change breaches the inlining bytecode limit:
 Did you check whether the old version was inlineable or not? The new
 version is 263 bytecode instructions, the old version was 110. The default
 inlining limit appears to be 35 bytecode instructions on cursory checking
 (I may be wrong on this, though), so I don't think it was ever inlineable
 in default configs.
 >
 > On your statement "we haven't seen performance gains" -- the starting
 point of this thread was a friendly request to please point me to
 instructions for running a broad range of Lucene indexing benchmarks, so I
 can gather data for further discussion; from my perspective, we haven't
 even gathered any data, so obviously we haven't seen any gains.
 >
 > Cheers,
 > Thomas
 >
 > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir  wrote:
 >>
 >> There is literally one string, all-ascii. This won't fail if all the