mikemccand commented on PR #15779: URL: https://github.com/apache/lucene/pull/15779#issuecomment-3990897502
> Thanks to OpenAI's Codex, I am able to validate my implementation very quickly, which is fantastic.

+1 -- these genai tools are amazing. Claude (Opus 4.6) helped me add ascii-art (well, Unicode) [sparkline histograms to visualize smelly vectors in `luceneutil`'s `knnPerfTest.py` (vector search benchmark)](https://github.com/mikemccand/luceneutil/blob/main/src/python/knnPerfTest.py).

Wow, ~51% net topline (time to insert N keys) speedup by pre-computing all hashes into an up-front `int[]` during rehash!? You are doing precisely the same amount of CPU work (1X hash computation for each `BytesRef` in the `BytesRefHash`), just doing it up front (this change) vs. doing it interleaved with the inserts into the new, larger hash table? I.e. we are not somehow saving further `hash()` calls when stepping through collisions on insert.

I guess reading the key from a random packed `byte[]` location is cache-painful. Similarly, writing into the bigger hash table is also cache-painful. But if you try to do both at once -> cache thrashing (the two fight with each other, greatly reducing the cache hit %). If you run with `perf stat -ddd` it'll probably show exactly this in its counters?

I asked genai to look at the PR and explain the speedup. [Claude Opus 4.6](https://claude.ai/share/94566b7c-eccb-425e-bf2f-c74d7e40a0b4) did well. So did [Grok (Expert)](https://grok.com/share/c2hhcmQtMw_13d1dd8d-985d-48aa-9b94-b02143298a18). [Gemini (Thinking) was (surprisingly) not great](https://gemini.google.com/share/1b222bf6a692) -- it confusingly thought this PR introduced power-of-two hash table sizes (that is pre-existing), and that this PR switched from modulo math to bitmasks (also pre-existing).

Anyways, I love this change. How portable is it? If you use [Lucene's `aws-jmh` benchmark infra](https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh) across various CPU flavors, do the gains hold up?
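To make the "same CPU work, different memory access pattern" point concrete, here is a minimal hypothetical Java sketch (NOT the actual `BytesRefHash` code -- `hash()` below is a trivial stand-in for Lucene's MurmurHash over the byte pool, and the names are illustrative) contrasting the interleaved rehash with the PR's two-phase rehash:

```java
import java.util.Arrays;

public class RehashSketch {

  // Interleaved: for each key, hash it (random reads from the packed byte
  // pool) and immediately probe/write the new table (random writes). The
  // two access patterns compete for the same cache.
  static int[] rehashInterleaved(byte[][] keys, int newSize) {
    int mask = newSize - 1;               // newSize assumed a power of two
    int[] table = new int[newSize];
    Arrays.fill(table, -1);
    for (int id = 0; id < keys.length; id++) {
      int slot = hash(keys[id]) & mask;
      while (table[slot] != -1) {
        slot = (slot + 1) & mask;         // linear probe
      }
      table[slot] = id;
    }
    return table;
  }

  // Two-phase: phase 1 only reads keys and fills a transient int[] of
  // hashes; phase 2 only touches the new table. Same total hash work,
  // but each phase has a single dominant access pattern.
  static int[] rehashTwoPhase(byte[][] keys, int newSize) {
    int mask = newSize - 1;
    int[] hashes = new int[keys.length];  // the transient RAM surge
    for (int id = 0; id < keys.length; id++) {
      hashes[id] = hash(keys[id]);        // phase 1: hashing only
    }
    int[] table = new int[newSize];
    Arrays.fill(table, -1);
    for (int id = 0; id < keys.length; id++) {
      int slot = hashes[id] & mask;       // phase 2: table writes only
      while (table[slot] != -1) {
        slot = (slot + 1) & mask;
      }
      table[slot] = id;
    }
    return table;
  }

  // Stand-in hash; Lucene really uses MurmurHash.
  static int hash(byte[] key) {
    int h = 0;
    for (byte b : key) h = 31 * h + b;
    return h;
  }
}
```

Both variants produce identical tables (same hashes, same insertion order, same probe sequence); only the interleaving of reads and writes differs, which is why the win should show up in cache-miss counters rather than instruction counts.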
I expect on the nightly benchmarking box (`beast3`) this would be a big win -- it has four "chiplets", and inter-chiplet latency is much higher than within-chiplet, so I have to tell the OS which NUMA nodes to use, etc. Do you have the benchy source you ran? I'll test on `beast3`.

But I'm worried about the surge in transient RAM. The timing is especially bad because we already surge to 3X the current hash table during rehash (1X the current table, 2X the one being rehashed into) -- surge on surge. Could we change it to do that pre-computation in chunks?

Also: since `BytesRefHash` now uses the otherwise-zero leading bits of each int in the `int[] ids` array to hold some of the hash code bits, couldn't we often avoid recomputing the full hash entirely? We know the lower bits of the hash (the position in the current `ids` array), and we know more bits from [the recent opto](https://github.com/apache/lucene/commit/2f66c8f6668622c0c82b720d47ccd43b57e32edb) -- don't we have enough bits? We need just one more bit for the initial hash slot ... on linear probe it just increments ...
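The chunked pre-computation idea could look roughly like the hypothetical sketch below (again NOT the actual `BytesRefHash` code -- `CHUNK`, `rehashChunked`, and the stand-in `hash()` are all made up for illustration). It caps the transient scratch array at a fixed size instead of one int per key, trading a little phase-separation purity for a bounded RAM surge:

```java
import java.util.Arrays;

public class ChunkedRehash {
  static final int CHUNK = 1024;  // tunable; bounds extra transient RAM at CHUNK ints

  static int[] rehashChunked(byte[][] keys, int newSize) {
    int mask = newSize - 1;       // newSize assumed a power of two
    int[] table = new int[newSize];
    Arrays.fill(table, -1);
    int[] hashes = new int[Math.min(CHUNK, keys.length)];
    for (int start = 0; start < keys.length; start += CHUNK) {
      int end = Math.min(start + CHUNK, keys.length);
      // phase 1, per chunk: stream over this window of keys, hashing only
      for (int i = start; i < end; i++) {
        hashes[i - start] = hash(keys[i]);
      }
      // phase 2, per chunk: insert this window into the new table only
      for (int i = start; i < end; i++) {
        int slot = hashes[i - start] & mask;
        while (table[slot] != -1) {
          slot = (slot + 1) & mask;  // linear probe
        }
        table[slot] = i;
      }
    }
    return table;
  }

  // Stand-in hash; Lucene really uses MurmurHash over the byte pool.
  static int hash(byte[] key) {
    int h = 0;
    for (byte b : key) h = 31 * h + b;
    return h;
  }
}
```

Whether the cache benefit survives at a given `CHUNK` size is exactly the kind of question the `aws-jmh` runs across CPU flavors could answer: too small and the phases re-interleave at the cache level, too large and the transient surge returns.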
