mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1751999625
Here are the results from running `test_all_sizes.py` then
`results_to_md.py`:
|NodeHash size|FST (MB)|RAM (MB)|FST build time (sec)|
|-------------|--------|--------|----------------|
|0|577.4|0.0|35.2|
|4|586.5|0.0|43.2|
|8|587.0|0.0|46.4|
|16|585.2|0.0|44.8|
|32|582.0|0.0|45.9|
|64|578.8|0.0|45.4|
|128|573.0|0.0|45.9|
|256|563.6|0.0|46.1|
|512|551.2|0.0|45.4|
|1024|537.5|0.0|45.7|
|2048|523.4|0.0|46.0|
|4096|509.5|0.1|45.6|
|8192|495.8|0.1|45.2|
|16384|481.8|0.2|46.3|
|32768|461.1|0.5|45.2|
|65536|447.2|1.0|45.7|
|131072|432.4|2.0|46.3|
|262144|418.6|4.0|46.3|
|524288|402.4|8.0|46.9|
|1048576|391.0|16.0|50.0|
|2097152|380.8|32.0|55.2|
|4194304|371.4|64.0|58.3|
|8388608|362.5|128.0|59.9|
|16777216|356.1|256.0|59.3|
|33554432|351.4|512.0|57.3|
|67108864|350.2|1024.0|52.6|
|134217728|350.2|2048.0|49.2|
|268435456|350.2|4096.0|48.4|
|536870912|350.2|8192.0|46.9|
|1073741824|350.2|16384.0|44.5|
One WTF (wow that's funny) is why a `NodeHash` size of 0 (no suffix sharing)
creates a smaller FST than the tiny `NodeHash` sizes (4 through 64): FST size
should be monotonic, since the `NodeHash` should only enable sharing of
suffixes. Maybe it is something about the loss of locality of the FST suffix
nodes, causing more bytes to be needed to refer to them later? Confusing.
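As a quick check of which sizes are affected, here is a small sketch over the first few rows copied from the table above. The names `SMALL_ROWS` and `larger_than_baseline` are made up for this illustration and are not part of `test_all_sizes.py` or `results_to_md.py`.

```python
# (NodeHash size, FST MB) for the smallest sizes, copied from the table above
SMALL_ROWS = [
    (0, 577.4),
    (4, 586.5),
    (8, 587.0),
    (16, 585.2),
    (32, 582.0),
    (64, 578.8),
    (128, 573.0),
    (256, 563.6),
]


def larger_than_baseline(rows):
    """Return the NodeHash sizes whose FST is larger than the first row's FST."""
    baseline_mb = rows[0][1]
    return [size for size, fst_mb in rows[1:] if fst_mb > baseline_mb]


print(larger_than_baseline(SMALL_ROWS))  # [4, 8, 16, 32, 64]
```

So only from a `NodeHash` size of 128 upward does the FST get smaller than the size-0 baseline.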
Another observation is that it takes quite a lot of RAM to bring the final
FST size close-ish to its optimal / minimal size (350.2 MB): the table only
reaches that size once the `NodeHash` uses 1024.0 MB.
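To put rough numbers on that, the sketch below takes a handful of rows copied from the table and computes how far each FST is above the 350.2 MB minimum. `ROWS`, `MIN_FST_MB` and `overhead_pct` are hypothetical names chosen for this example only.

```python
# (NodeHash size, FST MB, RAM MB) rows copied from the table above
ROWS = [
    (0, 577.4, 0.0),
    (65536, 447.2, 1.0),
    (1048576, 391.0, 16.0),
    (8388608, 362.5, 128.0),
    (33554432, 351.4, 512.0),
    (67108864, 350.2, 1024.0),
]

MIN_FST_MB = 350.2  # smallest FST size observed in the table


def overhead_pct(fst_mb, min_mb=MIN_FST_MB):
    """Percent by which an FST exceeds the minimal observed size."""
    return 100.0 * (fst_mb - min_mb) / min_mb


for size, fst_mb, ram_mb in ROWS:
    print(f"{size:>9} entries, {ram_mb:>6.1f} MB RAM: "
          f"+{overhead_pct(fst_mb):.1f}% over minimal")
```

By this measure, 16 MB of RAM still leaves the FST roughly 12% above minimal, and it takes 512 MB to get within half a percent.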
It's also curious how the FST build time grows with a larger `NodeHash`
(peaking at 59.9 sec for 8388608 entries, then falling again for the largest
sizes) -- maybe this is just the added cost of maintaining/cycling the double
barrel hash (and promoting entries from the "old" to the "new" barrel)?
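For readers unfamiliar with the idea, here is a minimal sketch of a generic two-generation ("double barrel") cache. It is an illustration of the general technique only, not the `NodeHash` code in this PR: the class name `DoubleBarrelCache`, its methods, and the use of plain dicts are all assumptions made for the example.

```python
class DoubleBarrelCache:
    """Illustrative two-generation cache; NOT Lucene's NodeHash implementation."""

    def __init__(self, max_per_barrel):
        self.max_per_barrel = max_per_barrel
        self.new = {}  # entries added or used since the last cycle
        self.old = {}  # previous generation, discarded at the next cycle

    def get(self, key):
        if key in self.new:
            return self.new[key]
        if key in self.old:
            # Promote: a hit in the old barrel is re-inserted into the new one
            value = self.old[key]
            self.put(key, value)
            return value
        return None

    def put(self, key, value):
        if key not in self.new and len(self.new) >= self.max_per_barrel:
            # Cycle: the full new barrel becomes the old barrel,
            # and whatever was in the old barrel is dropped
            self.old = self.new
            self.new = {}
        self.new[key] = value


cache = DoubleBarrelCache(max_per_barrel=2)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)      # new barrel was full: {a, b} becomes old, new = {c}
print(cache.get("a"))  # found in old, promoted into new
```

In a scheme like this, every hit in the old barrel costs an extra insert, and every cycle discards a whole generation, which is the kind of bookkeeping the hypothesis above refers to.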
I will try soonish to post a similar table from `main` (unbounded
`NodeHash`), tuning its god-like knobs for controlling RAM usage during FST
compilation, for comparison with this approach.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
