Hendrik Muhs created LUCENE-10247:
-------------------------------------
Summary: Reduce FST size by using absolute and relative coding for
target pointers
Key: LUCENE-10247
URL: https://issues.apache.org/jira/browse/LUCENE-10247
Project: Lucene - Core
Issue Type: Improvement
Components: core/FSTs
Reporter: Hendrik Muhs
FST's use various tricks to reduce size. One more trick that can be added is
using relative coding for the target pointers that connect nodes.
Currently if a node has a target and it's not optimized using the `next` flag,
it writes a VLong which contains the address of the child. This is an absolute
address. A relative address can be shorter, relative address means taking the
address of the node and expressing the result of child node as diff between the
parent and the child. E.g. if the child is _10099_ and the parent {_}10000{_},
an absolute pointer requires 2 bytes, but the relative pointer ({_}10099 -
10000{_}) requires only 1 byte. Of course that's not always true, the absolute
pointer could be smaller. Therefore the idea is to switch between the two,
dependent on what's smaller.
(FWIW: From a theory standpoint this works nicely, because the FST is
constructed from the leafs to the root, child node addresses are always smaller
than parent ones and due to good locality, connected nodes are usually stored
close to each other )
I have a prototype for this, it will be pushed as a draft PR and added as link
here. It introduces a new bit in _flags_ and changes the compiler to use
relative coding if possible. The lookup side needs the current node address and
checks one more bit flag. Both additions seem reasonable cheap and do not add a
lot of runtime.
In my measurements I was able to reduce the size of a 4.2 million key
dictionary from 92MB ot 87MB, so saved 5MB, which is roughly 5.5%. If you have
better reference datasets, let me know. It obviously works better the more data.
The POC implementation only supports 1 output type (positive int output) and I
just made it compile. For this 1 type of dictionary it's possible to create and
dump dictionaries correctly.
Before spending more time on this I like to hear feedback and get agreement on
whether this is something you like to have. It seems to me that a size
reduction of FST's is always welcome and as pointed out, the runtime additions
are reasonable small. If you wonder about the implementation, I happily discuss
the nitty gritty details on gh, it had some interesting challenges.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]