[
https://issues.apache.org/jira/browse/LUCENE-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447609#comment-17447609
]
Hendrik Muhs commented on LUCENE-10247:
---------------------------------------
Thanks for the 1st feedback.
> and there are some typos and todos left
That's an understatement ;-) It's a POC, there is tons of work left to make
this work for _all_ dictionary types and with proper testing. For some parts I
probably need help, e.g. regarding BWC. But if there is a thumbs up for the
idea, I happily go for it.
> Reduce FST size by using absolute and relative coding for target pointers
> -------------------------------------------------------------------------
>
> Key: LUCENE-10247
> URL: https://issues.apache.org/jira/browse/LUCENE-10247
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/FSTs
> Reporter: Hendrik Muhs
> Priority: Major
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> FST's use various tricks to reduce size. One more trick that can be added is
> using relative coding for the target pointers that connect nodes.
> Currently if a node has a target and it's not optimized using the `next`
> flag, it writes a VLong which contains the address of the child. This is an
> absolute address. A relative address can be shorter, relative address means
> taking the address of the node and expressing the result of child node as
> diff between the parent and the child. E.g. if the child is _10099_ and the
> parent {_}10000{_}, an absolute pointer requires 2 bytes, but the relative
> pointer ({_}10099 - 10000{_}) requires only 1 byte. Of course that's not
> always true, the absolute pointer could be smaller. Therefore the idea is to
> switch between the two, dependent on what's smaller.
> (FWIW: From a theory standpoint this works nicely, because the FST is
> constructed from the leafs to the root, child node addresses are always
> smaller than parent ones and due to good locality, connected nodes are
> usually stored close to each other )
> I have a prototype for this, it will be pushed as a draft PR and added as
> link here. It introduces a new bit in _flags_ and changes the compiler to use
> relative coding if possible. The lookup side needs the current node address
> and checks one more bit flag. Both additions seem reasonable cheap and do not
> add a lot of runtime.
> In my measurements I was able to reduce the size of a 4.2 million key
> dictionary from 92MB ot 87MB, so saved 5MB, which is roughly 5.5%. If you
> have better reference datasets, let me know. It obviously works better the
> more data.
> The POC implementation only supports 1 output type (positive int output) and
> I just made it compile. For this 1 type of dictionary it's possible to create
> and dump dictionaries correctly.
> Before spending more time on this I like to hear feedback and get agreement
> on whether this is something you like to have. It seems to me that a size
> reduction of FST's is always welcome and as pointed out, the runtime
> additions are reasonable small. If you wonder about the implementation, I
> happily discuss the nitty gritty details on gh, it had some interesting
> challenges.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]