[
https://issues.apache.org/jira/browse/LUCENE-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447282#comment-17447282
]
Dawid Weiss commented on LUCENE-10247:
--------------------------------------
Hi [~hendrikmuhs]! This sounds interesting - didn't look at the patch yet - but
I think the examination of performance should be twofold: the size of the
output dictionary and the performance impact of additional calculations. 5%
size improvement may not be worth it if there is a speed degradation there.
{code}
4.2 million key dictionary from 92MB ot 87MB, so saved 5MB, which is roughly
5.5%.
{code}
I recall we _did_ have space-related optimizations in Lucene at some point but
they were removed exactly for this reason (time it took to optimize/ serialize
and then search over the automaton vs. size benefits). You can take a look at
this ancient paper which discusses those size-reduction techniques, including
global graph node rearrangements to minimize pointer size:
https://www.cs.put.poznan.pl/dweiss/site/publications/download/fsacomp.pdf
These are still used in Morfologik, by the way, so you could try to estimate
what kind of benefits you're looking at for your dictionaries:
https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/main/java/morfologik/fsa/builders/CFSA2Serializer.java
My experience with the above project is that in 99% of cases runtime speed is
much preferred over disk size (and this is achieved by simpler state encoding).
> Reduce FST size by using absolute and relative coding for target pointers
> -------------------------------------------------------------------------
>
> Key: LUCENE-10247
> URL: https://issues.apache.org/jira/browse/LUCENE-10247
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/FSTs
> Reporter: Hendrik Muhs
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> FST's use various tricks to reduce size. One more trick that can be added is
> using relative coding for the target pointers that connect nodes.
> Currently if a node has a target and it's not optimized using the `next`
> flag, it writes a VLong which contains the address of the child. This is an
> absolute address. A relative address can be shorter, relative address means
> taking the address of the node and expressing the result of child node as
> diff between the parent and the child. E.g. if the child is _10099_ and the
> parent {_}10000{_}, an absolute pointer requires 2 bytes, but the relative
> pointer ({_}10099 - 10000{_}) requires only 1 byte. Of course that's not
> always true, the absolute pointer could be smaller. Therefore the idea is to
> switch between the two, dependent on what's smaller.
> (FWIW: From a theory standpoint this works nicely, because the FST is
> constructed from the leafs to the root, child node addresses are always
> smaller than parent ones and due to good locality, connected nodes are
> usually stored close to each other )
> I have a prototype for this, it will be pushed as a draft PR and added as
> link here. It introduces a new bit in _flags_ and changes the compiler to use
> relative coding if possible. The lookup side needs the current node address
> and checks one more bit flag. Both additions seem reasonable cheap and do not
> add a lot of runtime.
> In my measurements I was able to reduce the size of a 4.2 million key
> dictionary from 92MB ot 87MB, so saved 5MB, which is roughly 5.5%. If you
> have better reference datasets, let me know. It obviously works better the
> more data.
> The POC implementation only supports 1 output type (positive int output) and
> I just made it compile. For this 1 type of dictionary it's possible to create
> and dump dictionaries correctly.
> Before spending more time on this I like to hear feedback and get agreement
> on whether this is something you like to have. It seems to me that a size
> reduction of FST's is always welcome and as pointed out, the runtime
> additions are reasonable small. If you wonder about the implementation, I
> happily discuss the nitty gritty details on gh, it had some interesting
> challenges.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]