[jira] [Commented] (LUCENE-10247) Reduce FST size by using absolute and relative coding for target pointers

Dawid Weiss (Jira) Mon, 22 Nov 2021 01:52:32 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447282#comment-17447282
 ]


Dawid Weiss commented on LUCENE-10247:
--------------------------------------

Hi [~hendrikmuhs]! This sounds interesting - didn't look at the patch yet - but 
I think the examination of performance should be twofold: the size of the 
output dictionary and the performance impact of additional calculations. 5% 
size improvement may not be worth it if there is a speed degradation there.
{code}
4.2 million key dictionary from 92MB ot 87MB, so saved 5MB, which is roughly 
5.5%.
{code}

I recall we _did_ have space-related optimizations in Lucene at some point but 
they were removed exactly for this reason (time it took to optimize/ serialize 
and then search over the automaton vs. size benefits). You can take a look at 
this ancient paper which discusses those size-reduction techniques, including 
global graph node rearrangements to minimize pointer size:

https://www.cs.put.poznan.pl/dweiss/site/publications/download/fsacomp.pdf

These are still used in Morfologik, by the way, so you could try to estimate 
what kind of benefits you're looking at for your dictionaries:

https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/main/java/morfologik/fsa/builders/CFSA2Serializer.java

My experience with the above project is that in 99% of cases runtime speed is 
much preferred over disk size (and this is achieved by simpler state encoding).

> Reduce FST size by using absolute and relative coding for target pointers
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-10247
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10247
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Hendrik Muhs
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> FST's use various tricks to reduce size. One more trick that can be added is 
> using relative coding for the target pointers that connect nodes.
> Currently if a node has a target and it's not optimized using the `next` 
> flag, it writes a VLong which contains the address of the child. This is an 
> absolute address. A relative address can be shorter, relative address means 
> taking the address of the node and expressing the result of child node as 
> diff between the parent and the child. E.g. if the child is _10099_ and the 
> parent {_}10000{_}, an absolute pointer requires 2 bytes, but the relative 
> pointer ({_}10099 - 10000{_}) requires only 1 byte. Of course that's not 
> always true, the absolute pointer could be smaller. Therefore the idea is to 
> switch between the two, dependent on what's smaller.
> (FWIW: From a theory standpoint this works nicely, because the FST is 
> constructed from the leafs to the root, child node addresses are always 
> smaller than parent ones and due to good locality, connected nodes are 
> usually stored close to each other )
> I have a prototype for this, it will be pushed as a draft PR and added as 
> link here. It introduces a new bit in _flags_ and changes the compiler to use 
> relative coding if possible. The lookup side needs the current node address 
> and checks one more bit flag. Both additions seem reasonable cheap and do not 
> add a lot of runtime.
> In my measurements I was able to reduce the size of a 4.2 million key 
> dictionary from 92MB ot 87MB, so saved 5MB, which is roughly 5.5%. If you 
> have better reference datasets, let me know. It obviously works better the 
> more data.
> The POC implementation only supports 1 output type (positive int output) and 
> I just made it compile. For this 1 type of dictionary it's possible to create 
> and dump dictionaries correctly.
> Before spending more time on this I like to hear feedback and get agreement 
> on whether this is something you like to have. It seems to me that a size 
> reduction of FST's is always welcome and as pointed out, the runtime 
> additions are reasonable small. If you wonder about the implementation, I 
> happily discuss the nitty gritty details on gh, it had some interesting 
> challenges.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-10247) Reduce FST size by using absolute and relative coding for target pointers

Reply via email to