[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931528#comment-16931528
 ] 

Bruno Roustant commented on LUCENE-8920:
----------------------------------------

{quote}list-encoding for small N, and consider open addressing as a variant 
within "doFixedArray"{quote}
Yes, sounds right.
{quote}single parameter: maximum memory cost [...] never encoded using more 
than L * the min possible{quote}
The min space possible is obtained with every node transitions encoded as a 
list with variable-length encoding. I don't see a way to determine the min 
space possible (except by encoding it, which would cost too much for the 
benefit) to give guarantees. It is however possible to compare to the "fixed 
array" encoding, but in this case we don't have a clear picture to give in the 
API contract.

I like the idea of a single parameter to work with. That's why I was thinking 
of a slider between most-compact to most-performant. In this case we don't give 
guarantees but best effort intent.

My next step is to work on the heuristics. Once we have that clear, we can 
start implementation.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to