[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

Bruno Roustant (Jira) Fri, 06 Sep 2019 04:53:52 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16924178#comment-16924178
 ]


Bruno Roustant commented on LUCENE-8920:
----------------------------------------

I'd love to work on that, but I'm pretty busy so I can't start immediately. If 
you can start on it soon I'll be happy to help and review.

I'll try to think more about the subject. Where should I post my remarks/ideas? 
Here in the thread or in an attached doc?

Some additional thoughts:
 * Threshold T1 to find to decide when direct-addressing is best (N / (max 
label - min label) >= T1). E.g. with T1 = 50% worst case is memory x2 right? 
(although there is the var length encoding difference...). Did you try that, 
what is the perf?
 * Threshold T2 to find to decide if a list is better (N < T2) or if 
open-addressing is more appropriate.
 * If N is close to 2^p, the probability that open-addressing aborts (can't 
store a label in less than L tries) is high. Do we double the array size 
(2^(p+1)) or can we take 1.5x2^p to save memory? (my intuition is the second, 
but need some testing about the load factor)
 * I think var-length List and fixed-length Binary-Search options could be 
merged to always have a var-length List that can be binary searched with low 
impact on perf. This is a work in itself, but it can help reduce the FST memory 
and thus free some bytes for the faster options.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

Reply via email to