[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970888#comment-16970888 ]

Bruno Roustant edited comment on LUCENE-8920 at 11/9/19 5:22 PM:
-----------------------------------------------------------------

{quote}Out of curiosity, have you confirmed this?
{quote}
It's by design. The oversizing factor is just a multiplier. The rule is "encode 
with direct addressing if size-of-direct-addressing <= oversizing-factor x 
size-of-binary-search". So if the oversizing factor is 1, we don't encode with 
direct addressing unless the size shrinks or stays equal. So the whole FST 
memory can only shrink, never grow.
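To illustrate, the rule above boils down to a single size comparison per node. This is a hedged sketch; the class and method names are mine, not the actual Lucene API:

```java
// Sketch of the per-node encoding decision described above.
// Names are illustrative, not the actual Lucene methods.
public class DirectAddressingChoice {
  /**
   * Returns true if the node should be encoded with direct addressing:
   * its direct-addressing size must not exceed the binary-search size
   * times the oversizing factor.
   */
  static boolean useDirectAddressing(long directAddressingBytes,
                                     long binarySearchBytes,
                                     float oversizingFactor) {
    return directAddressingBytes <= oversizingFactor * binarySearchBytes;
  }

  public static void main(String[] args) {
    // With factor 1, direct addressing is used only when it does not
    // increase the size, so the FST can only shrink.
    System.out.println(useDirectAddressing(100, 100, 1f));   // true
    System.out.println(useDirectAddressing(110, 100, 1f));   // false
    // With factor 1.1, up to 10% oversizing per node is accepted.
    System.out.println(useDirectAddressing(110, 100, 1.1f)); // true
  }
}
```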
{quote}I worry that values greater than 1 might mostly make this new encoding 
used on nodes that don't have that many arcs
{quote}
There is still the old rule that encodes nodes with fixed-length arcs only when 
they have enough arcs and sit at a low depth in the tree 
(FST.shouldExpandNodeWithFixedLengthArcs()).
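That pre-existing gate can be sketched roughly as follows. The thresholds below are assumed for illustration, not the exact Lucene constants:

```java
// Sketch of the gate on fixed-length-arc candidates
// (cf. FST.shouldExpandNodeWithFixedLengthArcs): only nodes with many
// arcs, or with somewhat fewer arcs but close to the root, qualify.
// The constants are illustrative assumptions.
public class FixedLengthArcsGate {
  static final int SHALLOW_DEPTH = 3;    // assumed "low depth" cutoff
  static final int SHALLOW_MIN_ARCS = 5; // assumed threshold near the root
  static final int DEEP_MIN_ARCS = 10;   // assumed threshold elsewhere

  static boolean shouldExpand(int depth, int numArcs) {
    return (depth <= SHALLOW_DEPTH && numArcs >= SHALLOW_MIN_ARCS)
        || numArcs >= DEEP_MIN_ARCS;
  }
}
```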
{quote}so even something like a 10% increase could translate to hundreds of 
megabytes.
{quote}
This Jira issue is "reduce size of FST due to use of direct-addressing". I 
thought we were trying here to reduce the memory increase, not necessarily to 
remove it completely. That said, I understand the concern. If by default we 
ensure the memory does not increase at all by constraining direct-addressing 
(so the perf goal is somewhat reduced), how easy will it be for anyone to use 
the BlockTree postings format with a custom FST setting? Will it be an 
additional BlockTree postings format with a different name and settings?

I'll implement the memory reduction/oversizing credit to allow more precise 
control.
 But the final decision on the default memory/perf balance should be based on 
some measurements. [~jpountz] would you be able to run these measurements?
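The credit idea could look roughly like this: bank the bytes saved when binary search is chosen, and spend them to allow direct-addressing oversizing, so the FST as a whole never grows beyond what was saved. A hedged sketch with illustrative names:

```java
// Sketch of the "oversizing credit" idea: savings from binary-search
// encodings fund the extra bytes of direct-addressing encodings.
// Names and flow are illustrative assumptions, not the actual patch.
public class OversizingCredit {
  private long credit;

  /** Called when a node is encoded with binary search: bank the bytes
   *  direct addressing would have cost on top of it. */
  void addCredit(long directAddressingBytes, long binarySearchBytes) {
    credit += Math.max(0, directAddressingBytes - binarySearchBytes);
  }

  /** Allow direct addressing only if its extra bytes fit in the banked
   *  credit, then spend that credit. */
  boolean tryDirectAddressing(long directAddressingBytes, long binarySearchBytes) {
    long extra = directAddressingBytes - binarySearchBytes;
    if (extra <= credit) {
      credit -= Math.max(0, extra);
      return true;
    }
    return false;
  }
}
```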

 


> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Minor
>         Attachments: TestTermsDictRamBytesUsed.java
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?
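The label -> id -> arc-offset indirection suggested in the description could be sketched as below; with wide per-arc metadata, a gap then costs one small table slot instead of a full arc record. All names here are illustrative assumptions:

```java
// Sketch of the suggested two-level lookup: map each label to a dense
// id via a small table, then index the gap-free arc-offset array by id.
public class LabelIdLookup {
  // labelToId[label - firstLabel] is the dense id, or -1 for a gap.
  private final int[] labelToId;
  private final long[] arcOffsetById; // one entry per real arc, no gaps
  private final int firstLabel;

  LabelIdLookup(int firstLabel, int[] labelToId, long[] arcOffsetById) {
    this.firstLabel = firstLabel;
    this.labelToId = labelToId;
    this.arcOffsetById = arcOffsetById;
  }

  /** Returns the arc offset for the label, or -1 if absent. */
  long arcOffset(int label) {
    int idx = label - firstLabel;
    if (idx < 0 || idx >= labelToId.length) return -1;
    int id = labelToId[idx];
    return id < 0 ? -1 : arcOffsetById[id];
  }
}
```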



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
