[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970444#comment-16970444
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 11/8/19 5:36 PM:
-----------------------------------------------------------------

It works. PR#980 updated. I removed the labels for direct-addressing, except 
the first label of the arc.
 It makes direct-addressing encoding even more compact. Now we have 48% of the 
nodes with fixed length arcs encoded with direct-encoding by using only +12% 
memory (instead of +23% with my previous patch).

I also did several things:
 The presence bits table is now a reused structure (does not create long[] 
anymore each time we read an arc).
 I touched more code, but I think the code is now a bit clearer with more 
comments.
 I moved the bit twiddling methods to BitUtil.

Thanks reviewers!

 

I noticed the direct-addressing limitation: it is less performant than binary 
search for reverse lookup (lookup by output).

Is there a periodic automated monitoring of the performance / memory of 
commits? Could you give me the link? I would like to watch the change with this 
PR.

 

 I definitely would like to clean more FST. But that will be the subject of 
another Jira issue later. Too much duplicated code that makes reasoning and 
modifications hard. Also missing docs and method contracts (e.g. expected 
position in the byte input).


was (Author: bruno.roustant):
It works. I removed the labels for direct-addressing, except the first label of 
the arc.
 It makes direct-addressing encoding even more compact. Now we have 48% of the 
nodes with fixed length arcs encoded with direct-encoding by using only +12% 
memory (instead of +23% with my previous patch).

I also did several things:
 The presence bits table is now a reused structure (does not create long[] 
anymore each time we read an arc).
 I touched more code, but I think the code is now a bit clearer with more 
comments.
 I moved the bit twiddling methods to BitUtil.

Thanks reviewers!

 

I noticed the direct-addressing limitation: it is less performant than binary 
search for reverse lookup (lookup by output).

Is there a periodic automated monitoring of the performance / memory of 
commits? Could you give me the link? I would like to watch the change with this 
PR.

 

 I definitely would like to clean more FST. But that will be the subject of 
another Jira issue later. Too much duplicated code that makes reasoning and 
modifications hard. Also missing docs and method contracts (e.g. expected 
position in the byte input).

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Minor
>         Attachments: TestTermsDictRamBytesUsed.java
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to