[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970888#comment-16970888 ]
Bruno Roustant edited comment on LUCENE-8920 at 11/9/19 5:22 PM:
-----------------------------------------------------------------
{quote}Out of curiosity, have you confirmed this?{quote}
It's by design. The oversizing factor is just a multiplier. The rule is "encode with direct addressing if size-of-direct-addressing <= oversizing-factor x size-of-binary-search". So if the oversizing factor is 1, we only encode with direct addressing when that reduces the size or keeps it equal. So the whole FST memory can only decrease, never increase.

{quote}I worry that values greater than 1 might mostly make this new encoding used on nodes that don't have that many arcs{quote}
There is still the old rule that encodes with fixed-length arcs only the nodes that have enough arcs and sit at low depth in the tree (FST.shouldExpandNodeWithFixedLengthArcs()).

{quote}so even something like a 10% increase could translate to hundreds of megabytes.{quote}
This Jira issue is "reduce size of FST due to use of direct-addressing". I thought we were trying to reduce the memory increase here, not necessarily remove it completely. That said, I understand the concern. If by default we ensure the memory does not increase at all by restricting direct addressing (so the performance goal is somewhat reduced), how easy will it be for anyone to use the BlockTree postings format with a custom setting for the FST? Would it be an additional BlockTree postings format with a different name and settings?

I'll implement the memory reduction/oversizing credit to allow more precise control. But the final decision on the default memory/perf balance should be based on some measurements. [~jpountz] would you be able to take these measurements?

> Reduce size of FSTs due to use of direct-addressing encoding
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Minor
>         Attachments: TestTermsDictRamBytesUsed.java
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization.
> Several ideas were suggested to combat this on the mailing list:
> bq.
> I think we can improve the situation here by tracking, per-FST instance,
> the size increase we're seeing while building (or perhaps do a preliminary
> pass before building) in order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance I
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte
> range), which makes gaps very costly. Associating each label with a dense id
> and having an intermediate lookup, i.e. lookup label -> id and then
> id -> arc offset instead of doing label -> arc directly, could save a lot of
> space in some cases? Also it seems that we are repeating the label in the
> arc metadata when array-with-gaps is used, even though it shouldn't be
> necessary since the label is implicit from the address?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
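For readers following the discussion, the decision rule quoted in the comment ("encode with direct addressing if size-of-direct-addressing <= oversizing-factor x size-of-binary-search") can be sketched as below. This is an illustrative sketch only; the class and method names are made up for clarity and are not Lucene's actual FST API.

```java
// Illustrative sketch of the oversizing-factor rule discussed in the comment.
// Names are hypothetical, not Lucene's real implementation.
public class DirectAddressingHeuristic {

    /**
     * Returns true when a node's arcs should be encoded with direct
     * addressing, per the rule:
     *   directAddressingBytes <= oversizingFactor * binarySearchBytes
     * With oversizingFactor == 1, direct addressing is chosen only when it
     * does not enlarge the node, so total FST memory can never increase.
     */
    static boolean shouldUseDirectAddressing(long directAddressingBytes,
                                             long binarySearchBytes,
                                             float oversizingFactor) {
        return directAddressingBytes <= oversizingFactor * binarySearchBytes;
    }

    public static void main(String[] args) {
        // Factor 1: only pick direct addressing when it is not larger.
        System.out.println(shouldUseDirectAddressing(100, 80, 1.0f)); // larger -> rejected
        System.out.println(shouldUseDirectAddressing(80, 80, 1.0f));  // equal  -> accepted
        // Factor > 1: tolerate some per-node oversize for faster arc lookup.
        System.out.println(shouldUseDirectAddressing(120, 80, 1.66f));
    }
}
```

The "oversizing credit" mentioned in the comment would refine this by tracking the accumulated size difference across the whole FST rather than deciding node by node.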