[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930601#comment-16930601 ]

Bruno Roustant edited comment on LUCENE-8920 at 9/16/19 2:26 PM:

I tried a quick benchmark to evaluate the load factor (num labels / array size) we can expect for open-addressing. This gives us a good estimate of the space requirement. The load factor depends on the number of tries (called L above) we accept - this defines the worst-case perf. For each power-of-two array size, the benchmark attempts various L and outputs the max load factor (above this load factor, the open-addressing aborts and we fall back to binary search) (benchmark output in next post below). Outcome:
* Load factor slightly above 50% for good perf (for L = log(N)/2 + 1), as expected for open-addressing in the literature.
* It is also possible to use a power-of-two x 1.5 array size with an efficient hash (to stay at load factor 50% in all cases).
* We have to encode the "empty slot" (no label in a given array slot), probably with both a 0 label and a 0 value (if the node needs those, abort and fall back to binary search).

This means we have to expect a 2x space increase (compared to binary search) for better perf than binary search (L = log(N)/2 + 1 is the worst case; most of the time the open-addressing stops when it encounters an empty slot before L). To me this balance between space and performance cannot be hardcoded - it depends on the use case. There should be a balance tuning parameter in the FST constructor (e.g. max-perf, perf-over-space, space-over-perf, min-space), and based on this balance we could set the values of a couple of thresholds that define when to use direct-addressing, open-addressing, binary-search, or compact-list.

> Reduce size of FSTs due to use of direct-addressing encoding
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, the size increase we're seeing while building (or perhaps do a preliminary pass before building) in order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance I noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) which makes gaps very costly. Associating each label with a dense id and having an intermediate lookup, ie. lookup label -> id and then id -> arc offset instead of doing label -> arc directly could save a lot of space in some cases? Also it seems that we are repeating the label in the arc metadata when array-with-gaps is used, even though it shouldn't be necessary since the label is implicit from the address?

-- This message was sent by Atlassian Jira
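The bounded-tries experiment described above can be sketched roughly as follows. This is a hypothetical helper, not Lucene code; the hash function and the `tryInsertAll` name are assumptions made for illustration. It inserts distinct labels with linear probing and reports whether every label fit within the allowed number of slot comparisons:

```java
// Sketch of the open-addressing experiment: place labels with linear
// probing, giving up when any label needs more than maxTries probes
// (the case where a real node would fall back to binary search).
public class OpenAddressingSketch {
    static final int EMPTY = -1;

    /** Returns true if all (distinct, non-negative) labels fit within maxTries probes each. */
    static boolean tryInsertAll(int[] slots, int[] labels, int maxTries) {
        java.util.Arrays.fill(slots, EMPTY);
        for (int label : labels) {
            boolean placed = false;
            int h = label * 0x9E3779B9;  // cheap multiplicative hash (an assumption)
            for (int t = 0; t < maxTries; t++) {
                int i = Math.floorMod(h + t, slots.length);
                if (slots[i] == EMPTY) {
                    slots[i] = label;
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                return false;  // node would fall back to binary search
            }
        }
        return true;
    }
}
```

The benchmark's max load factor for a given array size and tries budget is then the largest labels/slots ratio for which `tryInsertAll` still succeeds, averaged over many random label sets.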
[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930608#comment-16930608 ]

Bruno Roustant edited comment on LUCENE-8920 at 9/16/19 2:29 PM:

Open-addressing benchmark to store byte labels in an array of size between 8 and 256. "Array" means array size. "tries" means the max number of slot comparisons allowed (when a slot we want to put a label in is not empty, move to the next slot). "loadFactor" = the max load factor (num labels / array size) for which we can add all the labels while respecting "tries"; above this load factor, the open-addressing fails to store all labels and falls back to binary search. This max load factor is an average over 1000 random constructions done by the benchmark. "memory" = space increase factor = 1 / loadFactor.

Array=8, tries=1, loadFactor=0.400125, memory x 2.499219
Array=8, tries=2, loadFactor=0.63825, memory x 1.5667841
Array=8, tries=3, loadFactor=0.78675, memory x 1.2710518
Array=12, tries=1, loadFactor=0.34391674, memory x 2.9076805
Array=12, tries=2, loadFactor=0.54183316, memory x 1.8455865
Array=12, tries=3, loadFactor=0.6964172, memory x 1.4359208
Array=16, tries=1, loadFactor=0.29825, memory x 3.352892
Array=16, tries=2, loadFactor=0.5039375, memory x 1.9843731
Array=16, tries=3, loadFactor=0.646625, memory x 1.5464914
Array=16, tries=4, loadFactor=0.7493125, memory x 1.3345567
Array=24, tries=1, loadFactor=0.25141662, memory x 3.9774618
Array=24, tries=2, loadFactor=0.44929162, memory x 2.225726
Array=24, tries=3, loadFactor=0.58083373, memory x 1.7216631
Array=24, tries=4, loadFactor=0.6867924, memory x 1.4560441
Array=32, tries=1, loadFactor=0.22528125, memory x 4.4388957
Array=32, tries=2, loadFactor=0.4059375, memory x 2.4634335
Array=32, tries=3, loadFactor=0.53625, memory x 1.8648019
Array=32, tries=4, loadFactor=0.63653123, memory x 1.5710148
Array=32, tries=5, loadFactor=0.7165625, memory x 1.3955517
Array=48, tries=1, loadFactor=0.1885, memory x 5.2910066
Array=48, tries=2, loadFactor=0.36193773, memory x 2.762906
Array=48, tries=3, loadFactor=0.49154142, memory x 2.0344167
Array=48, tries=4, loadFactor=0.5871248, memory x 1.7032154
Array=48, tries=5, loadFactor=0.6700212, memory x 1.4924902
Array=64, tries=1, loadFactor=0.17101562, memory x 5.8474193
Array=64, tries=2, loadFactor=0.33707812, memory x 2.9666712
Array=64, tries=3, loadFactor=0.46217188, memory x 2.1636972
Array=64, tries=4, loadFactor=0.56409377, memory x 1.7727549
Array=64, tries=5, loadFactor=0.6392813, memory x 1.5642567
Array=64, tries=6, loadFactor=0.6963125, memory x 1.4361368
Array=96, tries=1, loadFactor=0.15294789, memory x 6.5381746
Array=96, tries=2, loadFactor=0.31204194, memory x 3.2046974
Array=96, tries=3, loadFactor=0.42829168, memory x 2.3348575
Array=96, tries=4, loadFactor=0.5277291, memory x 1.8949116
Array=96, tries=5, loadFactor=0.60920805, memory x 1.6414753
Array=96, tries=6, loadFactor=0.66278124, memory x 1.5087935
Array=128, tries=1, loadFactor=0.15130469, memory x 6.6091805
Array=128, tries=2, loadFactor=0.3054219, memory x 3.2741597
Array=128, tries=3, loadFactor=0.43253124, memory x 2.3119717
Array=128, tries=4, loadFactor=0.5283203, memory x 1.8927912
Array=128, tries=5, loadFactor=0.6043203, memory x 1.6547517
Array=128, tries=6, loadFactor=0.66267186, memory x 1.5090425
Array=128, tries=7, loadFactor=0.70567185, memory x 1.4170892
Array=192, tries=1, loadFactor=0.14355215, memory x 6.9661093
Array=192, tries=2, loadFactor=0.32579714, memory x 3.0693946
Array=192, tries=3, loadFactor=0.42888057, memory x 2.3316514
Array=192, tries=4, loadFactor=0.5238486, memory x 1.9089485
Array=192, tries=5, loadFactor=0.59191173, memory x 1.6894411
Array=192, tries=6, loadFactor=0.65411395, memory x 1.5287856
Array=192, tries=7, loadFactor=0.6975258, memory x 1.4336387
Array=256, tries=1, loadFactor=1.0, memory x 1.0
Array=256, tries=2, loadFactor=1.0, memory x 1.0
Array=256, tries=3, loadFactor=1.0, memory x 1.0
Array=256, tries=4, loadFactor=1.0, memory x 1.0
Array=256, tries=5, loadFactor=1.0, memory x 1.0
Array=256, tries=6, loadFactor=1.0, memory x 1.0
Array=256, tries=7, loadFactor=1.0, memory x 1.0
Array=256, tries=8, loadFactor=1.0, memory x 1.0
[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931367#comment-16931367 ]

Mike Sokolov edited comment on LUCENE-8920 at 9/17/19 12:06 PM:

This is cool. Regarding the strategy for which encoding to apply, I'll just call out the current heuristics:

{{doFixedArray = (node depth (distance from root node) <= 3 and N >= 5) or N >= 10}}
{{writeDirectly = doFixedArray && (max_label - min_label) < 4 * N}}

I think we can still consider that we would apply list-encoding for small N, and consider open addressing as a variant within "doFixedArray", where we can now choose among direct addressing (for the lowest load factors L), open addressing (for the intermediate case), and binary search (for the highest L). Does that sound right?

I wonder if we could work backwards from a single parameter L: maximum memory cost (vs list encoding). The API would guarantee that no set of arcs is ever encoded using more than L * the minimum possible, and then internally we choose the best (ie fastest lookup) encoding that achieves that, possibly with some tweak for higher-order arcs (ie near the root).
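The two heuristics quoted above can be paraphrased directly as code. This is a sketch of the quoted rules, not the actual FST source; method names are made up for illustration:

```java
// Paraphrase of the current encoding heuristics: fixed-length arc arrays
// for nodes that are big or near the root, and direct addressing when the
// label range is dense enough (fewer than 4 slots per real arc).
public class EncodingHeuristics {
    /** Fixed-length arc array: near the root (depth <= 3) with >= 5 arcs, or >= 10 arcs anywhere. */
    static boolean doFixedArray(int depth, int numArcs) {
        return (depth <= 3 && numArcs >= 5) || numArcs >= 10;
    }

    /** Direct addressing: fixed array chosen, and label range under 4x the arc count. */
    static boolean writeDirectly(int depth, int numArcs, int minLabel, int maxLabel) {
        return doFixedArray(depth, numArcs) && (maxLabel - minLabel) < 4 * numArcs;
    }
}
```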
[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949647#comment-16949647 ]

Michael Sokolov edited comment on LUCENE-8920 at 10/11/19 5:13 PM:

For posterity, this is the worst case test that spreads out terms:

{code:java}
for (int i = 0; i < 100; ++i) {
  byte[] b = new byte[5];
  random().nextBytes(b);
  for (int j = 0; j < b.length; ++j) {
    b[j] &= 0xfc; // make this byte a multiple of 4
  }
  entries.add(new BytesRef(b));
}
buildFST(entries).ramBytesUsed();
{code}
[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949647#comment-16949647 ]

Michael Sokolov edited comment on LUCENE-8920 at 10/11/19 5:14 PM:

For posterity, this is the worst case test that spreads out terms:

{code:java}
for (int i = 0; i < 100; ++i) {
  byte[] b = new byte[5];
  random().nextBytes(b);
  for (int j = 0; j < b.length; ++j) {
    b[j] &= 0xfc; // make this byte a multiple of 4
  }
  entries.add(new BytesRef(b));
}
buildFST(entries).ramBytesUsed();
{code}
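Why the `b[j] &= 0xfc` mask makes this a worst case: every label becomes a multiple of 4, so a node's label range is roughly 4x its arc count, right at the direct-addressing density threshold, and about 3 of every 4 slots are gaps. A tiny sketch of the arithmetic (hypothetical helper, assuming labels 0, 4, 8, ...):

```java
// For n labels 0, 4, 8, ..., 4*(n-1), direct addressing must cover the
// whole label range, so it pays for ~4 slots per real arc.
public class LabelSpread {
    static int directAddressingSlots(int numLabels) {
        int minLabel = 0;
        int maxLabel = 4 * (numLabels - 1);
        return maxLabel - minLabel + 1;  // 253 slots for 64 labels, ~4x
    }
}
```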
[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951033#comment-16951033 ]

Bruno Roustant edited comment on LUCENE-8920 at 10/14/19 2:41 PM:

Update about my try with open-addressing: in fact open-addressing is not compatible with the FST contract. The FST not only needs a label-to-value mapping, it also requires the ordering of label-value entries (for FSTEnum, for example, to get the next arc), and open-addressing does not keep the ordering. Dead end - I missed that point.

So I'd like to continue helping on this by leveraging [~jpountz]'s test and trying to find a heuristic that takes advantage of direct-addressing while limiting the memory overhead.

Medium term I'd like to explore another option: allow binary search on a variable-length list. If we can do this efficiently, maybe we could reduce the memory and still have good perf, potentially to buy memory to use more direct-addressing. The idea is to encode the VInt/VLong differently, specially for FST key-value entries. Currently VInt encoding uses a bit to tell whether an additional byte follows. We could instead use this bit to flag the label bytes (let's say this bit is 0) and to flag the value bytes differently (this bit is 1). Let's call it the keyflag bit. So the new VInt-for-FST encoding is a series of bytes having the same keyflag bit until the bit changes (and the last byte read is ignored). This does not require more memory, and it allows binary search: jump to the middle of the whole byte range of the node (keys and values indistinctly), read the keyflag there and move next/previous until the first byte of a key is found, then read the VInt-for-FST, and so on like a binary search. There is a little byte-reading overhead compared to fixed-length binary search, but it is more compact, so maybe more IO cache hits on average. So perf is to be evaluated.

[~mikemccand] what is your opinion on the above idea? Could it fix LUCENE-4682 if it shows good enough perf?
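The keyflag encoding proposed above could look roughly like this. This is only a sketch of the idea, not committed code; the class and method names are invented. The high bit of every byte says whether it belongs to a key (0) or a value (1), so a binary search can land on any byte and resynchronize at the next key boundary:

```java
// Sketch of the "keyflag" variable-length encoding: 7 payload bits per
// byte, high bit = keyflag (0 for key bytes, 1 for value bytes). A number
// ends where the flag flips, instead of using a continuation bit.
public class KeyflagVInt {
    /** Writes v 7 bits at a time, low-order first, tagging each byte with the keyflag. */
    static void write(java.io.ByteArrayOutputStream out, long v, boolean isValue) {
        do {
            int b = (int) (v & 0x7F);
            if (isValue) {
                b |= 0x80;  // keyflag set: this byte belongs to a value
            }
            out.write(b);
            v >>>= 7;
        } while (v != 0);
    }

    /** True if the byte at pos belongs to a value (keyflag set). */
    static boolean isValueByte(byte[] data, int pos) {
        return (data[pos] & 0x80) != 0;
    }
}
```

A binary search would jump to an arbitrary position, call `isValueByte`, and scan backward/forward to the nearest flag flip to find the start of a key.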
[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970444#comment-16970444 ]

Bruno Roustant edited comment on LUCENE-8920 at 11/8/19 5:35 PM:

It works. I removed the labels for direct-addressing, except the first label of the arc. It makes the direct-addressing encoding even more compact: we now have 48% of the nodes with fixed-length arcs encoded with direct-addressing, using only +12% memory (instead of +23% with my previous patch).

I also did several things: the presence-bits table is now a reused structure (it does not create a long[] anymore each time we read an arc). I touched more code, but I think the code is now a bit clearer, with more comments. I moved the bit-twiddling methods to BitUtil. Thanks reviewers!

I noticed the direct-addressing limitation: it is less performant than binary search for reverse lookup (lookup by output).

Is there periodic automated monitoring of the performance / memory of commits? Could you give me the link? I would like to watch the change with this PR.

I definitely would like to clean up more of FST, but that will be the subject of another Jira issue later. There is too much duplicated code, which makes reasoning and modifications hard, and docs and method contracts are missing (e.g. the expected position in the byte input).
[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970444#comment-16970444 ]

Bruno Roustant edited comment on LUCENE-8920 at 11/8/19 5:36 PM:

It works. PR#980 updated. I removed the labels for direct-addressing, except the first label of the arc. It makes the direct-addressing encoding even more compact: we now have 48% of the nodes with fixed-length arcs encoded with direct-addressing, using only +12% memory (instead of +23% with my previous patch).

I also did several things: the presence-bits table is now a reused structure (it does not create a long[] anymore each time we read an arc). I touched more code, but I think the code is now a bit clearer, with more comments. I moved the bit-twiddling methods to BitUtil. Thanks reviewers!

I noticed the direct-addressing limitation: it is less performant than binary search for reverse lookup (lookup by output).

Is there periodic automated monitoring of the performance / memory of commits? Could you give me the link? I would like to watch the change with this PR.

I definitely would like to clean up more of FST, but that will be the subject of another Jira issue later. There is too much duplicated code, which makes reasoning and modifications hard, and docs and method contracts are missing (e.g. the expected position in the byte input).
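Dropping per-arc labels works because of the presence bits: one bit per label in the node's range, and an arc's position among the encoded arcs is the rank (count of set bits) before its slot. A simplified sketch of the lookup (not the actual Lucene implementation; names are invented):

```java
// Label-free direct addressing with presence bits: bit i set means label
// firstLabel+i exists, and the arc's index is the number of set bits
// before slot i, so the label never needs to be stored in the arc itself.
public class PresenceBits {
    /** Returns the arc index for label, or -1 if the label is absent. */
    static int arcIndex(long[] bits, int firstLabel, int label) {
        int slot = label - firstLabel;
        if (slot < 0 || slot >= bits.length * 64) {
            return -1;  // outside the node's label range
        }
        if ((bits[slot >> 6] & (1L << (slot & 63))) == 0) {
            return -1;  // gap: no arc for this label
        }
        int rank = 0;  // count set bits in all words before this slot
        for (int w = 0; w < (slot >> 6); w++) {
            rank += Long.bitCount(bits[w]);
        }
        rank += Long.bitCount(bits[slot >> 6] & ((1L << (slot & 63)) - 1));
        return rank;
    }
}
```

Reverse lookup (by output rather than by label) gets no help from this structure, which matches the limitation noted above.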
[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970888#comment-16970888 ] Bruno Roustant edited comment on LUCENE-8920 at 11/9/19 5:22 PM: - {quote}Out of curiosity, have you confirmed this? {quote} It's by design. The oversizing factor is just a multiplier. The rule is "encode with direct addressing if size-of-direct-addressing <= oversizing-factor x size-of-binary-search". So if oversizing factor is 1 we don't encode with direct addressing unless we reduce or equal the size. So the whole FST memory can only reduce, never increase. {quote}I worry that values greater than 1 might mostly make this new encoding used on nodes that don't have that many arcs {quote} There is still the old rule that encodes with fixed length arcs only nodes with enough arcs and with low depth in the tree (FST.shouldExpandNodeWithFixedLengthArcs()). {quote}so even something like a 10% increase could translate to hundreds of megabytes. {quote} This Jira issue is "reduce size of FST due to use of direct-addressing". I thought we were trying here to reduce the memory increase, not necessarily remove it completely. That said I understand the concern. If we ensure by default we finally don't increase at all the memory by controlling direct-addressing (so the perf goal is somewhat reduced), how easy will it be for anyone to use BlockTree posting format with a custom setting for FST ? Will it be an additional BlockTree posting format with different name and settings? I'll implement the memory reduction/oversizing credit to have more precise control. But the final decision for the default memory/perf balance should be taken based on some measures. [~jpountz] would you be able to have these measures? was (Author: bruno.roustant): {quote}Out of curiosity, have you confirmed this? {quote} It's by design. The oversizing factor is just a multiplier. 
The rule is "encode with direct addressing if size-of-direct-addressing <= oversizing-factor x size-of-binary-search". So if the oversizing factor is 1, we don't encode with direct addressing unless it keeps the size equal or reduces it. So the whole FST memory can only decrease, never increase. {quote}I worry that values greater than 1 might mostly make this new encoding used on nodes that don't have that many arcs {quote} There is still the old rule that encodes with fixed-length arcs only the nodes with enough arcs and sufficient depth in the tree (FST.shouldExpandNodeWithFixedLengthArcs()). {quote}so even something like a 10% increase could translate to hundreds of megabytes. {quote} This Jira issue is "reduce size of FSTs due to use of direct-addressing". I thought we were trying here to reduce the memory increase, not necessarily remove it completely. That said, I understand the concern. If by default we ensure we don't increase the memory at all by controlling direct addressing (so the perf goal is somewhat reduced), how easy will it be for anyone to use the BlockTree postings format with a custom setting for the FST? Will it be an additional BlockTree postings format with a different name and settings? I'll implement the memory reduction/oversizing credit to have more precise control. But the final decision on the default memory/perf balance should be made based on some measurements. [~jpountz] would you be able to take these measurements? > Reduce size of FSTs due to use of direct-addressing encoding > - > > Key: LUCENE-8920 > URL: https://issues.apache.org/jira/browse/LUCENE-8920 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Michael Sokolov > Priority: Minor > Attachments: TestTermsDictRamBytesUsed.java > > Time Spent: 3h 50m > Remaining Estimate: 0h > > Some data can lead to worst-case ~4x RAM usage due to this optimization. > Several ideas were suggested to combat this on the mailing list: > bq. 
I think we can improve the situation here by tracking, per-FST instance, > the size increase we're seeing while building (or perhaps do a preliminary > pass before building) in order to decide whether to apply the encoding. > bq. we could also make the encoding a bit more efficient. For instance I > noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) > which makes gaps very costly. Associating each label with a dense id and > having an intermediate lookup, i.e. lookup label -> id and then id->arc offset > instead of doing label->arc directly could save a lot of space in some cases? > Also it seems that we are repeating the label in the arc metadata when > array-with-gaps is used, even though it shouldn't be necessary since the > label is implicit from the address? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
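The sizing rule described in the comment above ("encode with direct addressing if size-of-direct-addressing <= oversizing-factor x size-of-binary-search") can be sketched as follows. This is an illustrative sketch only; the class and method names are hypothetical, not Lucene's actual API.

```java
// Sketch of the per-node encoding decision described above.
// Hypothetical names; not the actual Lucene implementation.
public class EncodingChoice {

    enum Encoding { DIRECT_ADDRESSING, BINARY_SEARCH }

    // With oversizingFactor = 1, direct addressing is chosen only when it
    // matches or reduces the binary-search size, so the FST can never grow.
    static Encoding choose(long directAddressingBytes,
                           long binarySearchBytes,
                           float oversizingFactor) {
        if (directAddressingBytes <= (long) (oversizingFactor * binarySearchBytes)) {
            return Encoding.DIRECT_ADDRESSING;
        }
        return Encoding.BINARY_SEARCH;
    }

    public static void main(String[] args) {
        // Dense node: direct addressing is smaller than binary search.
        System.out.println(choose(40, 48, 1f));
        // Sparse node: direct addressing would oversize; fall back.
        System.out.println(choose(200, 48, 1f));
    }
}
```

A factor above 1 trades memory for lookup speed by letting moderately sparse nodes still use direct addressing.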
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971138#comment-16971138 ] Bruno Roustant edited comment on LUCENE-8920 at 11/10/19 3:03 PM: -- I added the expansion credit to PR#980. This indeed gives more control over the oversizing factor, and the result is better than I anticipated. All the previous measures I did were for the worst case for direct addressing. This time I also measured the FST size when building from an English dictionary of 479K words (including alphanumerics). I used a direct-addressing oversizing factor of 1 (which ensures no oversizing on average). For English words I measured 217K nodes; only 3.27% of nodes (the top nodes in the FST) are encoded with fixed-length arcs, and 99.99% of them with direct addressing. Overall FST memory *reduced* by 1.67%. For the worst case I measured 168K nodes; 50% of nodes are encoded with fixed-length arcs, and 14% of them with direct addressing. Overall FST memory *reduced* by 0.8%. I'm confident that with this last PR we will not increase the FST memory (compared to without direct addressing). At the same time the top nodes with many arcs will most often be encoded with direct addressing. And the worst case is controlled so we still ensure no memory increase (while still trying to use direct addressing when favorable). was (Author: bruno.roustant): I added the expansion credit to PR#980. This indeed gives more control over the oversizing factor, and the result is better than I anticipated. All the previous measures I did were for the worst case for direct addressing. This time I also measured the FST size when building from an English dictionary of 479K words (including alphanumerics). I used a direct-addressing oversizing factor of 1 (which ensures no oversizing on average). For English words I measured 217K nodes; only 3.27% of nodes (the top nodes in the FST) are encoded with fixed-length arcs, and 99.99% of them with direct addressing. 
Overall FST memory *reduced* by 1.67%. For the worst case I measured 168K nodes; 50% of them are encoded with fixed-length arcs, and 14% of them with direct addressing. Overall FST memory *reduced* by 0.8%. I'm confident that with this last PR we will not increase the FST memory (compared to without direct addressing). At the same time the top nodes with many arcs will most often be encoded with direct addressing. And the worst case is controlled so we still ensure no memory increase (while still trying to use direct addressing when favorable).
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971138#comment-16971138 ] Bruno Roustant edited comment on LUCENE-8920 at 11/10/19 3:05 PM: -- I added the expansion credit to PR#980. This indeed gives more control over the oversizing factor, and the result is better than I anticipated. All the previous measures I did were for the worst case for direct addressing. This time I also measured the FST size when building from an English dictionary of 479K words (including alphanumerics). I used a direct-addressing oversizing factor of 1 (which ensures no oversizing on average). For English words I measured 217K nodes; only 3.27% of nodes (the top nodes in the FST) are encoded with fixed-length arcs, and 99.99% of them (of the 3.27%) with direct addressing. Overall FST memory *reduced* by 1.67%. For the worst case I measured 168K nodes; 50% of nodes are encoded with fixed-length arcs, and 14% of them (of the 50%) with direct addressing. Overall FST memory *reduced* by 0.8%. I'm confident that with this last PR we will not increase the FST memory (compared to without direct addressing). At the same time the top nodes with many arcs will most often be encoded with direct addressing. And the worst case is controlled so we still ensure no memory increase (while still trying to use direct addressing when favorable). was (Author: bruno.roustant): I added the expansion credit to PR#980. This indeed gives more control over the oversizing factor, and the result is better than I anticipated. All the previous measures I did were for the worst case for direct addressing. This time I also measured the FST size when building from an English dictionary of 479K words (including alphanumerics). 
I used a direct-addressing oversizing factor of 1 (which ensures no oversizing on average). For English words I measured 217K nodes; only 3.27% of nodes (the top nodes in the FST) are encoded with fixed-length arcs, and 99.99% of them with direct addressing. Overall FST memory *reduced* by 1.67%. For the worst case I measured 168K nodes; 50% of nodes are encoded with fixed-length arcs, and 14% of them with direct addressing. Overall FST memory *reduced* by 0.8%. I'm confident that with this last PR we will not increase the FST memory (compared to without direct addressing). At the same time the top nodes with many arcs will most often be encoded with direct addressing. And the worst case is controlled so we still ensure no memory increase (while still trying to use direct addressing when favorable).
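The "expansion credit" idea from the comment above can be illustrated with a small sketch. This is hypothetical code, not Lucene's actual implementation (which also bounds per-node oversizing via the oversizing factor): nodes where direct addressing is the smaller encoding bank their savings as credit, and a later node may only be oversized while credit covers the extra bytes, so the total FST size never exceeds the binary-search baseline.

```java
// Illustrative sketch of the expansion-credit mechanism; hypothetical names.
public class ExpansionCredit {

    private long credit; // bytes of banked savings we may spend on oversizing

    /** Decides the encoding for one node; returns true for direct addressing. */
    boolean tryDirectAddressing(long directBytes, long binarySearchBytes) {
        long cost = directBytes - binarySearchBytes; // positive = oversize
        if (cost <= 0) {
            credit -= cost; // direct addressing is smaller: bank the savings
            return true;
        }
        if (cost <= credit) {
            credit -= cost; // spend banked savings; overall size stays bounded
            return true;
        }
        return false; // not enough credit: fall back to binary search
    }
}
```

With a factor of 1, oversized nodes are only ever paid for by earlier savings, which matches the observed outcome that total FST memory never increased.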
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972414#comment-16972414 ] Michael Sokolov edited comment on LUCENE-8920 at 11/12/19 1:53 PM: --- +1 for merging, and handling {{cachedRootArcs}} separately was (Author: sokolov): +1 for merging, and handling {cachedRootArcs} separately
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974198#comment-16974198 ] Michael Sokolov edited comment on LUCENE-8920 at 11/14/19 12:38 PM: I scanned the diff of all the commits together, and I didn't see any issues, but boy that is a big change now. Thanks for handling the merge, [~jpountz], and nice work on saving both space and time, [~bruno.roustant]! was (Author: sokolov): I scanned the diff of all the commits together, and I didn't see any issues, but boy that is a big change now. Thanks for handling the merge, [~jpountz]
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974569#comment-16974569 ] Michael Sokolov edited comment on LUCENE-8920 at 11/14/19 8:05 PM: --- I ran the `luceneutil` test just to be sure: yes, 8.x can read indexes created by 8.3 was (Author: sokolov): I'll run the `luceneutil` test just to be sure
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974723#comment-16974723 ] Chris M. Hostetter edited comment on LUCENE-8920 at 11/15/19 1:19 AM: -- TestFstDirectAddressing.testDeDupTails has a 44% failure rate (11/25 runs) on Jenkins boxes since it was added by this commit (so far only on Uwe's machine, but on multiple OSes, and on both master and 8.x) was (Author: hossman): TestFstDirectAddressing.testDeDupTails has a 44% failure rate on Jenkins boxes since it was added by this commit (so far only on Uwe's machine, but on multiple OSes, and on both master and 8.x)