[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-16 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930601#comment-16930601
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 9/16/19 2:26 PM:
-

I tried a quick benchmark to evaluate the load factor (num labels / array size) 
we can expect for open-addressing. This gives us a good estimate of the space 
requirement.

This load factor depends on the number of tries (called L above) we accept, 
which defines the worst-case perf. For each power-of-two array size, the 
benchmark attempts various L and outputs the max load factor; above this load 
factor, the open-addressing aborts and we fall back to binary search (benchmark 
output in the next post below).

Outcome:
 * Load factor slightly above 50% for good perf (for L = log(N)/2 + 1), as 
expected for open-addressing in literature.
 * It is also possible to use a power-of-two x 1.5 array size with efficient 
hash (to stay at load factor 50% in all cases).
 * We have to encode the “empty slot” (no label in a given array slot), 
probably with both a 0 label and a 0 value (if the node actually needs that 
combination, then abort and fall back to binary search).

This means we have to expect a space increase of 2x (compared to binary 
search) for better perf than binary search (L = log(N)/2 + 1 is the worst 
case; most of the time the open-addressing stops earlier, when it encounters 
an empty slot before L tries).

 

To me this balance between space and performance cannot be hardcoded; it 
depends on the use-case. There should be a balance-tuning parameter in the FST 
constructor (e.g. max-perf, perf-over-space, space-over-perf, min-space), and 
based on this balance we could set a couple of thresholds that define when to 
use direct-addressing, open-addressing, binary-search, or compact-list.
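The benchmark procedure described above can be sketched roughly as follows. 
This is a hypothetical reconstruction (class and method names are mine, not 
the actual benchmark code), assuming linear probing and a simple 
label-modulo-size hash:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class OpenAddressingLoadFactor {

  /** Inserts distinct byte labels with linear probing until one label would
   *  need more than maxTries probes, and returns the load factor reached. */
  static double maxLoadFactor(int arraySize, int maxTries, Random random) {
    int[] slots = new int[arraySize];
    Arrays.fill(slots, -1); // -1 marks an empty slot
    List<Integer> labels = new ArrayList<>();
    for (int i = 0; i < 256; i++) {
      labels.add(i);
    }
    Collections.shuffle(labels, random);
    int inserted = 0;
    for (int label : labels) {
      boolean placed = false;
      for (int t = 0; t < maxTries; t++) {
        int idx = (label + t) % arraySize; // simple hash: label modulo size
        if (slots[idx] == -1) {
          slots[idx] = label;
          placed = true;
          break;
        }
      }
      if (!placed) {
        break; // exceeded maxTries: here we would fall back to binary search
      }
      inserted++;
    }
    return (double) inserted / arraySize;
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    int runs = 1000;
    for (int tries = 1; tries <= 3; tries++) {
      double sum = 0;
      for (int i = 0; i < runs; i++) {
        sum += maxLoadFactor(16, tries, random);
      }
      System.out.printf("Array=16, tries=%d, loadFactor=%f%n", tries, sum / runs);
    }
  }
}
```

Note that with an array of size 256 every byte label hashes to its own slot, 
so the load factor is always 1.0 in that case.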



> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira

[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-16 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930608#comment-16930608
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 9/16/19 2:29 PM:
-

Open-addressing benchmark to store byte labels in an array of size between 8 
and 256.

"Array" means array size.

"tries" means the max number of slot comparisons allowed (when the slot we 
want to put a label in is not empty, we move to the next slot).

"loadFactor" = the max load factor (num labels / array size) for which we can 
add all the labels while respecting "tries". Above this max load factor, the 
open-addressing fails to store all labels and falls back to binary search. This 
max load factor is an average of the max load factors over 1000 random 
constructions done by the benchmark.

"memory" = space increase factor = 1 / loadFactor.

 

Array=8, tries=1, loadFactor=0.400125, memory x 2.499219
 Array=8, tries=2, loadFactor=0.63825, memory x 1.5667841
 Array=8, tries=3, loadFactor=0.78675, memory x 1.2710518
 Array=12, tries=1, loadFactor=0.34391674, memory x 2.9076805
 Array=12, tries=2, loadFactor=0.54183316, memory x 1.8455865
 Array=12, tries=3, loadFactor=0.6964172, memory x 1.4359208
 Array=16, tries=1, loadFactor=0.29825, memory x 3.352892
 Array=16, tries=2, loadFactor=0.5039375, memory x 1.9843731
 Array=16, tries=3, loadFactor=0.646625, memory x 1.5464914
 Array=16, tries=4, loadFactor=0.7493125, memory x 1.3345567
 Array=24, tries=1, loadFactor=0.25141662, memory x 3.9774618
 Array=24, tries=2, loadFactor=0.44929162, memory x 2.225726
 Array=24, tries=3, loadFactor=0.58083373, memory x 1.7216631
 Array=24, tries=4, loadFactor=0.6867924, memory x 1.4560441
 Array=32, tries=1, loadFactor=0.22528125, memory x 4.4388957
 Array=32, tries=2, loadFactor=0.4059375, memory x 2.4634335
 Array=32, tries=3, loadFactor=0.53625, memory x 1.8648019
 Array=32, tries=4, loadFactor=0.63653123, memory x 1.5710148
 Array=32, tries=5, loadFactor=0.7165625, memory x 1.3955517
 Array=48, tries=1, loadFactor=0.1885, memory x 5.2910066
 Array=48, tries=2, loadFactor=0.36193773, memory x 2.762906
 Array=48, tries=3, loadFactor=0.49154142, memory x 2.0344167
 Array=48, tries=4, loadFactor=0.5871248, memory x 1.7032154
 Array=48, tries=5, loadFactor=0.6700212, memory x 1.4924902
 Array=64, tries=1, loadFactor=0.17101562, memory x 5.8474193
 Array=64, tries=2, loadFactor=0.33707812, memory x 2.9666712
 Array=64, tries=3, loadFactor=0.46217188, memory x 2.1636972
 Array=64, tries=4, loadFactor=0.56409377, memory x 1.7727549
 Array=64, tries=5, loadFactor=0.6392813, memory x 1.5642567
 Array=64, tries=6, loadFactor=0.6963125, memory x 1.4361368
 Array=96, tries=1, loadFactor=0.15294789, memory x 6.5381746
 Array=96, tries=2, loadFactor=0.31204194, memory x 3.2046974
 Array=96, tries=3, loadFactor=0.42829168, memory x 2.3348575
 Array=96, tries=4, loadFactor=0.5277291, memory x 1.8949116
 Array=96, tries=5, loadFactor=0.60920805, memory x 1.6414753
 Array=96, tries=6, loadFactor=0.66278124, memory x 1.5087935
 Array=128, tries=1, loadFactor=0.15130469, memory x 6.6091805
 Array=128, tries=2, loadFactor=0.3054219, memory x 3.2741597
 Array=128, tries=3, loadFactor=0.43253124, memory x 2.3119717
 Array=128, tries=4, loadFactor=0.5283203, memory x 1.8927912
 Array=128, tries=5, loadFactor=0.6043203, memory x 1.6547517
 Array=128, tries=6, loadFactor=0.66267186, memory x 1.5090425
 Array=128, tries=7, loadFactor=0.70567185, memory x 1.4170892
 Array=192, tries=1, loadFactor=0.14355215, memory x 6.9661093
 Array=192, tries=2, loadFactor=0.32579714, memory x 3.0693946
 Array=192, tries=3, loadFactor=0.42888057, memory x 2.3316514
 Array=192, tries=4, loadFactor=0.5238486, memory x 1.9089485
 Array=192, tries=5, loadFactor=0.59191173, memory x 1.6894411
 Array=192, tries=6, loadFactor=0.65411395, memory x 1.5287856
 Array=192, tries=7, loadFactor=0.6975258, memory x 1.4336387
 Array=256, tries=1, loadFactor=1.0, memory x 1.0
 Array=256, tries=2, loadFactor=1.0, memory x 1.0
 Array=256, tries=3, loadFactor=1.0, memory x 1.0
 Array=256, tries=4, loadFactor=1.0, memory x 1.0
 Array=256, tries=5, loadFactor=1.0, memory x 1.0
 Array=256, tries=6, loadFactor=1.0, memory x 1.0
 Array=256, tries=7, loadFactor=1.0, memory x 1.0
 Array=256, tries=8, loadFactor=1.0, memory x 1.0



[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-17 Thread Mike Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931367#comment-16931367
 ] 

Mike Sokolov edited comment on LUCENE-8920 at 9/17/19 12:06 PM:


This is cool. Regarding the strategy for which encoding to apply, I'll just 
call out the current  heuristics:

{{doFixedArray = (node depth (distance from root node) <= 3 and N >= 5) or N 
>= 10}}

{{writeDirectly = doFixedArray && (max_label - min_label) < 4 * N}}
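Rewritten as plain code, the two heuristics above look like this (a minimal 
sketch; the method and parameter names here are mine, not the actual FST.java 
identifiers):

```java
public class EncodingHeuristics {

  /** Expand to fixed-length arcs for nodes close to the root with at least
   *  5 arcs, or for any node with at least 10 arcs. */
  static boolean doFixedArray(int depth, int numArcs) {
    return (depth <= 3 && numArcs >= 5) || numArcs >= 10;
  }

  /** Additionally use direct addressing only when the label range is dense:
   *  the spread of labels must stay below 4 slots per real arc. */
  static boolean writeDirectly(int depth, int numArcs, int minLabel, int maxLabel) {
    return doFixedArray(depth, numArcs) && (maxLabel - minLabel) < 4 * numArcs;
  }
}
```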

 

I think we can still consider that we would apply list-encoding for small N, 
and consider open addressing as a variant within "doFixedArray", where we can 
now choose among direct addressing (for the lowest load factors L), open 
addressing (for the intermediate case), and binary search (for the highest L). 
Does that sound right?

 

I wonder if we could work backwards from a single parameter L: maximum memory 
cost (vs list encoding). The API would guarantee that no set of arcs is ever 
encoded using more than L * the minimum possible, and then internally we choose 
the best (ie fastest lookup) encoding that achieves that, possibly with some 
tweak for higher-order arcs (ie near the root).






--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-10-11 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949647#comment-16949647
 ] 

Michael Sokolov edited comment on LUCENE-8920 at 10/11/19 5:13 PM:
---

For posterity, this is the worst case test that spreads out terms:

for (int i = 0; i < 100; ++i) {
  byte[] b = new byte[5];
  random().nextBytes(b);
  for (int j = 0; j < b.length; ++j) {
    b[j] &= 0xfc; // make this byte a multiple of 4
  }
  entries.add(new BytesRef(b));
}

buildFST(entries).ramBytesUsed();









[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-10-11 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949647#comment-16949647
 ] 

Michael Sokolov edited comment on LUCENE-8920 at 10/11/19 5:14 PM:
---

For posterity, this is the worst case test that spreads out terms:

for (int i = 0; i < 100; ++i) {
  byte[] b = new byte[5];
  random().nextBytes(b);
  for (int j = 0; j < b.length; ++j) {
    b[j] &= 0xfc; // make this byte a multiple of 4
  }
  entries.add(new BytesRef(b));
}

buildFST(entries).ramBytesUsed();










[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-10-14 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951033#comment-16951033
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 10/14/19 2:41 PM:
--

Update about my try with open-addressing.

In fact open-addressing is not compatible with the FST contract: the FST not 
only needs a label-to-value mapping, it also requires label-value entries to 
be ordered (for FSTEnum, for example, to get the next arc). Open-addressing 
does not preserve the ordering. Dead end; I missed that point.

So I'd like to continue helping on this, by leveraging [~jpountz]'s test and 
trying to find a heuristic that takes advantage of direct-addressing while 
limiting the memory overhead.

Medium term I'd like to explore another option: allow binary search on 
variable-length lists. If we can do this efficiently, maybe we could reduce 
the memory and still have good perf, potentially buying memory to use more 
direct-addressing.
 The idea is to encode the VInt/VLong differently, specially for FST key-value 
entries. Currently VInt encoding uses one bit per byte to tell whether there 
is an additional byte. We could use this bit instead to flag the label bytes 
(let's say this bit is 0) and to flag the value bytes differently (this bit is 
1). Let's call it the keyflag bit. So the new VInt-for-FST encoding is a 
series of bytes having the same keyflag bit until this bit changes (and the 
last byte read is ignored). This does not require more memory, and it allows 
binary search: jump to the middle of the node's whole byte range (keys and 
values indistinctly), read the keyflag there, move next/previous until the 
first byte of a key is found, then read the VInt-for-FST, and so on, like a 
binary search.
 There is a little byte-reading overhead compared to fixed-length binary 
search, but it is more compact, so maybe more IO cache hits on average. Perf 
to be evaluated.

[~mikemccand] what is your opinion on the above idea? Could it fix LUCENE-4682 
if it shows good enough perf?
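A rough sketch of the proposed keyflag encoding (hypothetical code, only to 
illustrate the idea above; a real implementation would live in the FST byte 
reader/writer):

```java
import java.io.ByteArrayOutputStream;

public class KeyFlagVInt {

  /** Writes v in 7-bit groups (least significant first). Every byte of a key
   *  has its high (keyflag) bit set to 0, every byte of a value has it set
   *  to 1. The end of a number is detected when the keyflag bit flips, which
   *  works because keys and values alternate in the node's byte range. */
  static void write(ByteArrayOutputStream out, long v, boolean isKey) {
    int flag = isKey ? 0 : 0x80;
    do {
      out.write((int) (v & 0x7F) | flag);
      v >>>= 7;
    } while (v != 0);
  }

  /** Reads the number starting at pos: consumes bytes while their keyflag
   *  bit matches the first byte's (the next byte, with a different flag, is
   *  the one "read and ignored" mentioned above). */
  static long read(byte[] bytes, int pos) {
    int flag = bytes[pos] & 0x80;
    long v = 0;
    int shift = 0;
    while (pos < bytes.length && (bytes[pos] & 0x80) == flag) {
      v |= (long) (bytes[pos] & 0x7F) << shift;
      shift += 7;
      pos++;
    }
    return v;
  }
}
```

Because any byte's keyflag bit says whether it belongs to a key or a value, a 
reader can jump to an arbitrary offset, scan to the first byte of the nearest 
key, decode it, and bisect from there.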




[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-11-08 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970444#comment-16970444
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 11/8/19 5:35 PM:
-

It works. I removed the labels for direct-addressing, except the first label 
of the arc.
 This makes direct-addressing encoding even more compact: now 48% of the nodes 
with fixed-length arcs are encoded with direct-addressing, at a cost of only 
+12% memory (instead of +23% with my previous patch).

I also did several things:
 The presence-bits table is now a reused structure (we no longer create a 
long[] each time we read an arc).
 I touched more code, but I think the code is now a bit clearer, with more 
comments.
 I moved the bit-twiddling methods to BitUtil.

Thanks reviewers!

 

I noticed a direct-addressing limitation: it is less performant than binary 
search for reverse lookup (lookup by output).

Is there a periodic automated monitoring of the performance / memory of 
commits? Could you give me the link? I would like to watch the change with this 
PR.

 

 I definitely would like to clean up FST more, but that will be the subject of 
another Jira issue later. There is too much duplicated code, which makes 
reasoning and modifications hard, and there are missing docs and method 
contracts (e.g. the expected position in the byte input).









[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-11-08 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970444#comment-16970444
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 11/8/19 5:36 PM:
-

It works. PR#980 updated. I removed the labels for direct-addressing, except 
the first label of the arc.
 This makes direct-addressing encoding even more compact: now 48% of the nodes 
with fixed-length arcs are encoded with direct-addressing, at a cost of only 
+12% memory (instead of +23% with my previous patch).

I also did several things:
 The presence-bits table is now a reused structure (we no longer create a 
long[] each time we read an arc).
 I touched more code, but I think the code is now a bit clearer, with more 
comments.
 I moved the bit-twiddling methods to BitUtil.

Thanks reviewers!

 

I noticed a direct-addressing limitation: it is less performant than binary 
search for reverse lookup (lookup by output).

Is there a periodic automated monitoring of the performance / memory of 
commits? Could you give me the link? I would like to watch the change with this 
PR.

 

 I definitely would like to clean up FST more, but that will be the subject of 
another Jira issue later. There is too much duplicated code, which makes 
reasoning and modifications hard, and there are missing docs and method 
contracts (e.g. the expected position in the byte input).









[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-11-09 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970888#comment-16970888
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 11/9/19 5:22 PM:
-

{quote}Out of curiosity, have you confirmed this?
{quote}
It's by design. The oversizing factor is just a multiplier. The rule is "encode 
with direct addressing if size-of-direct-addressing <= oversizing-factor x 
size-of-binary-search". So if the oversizing factor is 1, we only encode with 
direct addressing when it reduces (or at least does not increase) the size, and 
the whole FST memory can only shrink, never grow.
{quote}I worry that values greater than 1 might mostly make this new encoding 
used on nodes that don't have that many arcs
{quote}
There is still the old rule that encodes with fixed-length arcs only the nodes 
that have enough arcs and sit at a low depth in the tree 
(FST.shouldExpandNodeWithFixedLengthArcs()).
{quote}so even something like a 10% increase could translate to hundreds of 
megabytes.
{quote}
This Jira issue is "reduce size of FSTs due to use of direct-addressing". I 
thought we were trying to reduce the memory increase here, not necessarily 
remove it completely. That said, I understand the concern. If by default we 
ensure the memory does not increase at all by restricting direct addressing (so 
the performance goal is somewhat reduced), how easy will it be for anyone to 
use the BlockTree postings format with a custom FST setting? Will it be an 
additional BlockTree postings format with a different name and settings?

I'll implement the memory reduction/oversizing credit to have more precise 
control.
 But the final decision on the default memory/perf balance should be based on 
measurements. [~jpountz] would you be able to run these measurements?
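The encoding rule quoted above can be sketched as follows; this is a toy Python model under stated assumptions (flat per-arc cost, hypothetical function name), not Lucene's actual implementation:

```python
def should_use_direct_addressing(num_arcs, label_range, bytes_per_arc,
                                 oversizing_factor=1.0):
    """Toy model of the encoding decision (illustrative, not Lucene's code).

    Binary search stores one fixed-width slot per real arc; direct
    addressing stores one slot per label in the range, so gaps cost space.
    """
    size_binary_search = num_arcs * bytes_per_arc
    size_direct = label_range * bytes_per_arc
    return size_direct <= oversizing_factor * size_binary_search

# Dense node (no gaps): direct addressing is never larger, always accepted.
assert should_use_direct_addressing(10, 10, 8)
# Sparse node: 10 arcs spread over 40 labels -> 4x blow-up, rejected at factor 1.
assert not should_use_direct_addressing(10, 40, 8)
# A larger factor trades memory for lookup speed.
assert should_use_direct_addressing(10, 40, 8, oversizing_factor=4.0)
```

With `oversizing_factor=1.0` the predicate only accepts nodes where direct addressing does not grow the encoding, which is why the whole FST can only shrink under that setting.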

 






[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-11-10 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971138#comment-16971138
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 11/10/19 3:05 PM:
--

I added the expansion credit to PR #980. This indeed gives more control over 
the oversizing factor, and the result is better than I anticipated.

All my previous measurements were for the worst case for direct addressing. 
This time I also measured the FST size when building from an English dictionary 
of 479K words (including alphanumerics).

I used a direct-addressing oversizing factor of 1 (which ensures no oversizing 
on average).

For English words I measured 217K nodes; only 3.27% of the nodes (the top nodes 
of the FST) are encoded with fixed-length arcs, and 99.99% of those (of the 
3.27%) with direct addressing. Overall FST memory *reduced* by 1.67%.

For the worst case I measured 168K nodes; 50% of the nodes are encoded with 
fixed-length arcs, and 14% of those (of the 50%) with direct addressing. 
Overall FST memory *reduced* by 0.8%.

I'm confident that with this last PR we will not increase the FST memory 
(compared to without direct addressing). At the same time, the top nodes with 
many arcs will most often be encoded with direct addressing, and the worst case 
is controlled so we still ensure no memory increase (while still using direct 
addressing when favorable).
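The expansion-credit idea can be sketched like this; a toy Python model (an assumption for illustration, with a hypothetical class name, not Lucene's actual implementation): savings from nodes where direct addressing is smaller are banked as credit, and that credit is later spent to allow oversized nodes, so the total never exceeds the budget.

```python
# Toy model of the "oversizing credit" (illustrative, not Lucene's code).
class CreditAllocator:
    def __init__(self, oversizing_factor=1.0):
        self.factor = oversizing_factor
        self.credit = 0  # bytes saved so far, available to spend on oversizing

    def try_direct_addressing(self, size_direct, size_binary_search):
        # Accept direct addressing if it fits within the per-node budget
        # plus whatever credit earlier, smaller encodings have banked.
        allowance = self.factor * size_binary_search + self.credit
        if size_direct <= allowance:
            # Bank the saving, or spend credit to cover the overshoot.
            self.credit += size_binary_search - size_direct
            return True
        return False

alloc = CreditAllocator()
assert alloc.try_direct_addressing(60, 80)       # saves 20 bytes -> credit = 20
assert alloc.try_direct_addressing(95, 80)       # overshoot of 15 <= credit 20
assert not alloc.try_direct_addressing(100, 80)  # would exceed remaining credit
assert alloc.credit == 5
```

This is why the worst case stays controlled: with a factor of 1, the FST as a whole can never grow, even though individual oversized nodes are occasionally accepted.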




[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-11-12 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972414#comment-16972414
 ] 

Michael Sokolov edited comment on LUCENE-8920 at 11/12/19 1:53 PM:
---

+1 for merging, and handling {{cachedRootArcs}} separately





[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-11-14 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974198#comment-16974198
 ] 

Michael Sokolov edited comment on LUCENE-8920 at 11/14/19 12:38 PM:


I scanned the diff of all the commits together, and I didn't see any issues, 
but boy that is a big change now. Thanks for handling the merge, [~jpountz], 
and nice work on saving both space and time, [~bruno.roustant]!






[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-11-14 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974569#comment-16974569
 ] 

Michael Sokolov edited comment on LUCENE-8920 at 11/14/19 8:05 PM:
---

I ran the `luceneutil` test just to be sure: yes, 8.x can read indexes created 
by 8.3






[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-11-14 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974723#comment-16974723
 ] 

Chris M. Hostetter edited comment on LUCENE-8920 at 11/15/19 1:19 AM:
--

TestFstDirectAddressing.testDeDupTails has had a 44% failure rate (11/25 runs) 
on Jenkins boxes since it was added by this commit (so far only on Uwe's 
machines, but on multiple OSes, and on both master and 8.x).



> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Minor
> Fix For: 8.4
>
> Attachments: TestTermsDictRamBytesUsed.java
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org