[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-06 Thread Mike Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924203#comment-16924203
 ] 

Mike Sokolov commented on LUCENE-8920:
--

If I understand you correctly, T1 is the threshold we introduced earlier this 
year (or its inverse DIRECT_ARC_LOAD_FACTOR in fst.Builder). It's currently set 
to 4, or (1/4 as T1 in your formulation).  There was pre-existing logic to 
decide (var-encoded) list vs. the (fixed-size, packed) array encoding; my 
change was piggy-backed on that. It's a threshold on N that depends on the 
depth in the FST. See FST.shouldExpand.

If you want to write up the open addressing idea in more detail, it's fine to 
add comments here unless you think they are too long / inconvenient to write in 
this form, then maybe attach a doc? I think that goes directly to the point of 
reducing space consumption, so this issue seems like a fine place for it.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924178#comment-16924178
 ] 

Bruno Roustant commented on LUCENE-8920:


I'd love to work on that, but I'm pretty busy so I can't start immediately. If 
you can start on it soon I'll be happy to help and review.

I'll try to think more about the subject. Where should I post my remarks/ideas? 
Here in the thread or in an attached doc?

Some additional thoughts:
 * Threshold T1 to find to decide when direct-addressing is best (N / (max 
label - min label) >= T1). E.g. with T1 = 50% worst case is memory x2 right? 
(although there is the var length encoding difference...). Did you try that, 
what is the perf?
 * Threshold T2 to find to decide if a list is better (N < T2) or if 
open-addressing is more appropriate.
 * If N is close to 2^p, the probability that open-addressing aborts (can't 
store a label in less than L tries) is high. Do we double the array size 
(2^(p+1)) or can we take 1.5x2^p to save memory? (my intuition is the second, 
but need some testing about the load factor)
 * I think var-length List and fixed-length Binary-Search options could be 
merged to always have a var-length List that can be binary searched with low 
impact on perf. This is a work in itself, but it can help reduce the FST memory 
and thus free some bytes for the faster options.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-05 Thread Mike Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923530#comment-16923530
 ] 

Mike Sokolov commented on LUCENE-8920:
--

I like this! I would be happy to review if you want to post a patch. I may try 
eventually too if you don't get to it. I hope it should be a bit easier to try 
now that we have done some refactoring here, too. One potential complication is 
we are running out of bits to signal different encodings, but I think there 
should be one or two left?

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-05 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923463#comment-16923463
 ] 

Bruno Roustant commented on LUCENE-8920:


Based on some heuristics, Direct-Addressing is the good choice. For example if 
num labels / (max label - min label) >= 75%.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-05 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923446#comment-16923446
 ] 

Adrien Grand commented on LUCENE-8920:
--

This is a cool idea.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-05 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923428#comment-16923428
 ] 

Bruno Roustant commented on LUCENE-8920:


[~sokolov]  There may be another option to speed-up FST arc lookup while 
limiting the memory increase.

Direct-Addressing option looks up by accessing directly 1 label, and costs up 
to (num labels x 4 x num bytes to encode) bytes.

Label-List option is the opposite, look up needs on average N/2 label 
comparisons, and costs (num labels x var bytes to encode) bytes.

 

Another option is to use open-addressing. Look up would be <= L comparisons 
where we can fix L < log(N)/2 (to be faster than binary search), and would cost 
< (num labels x 2 x num bytes to encode).

The idea is to have an array of size 2^p such as 2^(p-1) < N < 2^p. We hash the 
labels and store them in the array using the open-addressing idea: if a slot is 
occupied, try with the next block. If we can’t store a label in less than L 
tries, then abort and fallback to Label-List or Binary-Search option. At lookup 
we hash the input label and know that we have less than L tries to compare.

This is another compromise speed/memory: faster than binary search (constant 
L), with at least 2x less memory than Direct-Addressing.

It is also possible to combine open-addressing and variable length encoding, by 
finding the first byte starting a label based on the bit used to encode the var 
length additional bytes.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-08-04 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899692#comment-16899692
 ] 

ASF subversion and git services commented on LUCENE-8920:
-

Commit a929ac5dacba1f36c88f78412520ea1cd5f6ba1e in lucene-solr's branch 
refs/heads/SOLR-13105-visual from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a929ac5 ]

LUCENE-8920: precommit errors


> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-31 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897665#comment-16897665
 ] 

ASF subversion and git services commented on LUCENE-8920:
-

Commit a929ac5dacba1f36c88f78412520ea1cd5f6ba1e in lucene-solr's branch 
refs/heads/jira/SOLR-13650 from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a929ac5 ]

LUCENE-8920: precommit errors


> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-31 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897113#comment-16897113
 ] 

Mike Sokolov commented on LUCENE-8920:
--

[~noble.paul] thanks for fixing - I thought I had been watching the mailing 
list and would have seen any build fails from this, but somehow I did not!

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-31 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896923#comment-16896923
 ] 

ASF subversion and git services commented on LUCENE-8920:
-

Commit a929ac5dacba1f36c88f78412520ea1cd5f6ba1e in lucene-solr's branch 
refs/heads/master from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a929ac5 ]

LUCENE-8920: precommit errors


> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-30 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896109#comment-16896109
 ] 

ASF subversion and git services commented on LUCENE-8920:
-

Commit d1706b36babda2dc17fd82d3165607f5c44bd83e in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d1706b3 ]

LUCENE-8920: Fix bug preventing FST duplicate tails from being shared when 
encoded as array-with-gaps


> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-26 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894086#comment-16894086
 ] 

ASF subversion and git services commented on LUCENE-8920:
-

Commit fe0c042470dc1a1ba7ffd27f91ac7bc96c3254a0 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fe0c042 ]

LUCENE-8920: remove Arc setters, moving implementations into Arc, or copying 
data into consumers


> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-26 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894085#comment-16894085
 ] 

ASF subversion and git services commented on LUCENE-8920:
-

Commit 760f2dbdcb29b993aab8f981d84ccbf2e20e9fa5 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=760f2dbd ]

LUCENE-8920: encapsulate FST.Arc data


> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-26 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894087#comment-16894087
 ] 

ASF subversion and git services commented on LUCENE-8920:
-

Commit 92d4e712d5d50d745c5a6c10dacda66198974116 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=92d4e71 ]

LUCENE-8920: refactor FST binary search


> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-19 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1675#comment-1675
 ] 

Mike Sokolov commented on LUCENE-8920:
--

bq. I'm making it a blocker for 8.3 since we haven't reverted from branch_8x.

Fair enough. I have the commits prepared for the refactoring steps,
but I foolishly did them on a home pc that I cannot access r.n.; will
post tonight.



> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-19 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1612#comment-1612
 ] 

Adrien Grand commented on LUCENE-8920:
--

This sounds good Mike. I'm making it a blocker for 8.3 since we haven't 
reverted from branch_8x.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-18 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888416#comment-16888416
 ] 

Mike Sokolov commented on LUCENE-8920:
--

Before digging in in earnest on FST size reduction, I'd like to tighten up the 
FST.Arc contract. Right now it has all public members and no methods to speak 
of, so the abstraction boundary is not well defined, and in fact we see 
consumers modifying Arc members in a few places outside of the FST class 
itself. This makes it more difficult to reason about the code and make provably 
valid changes. My plan is to do some nonfunctional commits:

1. Add accessors (mostly getters, a few setters will be needed temporarily) to 
Arc, and make all of its members private. It seems as if we often write 
accessors with the same name as the member (rather than the bean standard), so 
I'll go with that.
2. Eliminate the setters; this will require some light refactoring in FSTEnum, 
and a few changes to the memory codec, which keeps a list of Arcs locally and 
updates them for its own purposes.
3. Some refactoring and general cleanup (tightening up access, whitespace 
fixes, etc)

Because that first step is going to touch a lot of files, keep it very strictly 
about introducing the accessors, so there won't be anything beyond changing 
things like `arc.flags` to `arc.flags()`, in a lot of places.

Once these changes are in, the fun can begin again :) I'll add Adrien's 
worst-case test and work on getting the size down for that, pursuing the ideas 
in the description.


> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-17 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887089#comment-16887089
 ] 

Mike Sokolov commented on LUCENE-8920:
--

Note: I pushed the old-format Kuromoji dictionary and it seems to have fixed 
the build

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-16 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886603#comment-16886603
 ] 

Mike Sokolov commented on LUCENE-8920:
--

Yes, that makes sense. Because we reverted the "current version" in FST.java, 
we can no longer read FSTs created with the newer version, so we need to revert 
the dictionary file.  I'll do that and run a full suite of tests just to make 
sure something else isn't still broken

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-16 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886592#comment-16886592
 ] 

Tomoko Uchida commented on LUCENE-8920:
---

FYI: This is not related to my revert, I think a kuromoji dictionary data 
(o.a.l.a.ja.dict.TokenInfoDictionary$fst.dat) also should be reverted 
(regenerated) when FST version is reverted.
This works for me.

{code}
$ ant -f lucene/analysis/kuromoji/build.xml regenerate
$ git status
Changes not staged for commit:
  (use "git add ..." to update what will be committed)
  (use "git checkout -- ..." to discard changes in working directory)

modified:   
lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict/TokenInfoDictionary$fst.dat
$ ant -f lucene/analysis/kuromoji/build.xml test
BUILD SUCCESSFUL
Total time: 33 seconds
{code}

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-16 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886513#comment-16886513
 ] 

Hoss Man commented on LUCENE-8920:
--

[~sokolov] - your revert on branch_8_2 seems to have broken most of the 
lucene/analysis/kuromoji tests with a common root cause...

{noformat}
  [junit4] ERROR   0.44s J0 | TestFactories.test <<<
   [junit4]> Throwable #1: java.lang.ExceptionInInitializerError
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([B1B94D34D92CDA93:39ED72EE77D0B76B]:0)
   [junit4]>at 
org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.getInstance(TokenInfoDictionary.java:62)
   [junit4]>at 
org.apache.lucene.analysis.ja.JapaneseTokenizer.(JapaneseTokenizer.java:215)
   [junit4]>at 
org.apache.lucene.analysis.ja.JapaneseTokenizerFactory.create(JapaneseTokenizerFactory.java:150)
   [junit4]>at 
org.apache.lucene.analysis.ja.JapaneseTokenizerFactory.create(JapaneseTokenizerFactory.java:82)
   [junit4]>at 
org.apache.lucene.analysis.ja.TestFactories$FactoryAnalyzer.createComponents(TestFactories.java:174)
   [junit4]>at 
org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:199)
   [junit4]>at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkResetException(BaseTokenStreamTestCase.java:427)
   [junit4]>at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:546)
   [junit4]>at 
org.apache.lucene.analysis.ja.TestFactories.doTestTokenizer(TestFactories.java:81)
   [junit4]>at 
org.apache.lucene.analysis.ja.TestFactories.test(TestFactories.java:60)
   [junit4]>at java.lang.Thread.run(Thread.java:748)
   [junit4]> Caused by: java.lang.RuntimeException: Cannot load 
TokenInfoDictionary.
   [junit4]>at 
org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.(TokenInfoDictionary.java:71)
   [junit4]>... 46 more
   [junit4]> Caused by: org.apache.lucene.index.IndexFormatTooNewException: 
Format version is not supported (resource 
org.apache.lucene.store.InputStreamDataInput@5f0dbb2f): 7 (needs to be between 
6 and 6)
   [junit4]>at 
org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:216)
   [junit4]>at 
org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:198)
   [junit4]>at org.apache.lucene.util.fst.FST.(FST.java:275)
   [junit4]>at org.apache.lucene.util.fst.FST.(FST.java:263)
   [junit4]>at 
org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.(TokenInfoDictionary.java:47)
   [junit4]>at 
org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.(TokenInfoDictionary.java:54)
   [junit4]>at 
org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.(TokenInfoDictionary.java:32)
   [junit4]>at 
org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.(TokenInfoDictionary.java:69)
   [junit4]>... 46 more

{noformat}

...perhaps due to "conflicting reverts" w/ LUCENE-8907 / LUCENE-8778 ?
/cc [~tomoko]

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-16 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886076#comment-16886076
 ] 

ASF subversion and git services commented on LUCENE-8920:
-

Commit d8b510bead86d4c6ec59063519894d207ee99d5e in lucene-solr's branch 
refs/heads/branch_8_2 from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d8b510b ]

LUCENE-8920: disable FST direct-addressing pending size reduction
  revert to FST version 6
  removed CHANGES entry


> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org