[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930608#comment-16930608
 ] 

Bruno Roustant commented on LUCENE-8920:
----------------------------------------

Open-addressing benchmark to store byte labels in an array of size between 8 
and 256.

"Array" means array size.

"tries" means max number of slot comparisons allowed (when a slot we want to 
put a label to is not empty - move to the next slot).

"loadFactor" = max load factor (num labels / array size) for which we can add 
all the labels by respecting "tries". Above this max load factor, the 
open-addressing fails to store all labels and fallbacks to binary search. This 
max load factor is an average of 1000 random constructions done by the 
benchmark.

"memory" = space increase factor = 1 / loadFactor.

 

Array=8, tries=1, loadFactor=0.400125, memory x 2.499219
Array=8, tries=2, loadFactor=0.63825, memory x 1.5667841
Array=8, tries=3, loadFactor=0.78675, memory x 1.2710518
Array=12, tries=1, loadFactor=0.34391674, memory x 2.9076805
Array=12, tries=2, loadFactor=0.54183316, memory x 1.8455865
Array=12, tries=3, loadFactor=0.6964172, memory x 1.4359208
Array=16, tries=1, loadFactor=0.29825, memory x 3.352892
Array=16, tries=2, loadFactor=0.5039375, memory x 1.9843731
Array=16, tries=3, loadFactor=0.646625, memory x 1.5464914
Array=16, tries=4, loadFactor=0.7493125, memory x 1.3345567
Array=24, tries=1, loadFactor=0.25141662, memory x 3.9774618
Array=24, tries=2, loadFactor=0.44929162, memory x 2.225726
Array=24, tries=3, loadFactor=0.58083373, memory x 1.7216631
Array=24, tries=4, loadFactor=0.6867924, memory x 1.4560441
Array=32, tries=1, loadFactor=0.22528125, memory x 4.4388957
Array=32, tries=2, loadFactor=0.4059375, memory x 2.4634335
Array=32, tries=3, loadFactor=0.53625, memory x 1.8648019
Array=32, tries=4, loadFactor=0.63653123, memory x 1.5710148
Array=32, tries=5, loadFactor=0.7165625, memory x 1.3955517
Array=48, tries=1, loadFactor=0.18899995, memory x 5.2910066
Array=48, tries=2, loadFactor=0.36193773, memory x 2.762906
Array=48, tries=3, loadFactor=0.49154142, memory x 2.0344167
Array=48, tries=4, loadFactor=0.5871248, memory x 1.7032154
Array=48, tries=5, loadFactor=0.6700212, memory x 1.4924902
Array=64, tries=1, loadFactor=0.17101562, memory x 5.8474193
Array=64, tries=2, loadFactor=0.33707812, memory x 2.9666712
Array=64, tries=3, loadFactor=0.46217188, memory x 2.1636972
Array=64, tries=4, loadFactor=0.56409377, memory x 1.7727549
Array=64, tries=5, loadFactor=0.6392813, memory x 1.5642567
Array=64, tries=6, loadFactor=0.6963125, memory x 1.4361368
Array=96, tries=1, loadFactor=0.15294789, memory x 6.5381746
Array=96, tries=2, loadFactor=0.31204194, memory x 3.2046974
Array=96, tries=3, loadFactor=0.42829168, memory x 2.3348575
Array=96, tries=4, loadFactor=0.5277291, memory x 1.8949116
Array=96, tries=5, loadFactor=0.60920805, memory x 1.6414753
Array=96, tries=6, loadFactor=0.66278124, memory x 1.5087935
Array=128, tries=1, loadFactor=0.15130469, memory x 6.6091805
Array=128, tries=2, loadFactor=0.3054219, memory x 3.2741597
Array=128, tries=3, loadFactor=0.43253124, memory x 2.3119717
Array=128, tries=4, loadFactor=0.5283203, memory x 1.8927912
Array=128, tries=5, loadFactor=0.6043203, memory x 1.6547517
Array=128, tries=6, loadFactor=0.66267186, memory x 1.5090425
Array=128, tries=7, loadFactor=0.70567185, memory x 1.4170892
Array=192, tries=1, loadFactor=0.14355215, memory x 6.9661093
Array=192, tries=2, loadFactor=0.32579714, memory x 3.0693946
Array=192, tries=3, loadFactor=0.42888057, memory x 2.3316514
Array=192, tries=4, loadFactor=0.5238486, memory x 1.9089485
Array=192, tries=5, loadFactor=0.59191173, memory x 1.6894411
Array=192, tries=6, loadFactor=0.65411395, memory x 1.5287856
Array=192, tries=7, loadFactor=0.6975258, memory x 1.4336387
Array=256, tries=1, loadFactor=1.0, memory x 1.0
Array=256, tries=2, loadFactor=1.0, memory x 1.0
Array=256, tries=3, loadFactor=1.0, memory x 1.0
Array=256, tries=4, loadFactor=1.0, memory x 1.0
Array=256, tries=5, loadFactor=1.0, memory x 1.0
Array=256, tries=6, loadFactor=1.0, memory x 1.0
Array=256, tries=7, loadFactor=1.0, memory x 1.0
Array=256, tries=8, loadFactor=1.0, memory x 1.0

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to