[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930608#comment-16930608 ]
Bruno Roustant commented on LUCENE-8920: ---------------------------------------- Open-addressing benchmark to store byte labels in an array of size between 8 and 256. "Array" means array size. "tries" means max number of slot comparisons allowed (when a slot we want to put a label to is not empty - move to the next slot). "loadFactor" = max load factor (num labels / array size) for which we can add all the labels by respecting "tries". Above this max load factor, the open-addressing fails to store all labels and fallbacks to binary search. This max load factor is an average of 1000 random constructions done by the benchmark. "memory" = space increase factor = 1 / loadFactor. Array=8, tries=1, loadFactor=0.400125, memory x 2.499219 Array=8, tries=2, loadFactor=0.63825, memory x 1.5667841 Array=8, tries=3, loadFactor=0.78675, memory x 1.2710518 Array=12, tries=1, loadFactor=0.34391674, memory x 2.9076805 Array=12, tries=2, loadFactor=0.54183316, memory x 1.8455865 Array=12, tries=3, loadFactor=0.6964172, memory x 1.4359208 Array=16, tries=1, loadFactor=0.29825, memory x 3.352892 Array=16, tries=2, loadFactor=0.5039375, memory x 1.9843731 Array=16, tries=3, loadFactor=0.646625, memory x 1.5464914 Array=16, tries=4, loadFactor=0.7493125, memory x 1.3345567 Array=24, tries=1, loadFactor=0.25141662, memory x 3.9774618 Array=24, tries=2, loadFactor=0.44929162, memory x 2.225726 Array=24, tries=3, loadFactor=0.58083373, memory x 1.7216631 Array=24, tries=4, loadFactor=0.6867924, memory x 1.4560441 Array=32, tries=1, loadFactor=0.22528125, memory x 4.4388957 Array=32, tries=2, loadFactor=0.4059375, memory x 2.4634335 Array=32, tries=3, loadFactor=0.53625, memory x 1.8648019 Array=32, tries=4, loadFactor=0.63653123, memory x 1.5710148 Array=32, tries=5, loadFactor=0.7165625, memory x 1.3955517 Array=48, tries=1, loadFactor=0.18899995, memory x 5.2910066 Array=48, tries=2, loadFactor=0.36193773, memory x 2.762906 Array=48, tries=3, loadFactor=0.49154142, memory x 2.0344167 Array=48, tries=4, loadFactor=0.5871248, memory x 1.7032154 Array=48, tries=5, loadFactor=0.6700212, memory x 1.4924902 Array=64, tries=1, loadFactor=0.17101562, memory x 5.8474193 Array=64, tries=2, loadFactor=0.33707812, memory x 2.9666712 Array=64, tries=3, loadFactor=0.46217188, memory x 2.1636972 Array=64, tries=4, loadFactor=0.56409377, memory x 1.7727549 Array=64, tries=5, loadFactor=0.6392813, memory x 1.5642567 Array=64, tries=6, loadFactor=0.6963125, memory x 1.4361368 Array=96, tries=1, loadFactor=0.15294789, memory x 6.5381746 Array=96, tries=2, loadFactor=0.31204194, memory x 3.2046974 Array=96, tries=3, loadFactor=0.42829168, memory x 2.3348575 Array=96, tries=4, loadFactor=0.5277291, memory x 1.8949116 Array=96, tries=5, loadFactor=0.60920805, memory x 1.6414753 Array=96, tries=6, loadFactor=0.66278124, memory x 1.5087935 Array=128, tries=1, loadFactor=0.15130469, memory x 6.6091805 Array=128, tries=2, loadFactor=0.3054219, memory x 3.2741597 Array=128, tries=3, loadFactor=0.43253124, memory x 2.3119717 Array=128, tries=4, loadFactor=0.5283203, memory x 1.8927912 Array=128, tries=5, loadFactor=0.6043203, memory x 1.6547517 Array=128, tries=6, loadFactor=0.66267186, memory x 1.5090425 Array=128, tries=7, loadFactor=0.70567185, memory x 1.4170892 Array=192, tries=1, loadFactor=0.14355215, memory x 6.9661093 Array=192, tries=2, loadFactor=0.32579714, memory x 3.0693946 Array=192, tries=3, loadFactor=0.42888057, memory x 2.3316514 Array=192, tries=4, loadFactor=0.5238486, memory x 1.9089485 Array=192, tries=5, loadFactor=0.59191173, memory x 1.6894411 Array=192, tries=6, loadFactor=0.65411395, memory x 1.5287856 Array=192, tries=7, loadFactor=0.6975258, memory x 1.4336387 Array=256, tries=1, loadFactor=1.0, memory x 1.0 Array=256, tries=2, loadFactor=1.0, memory x 1.0 Array=256, tries=3, loadFactor=1.0, memory x 1.0 Array=256, tries=4, loadFactor=1.0, memory x 1.0 Array=256, tries=5, loadFactor=1.0, memory x 1.0 Array=256, tries=6, loadFactor=1.0, memory x 1.0 Array=256, tries=7, loadFactor=1.0, memory x 1.0 Array=256, tries=8, loadFactor=1.0, memory x 1.0 > Reduce size of FSTs due to use of direct-addressing encoding > ------------------------------------------------------------- > > Key: LUCENE-8920 > URL: https://issues.apache.org/jira/browse/LUCENE-8920 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Mike Sokolov > Priority: Blocker > Fix For: 8.3 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Some data can lead to worst-case ~4x RAM usage due to this optimization. > Several ideas were suggested to combat this on the mailing list: > bq. I think we can improve thesituation here by tracking, per-FST instance, > the size increase we're seeing while building (or perhaps do a preliminary > pass before building) in order to decide whether to apply the encoding. > bq. we could also make the encoding a bit more efficient. For instance I > noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) > which make gaps very costly. Associating each label with a dense id and > having an intermediate lookup, ie. lookup label -> id and then id->arc offset > instead of doing label->arc directly could save a lot of space in some cases? > Also it seems that we are repeating the label in the arc metadata when > array-with-gaps is used, even though it shouldn't be necessary since the > label is implicit from the address? -- This message was sent by Atlassian Jira (v8.3.2#803003) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org