[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976709#action_12976709 ]
Michael McCandless commented on LUCENE-2843:
--------------------------------------------

As a first test, I just made a policy that's identical to the fixed gap terms index, ie, it just picks every 32nd term as the index term. So this is really a test of the packed int/bytes vs FST.

On the 10M Wikipedia test index, the resulting terms index files (= RAM used by SegmentReader) are ~38% smaller (~52% once optimized -- FST "scales up" well).

Here's the query perf vs trunk:

||Query||QPS base||QPS vargap||Pct diff||
|spanFirst(unit, 5)|17.13|16.75|{color:red}-2.2%{color}|
|"unit state"~3|5.31|5.20|{color:red}-2.1%{color}|
|spanNear([unit, state], 10, true)|4.59|4.52|{color:red}-1.4%{color}|
|"unit state"|7.86|7.77|{color:red}-1.1%{color}|
|+nebraska +state|204.74|202.85|{color:red}-0.9%{color}|
|+unit +state|11.37|11.30|{color:red}-0.6%{color}|
|doctimesecnum:[10000 TO 60000]|9.74|9.76|{color:green}0.2%{color}|
|unit~1.0|21.70|21.82|{color:green}0.6%{color}|
|unit*|26.18|26.55|{color:green}1.4%{color}|
|state|29.29|29.75|{color:green}1.6%{color}|
|uni*|15.06|15.32|{color:green}1.7%{color}|
|unit state|10.73|10.93|{color:green}1.9%{color}|
|unit~2.0|21.05|21.45|{color:green}1.9%{color}|
|un*d|77.10|79.65|{color:green}3.3%{color}|
|u*d|26.41|28.81|{color:green}9.1%{color}|
|united~1.0|102.27|116.88|{color:green}14.3%{color}|
|united~2.0|25.47|31.18|{color:green}22.4%{color}|

It's great that for the seek-intensive fuzzy queries, the FST-based seeking is substantially faster. For other queries the term seek time is in the noise.

I think we should make this (VariableGapTermsIndex) terms index impl the default (for Standard codec).

> Add variable-gap terms index impl.
> ----------------------------------
>
> Key: LUCENE-2843
> URL: https://issues.apache.org/jira/browse/LUCENE-2843
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2843.patch
>
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM. This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though, it does not support ord() so it's not quite a fair
> comparison.
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor
> when the FST is used as a terms index but seekCeil when it's holding
> all terms in the index (ie which SimpleText uses FSTs for).
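
To make the seekFloor/seekCeil distinction in the issue description concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class and method names (TermsIndexSketch, buildFixedGapIndex, seekStart, seekCeil) are hypothetical, and a java.util.TreeMap stands in for the FST purely to show the lookup semantics. It assumes a sorted term list and a fixed gap, like the "every 32nd term" policy used in the test above, with a gap of 3 so the example stays small.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: hypothetical names, and a TreeMap stands in for the FST. */
class TermsIndexSketch {

    /** Picks the terms at positions 0, gap, 2*gap, ... and maps each one to its position. */
    static TreeMap<String, Integer> buildFixedGapIndex(List<String> sortedTerms, int gap) {
        TreeMap<String, Integer> index = new TreeMap<>();
        for (int i = 0; i < sortedTerms.size(); i += gap) {
            index.put(sortedTerms.get(i), i);
        }
        return index;
    }

    /**
     * Terms-index lookup with floor semantics: returns the position of the
     * largest index term that is <= target, or 0 if target sorts before all of them.
     */
    static int seekStart(TreeMap<String, Integer> index, String target) {
        Map.Entry<String, Integer> floor = index.floorEntry(target);
        return floor == null ? 0 : floor.getValue();
    }

    /**
     * Full lookup with ceil semantics: starts at the position found via the
     * index and scans forward to the smallest term >= target; null if there is none.
     */
    static String seekCeil(List<String> sortedTerms, TreeMap<String, Integer> index, String target) {
        for (int i = seekStart(index, target); i < sortedTerms.size(); i++) {
            if (sortedTerms.get(i).compareTo(target) >= 0) {
                return sortedTerms.get(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
            "state", "states", "uni", "unit", "united", "unity", "utah");
        TreeMap<String, Integer> index = buildFixedGapIndex(terms, 3);
        System.out.println(index.keySet());                  // the indexed terms
        System.out.println(seekStart(index, "unite"));       // where the scan starts
        System.out.println(seekCeil(terms, index, "unite")); // the term the seek lands on
    }
}
```

The index needs floor semantics because the scan has to begin at or before the target, while the enumeration over all terms needs ceil semantics because a seek should land on the first term at or after the target. A variable-gap policy would change only which terms go into the index, not these two lookups.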