[ https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003723#comment-13003723 ]
Michael McCandless commented on LUCENE-2948:
--------------------------------------------

I ran perf test w/ latest patch, and also fixed luceneutil to track stddev of the measures:

||Task||QPS base||StdDev base||QPS bushy||StdDev bushy||Pct diff||
|Fuzzy1|32.41|1.29|21.68|0.65|{color:red}37%{color}-{color:red}28%{color}|
|Fuzzy2|21.59|0.79|14.54|0.41|{color:red}36%{color}-{color:red}28%{color}|
|Respell|29.52|0.95|28.70|1.07|{color:red}9%{color}-{color:green}4%{color}|
|IntNRQ|9.16|1.10|9.03|1.04|{color:red}22%{color}-{color:green}24%{color}|
|Wildcard|53.88|3.10|53.25|2.96|{color:red}11%{color}-{color:green}10%{color}|
|Prefix3|31.75|2.73|31.41|2.47|{color:red}16%{color}-{color:green}16%{color}|
|SloppyPhrase|6.28|0.26|6.24|0.31|{color:red}9%{color}-{color:green}8%{color}|
|AndHighHigh|10.81|0.40|10.80|0.36|{color:red}6%{color}-{color:green}7%{color}|
|Phrase|6.79|0.42|6.80|0.41|{color:red}11%{color}-{color:green}13%{color}|
|AndHighMed|47.27|1.37|47.34|1.10|{color:red}4%{color}-{color:green}5%{color}|
|SpanNear|7.72|0.43|7.78|0.34|{color:red}8%{color}-{color:green}11%{color}|
|Term|34.21|3.12|35.36|3.04|{color:red}13%{color}-{color:green}23%{color}|
|OrHighHigh|12.16|1.18|12.60|1.15|{color:red}14%{color}-{color:green}25%{color}|
|OrHighMed|15.20|1.44|15.76|1.43|{color:red}13%{color}-{color:green}24%{color}|
|PKLookup|39.59|1.08|56.36|2.07|{color:green}33%{color}-{color:green}51%{color}|

The range on the % gain/loss takes the best/worst ends of +/- one stddev.

Unfortunately, the patch slows down fuzzy queries, I think because the cost of checking for the next possible prefix exceeds any savings. Though, this is a hot test; it's possible we'd see gains w/ a cold test since we are doing less seeking.

But for PK lookup the gains are sizable.
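For reference, here is a minimal sketch of how the Pct diff range appears to be derived. The formula is an assumption, not taken from luceneutil: the worst case divides the patched QPS minus one stddev by the base QPS plus one stddev, and the best case does the reverse. Truncated to whole percent, it reproduces the Fuzzy1 and PKLookup rows, with red entries read as losses and green entries as gains. The function name `pct_diff_range` is made up for this example.

```python
def pct_diff_range(base_qps, base_std, new_qps, new_std):
    """Return (worst, best) percent change of new vs. base at +/- one stddev.

    Assumed formula: worst case pairs the slowest plausible new run with the
    fastest plausible base run; best case pairs the opposite ends.
    """
    worst = (new_qps - new_std) / (base_qps + base_std) - 1.0
    best = (new_qps + new_std) / (base_qps - base_std) - 1.0
    return 100.0 * worst, 100.0 * best


# Inputs copied from the Fuzzy1 and PKLookup rows of the table.
fuzzy1 = pct_diff_range(32.41, 1.29, 21.68, 0.65)
pk_lookup = pct_diff_range(39.59, 1.08, 56.36, 2.07)

print("Fuzzy1:   %.1f%% to %.1f%%" % fuzzy1)
print("PKLookup: %.1f%% to %.1f%%" % pk_lookup)
```

A negative value is a slowdown and a positive value is a speedup. When the two ends have opposite signs, as for most tasks in the table, the difference is within one stddev of noise.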
Note, though, that the PK lookup gains only apply to PK values that are "tight", eg sequential IDs, not to GUIDs, and only then on a relatively "fresh" index (ie after many updates in randomish order the PKs will be randomly distributed and the gains will be gone).

I think for now we should not expose the nextPossiblePrefix (or at least not use it from ATE)?

> Make var gap terms index a partial prefix trie
> ----------------------------------------------
>
> Key: LUCENE-2948
> URL: https://issues.apache.org/jira/browse/LUCENE-2948
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2948.patch, LUCENE-2948.patch, LUCENE-2948.patch,
> LUCENE-2948_automaton.patch
>
>
> Var gap stores (in an FST) the indexed terms (every 32nd term, by
> default), minus their non-distinguishing suffixes.
> However, often times the resulting FST is "close" to a prefix trie in
> some portion of the terms space.
> By allowing some nodes of the FST to store all outgoing edges,
> including ones that do not lead to an indexed term, and by recording
> that this node is then "authoritative" as to what terms exist in the
> terms dict from that prefix, we can get some important benefits:
> * It becomes possible to know that a certain term prefix cannot
> exist in the terms index, which means we can save a disk seek in
> some cases (like PK lookup, docFreq, etc.)
> * We can query for the next possible prefix in the index, allowing
> some MTQs (eg FuzzyQuery) to save disk seeks.
> Basically, the terms index is able to answer questions that previously
> required seeking/scanning in the terms dict file.

--
This message is automatically generated by JIRA.
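The "authoritative" node idea from the issue description can be illustrated with a toy trie. This is only a sketch under stated assumptions: it uses a plain dictionary-based trie rather than Lucene's FST, and the names `Node`, `PartialPrefixTrie`, `mark_authoritative` and `may_exist` are hypothetical and do not come from the patch. It shows just the first benefit: ruling out a term without seeking into the terms dict.

```python
class Node:
    def __init__(self):
        self.children = {}          # outgoing edges: char -> Node
        self.authoritative = False  # True if children covers every edge in the terms dict


class PartialPrefixTrie:
    """Toy stand-in for a terms index; not Lucene's actual data structure."""

    def __init__(self):
        self.root = Node()

    def add_path(self, prefix):
        """Add an ordinary (non-authoritative) path, e.g. an indexed term prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.setdefault(ch, Node())
        return node

    def mark_authoritative(self, prefix, next_chars):
        """Store all edges leaving `prefix` and record that the list is complete."""
        node = self.add_path(prefix)
        for ch in next_chars:
            node.children.setdefault(ch, Node())
        node.authoritative = True

    def may_exist(self, term):
        """Return False only when the index alone proves the term is absent."""
        node = self.root
        for ch in term:
            child = node.children.get(ch)
            if child is None:
                # A missing edge is conclusive only at an authoritative node;
                # otherwise the terms dict still has to be consulted.
                return not node.authoritative
            node = child
        return True


trie = PartialPrefixTrie()
trie.add_path("id00")                 # an ordinary indexed term
trie.mark_authoritative("id0", "01")  # terms under "id0" continue only with '0' or '1'

print(trie.may_exist("id07"))   # ruled out by the authoritative node, no seek needed
print(trie.may_exist("id012"))  # cannot be ruled out from the index alone
print(trie.may_exist("zz"))     # root is not authoritative, so a seek is still required
```

In this toy version a `False` result corresponds to the saved disk seek, while `True` only means "not ruled out", so the terms dict lookup still happens. Answering "next possible prefix" queries would need an ordered walk over the authoritative edges, which is omitted here.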