[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813106#comment-16813106 ]
Bruno Roustant edited comment on LUCENE-8753 at 4/9/19 8:15 AM:
----------------------------------------------------------------

It took me some time to run the wikimediumall 8 GB index (I initially didn't anticipate 1h of indexing, a little less for UniformSplit; then I hit an exception about facets).

Then I got results which surprised me: BlockTree and UniformSplit had the same QPS for Term and Phrase queries. I didn't understand why the behavior differs between a small and a large index.

Then I thought of 2 possible explanations:
* A much larger index could mean fewer OS IO cache hits. I ran the benchmark on a 16 GB laptop and on a 64 GB desktop, and got nearly no difference in my test.
* A much larger index could mean more results, so the time spent scoring and ranking the results could become much larger and diminish the effect of a change in the dictionary. I have no clue about this one at the moment.

Here is the result of wikimediumall on the 64 GB desktop:
(I used the -Jira option, but it does not seem to recognize the "color" tag)

||Task||QPS BT||StdDev BT||QPS CUS||StdDev CUS||Pct diff||
|Fuzzy1|72.81|3.11|21.77|0.71|{color:red}72%{color}-{color:red}67%{color}|
|Fuzzy2|66.77|3.77|20.41|0.67|{color:red}72%{color}-{color:red}66%{color}|
|Respell|8.85|0.64|6.02|0.33|{color:red}40%{color}-{color:red}22%{color}|
|PKLookup|130.83|3.96|121.66|12.37|{color:red}18%{color}-{color:green}5%{color}|
|Wildcard|25.03|1.33|23.93|1.19|{color:red}13%{color}-{color:green}6%{color}|
|HighTermMonthSort|19.03|2.55|18.40|1.56|{color:red}21%{color}-{color:green}21%{color}|
|Prefix3|12.47|0.82|12.10|0.78|{color:red}14%{color}-{color:green}10%{color}|
|LowTerm|182.95|14.94|177.97|18.67|{color:red}19%{color}-{color:green}17%{color}|
|IntNRQ|5.21|0.54|5.09|0.56|{color:red}21%{color}-{color:green}21%{color}|
|MedTerm|90.74|3.99|89.14|4.24|{color:red}10%{color}-{color:green}7%{color}|
|HighTerm|42.54|1.95|41.86|2.00|{color:red}10%{color}-{color:green}8%{color}|
|OrNotHighLow|532.96|16.16|526.86|24.40|{color:red}8%{color}-{color:green}6%{color}|
|HighSloppyPhrase|12.00|0.39|11.90|0.48|{color:red}7%{color}-{color:green}6%{color}|
|OrNotHighMed|53.64|1.08|53.37|1.22|{color:red}4%{color}-{color:green}3%{color}|
|MedSloppyPhrase|31.83|0.59|31.67|0.78|{color:red}4%{color}-{color:green}3%{color}|
|HighPhrase|32.24|0.85|32.09|0.81|{color:red}5%{color}-{color:green}4%{color}|
|LowSloppyPhrase|29.51|0.43|29.40|0.58|{color:red}3%{color}-{color:green}3%{color}|
|AndHighHigh|26.97|0.31|26.88|0.37|{color:red}2%{color}-{color:green}2%{color}|
|MedPhrase|4.95|0.16|4.94|0.15|{color:red}6%{color}-{color:green}6%{color}|
|AndHighMed|50.03|0.72|49.97|0.72|{color:red}2%{color}-{color:green}2%{color}|
|OrNotHighHigh|18.85|0.76|18.85|0.82|{color:red}8%{color}-{color:green}8%{color}|
|OrHighNotHigh|9.35|0.32|9.35|0.35|{color:red}6%{color}-{color:green}7%{color}|
|OrHighLow|15.85|0.59|15.85|0.52|{color:red}6%{color}-{color:green}7%{color}|
|OrHighNotLow|17.56|0.71|17.57|0.70|{color:red}7%{color}-{color:green}8%{color}|
|AndHighLow|284.39|4.41|284.60|5.65|{color:red}3%{color}-{color:green}3%{color}|
|LowPhrase|224.73|4.35|224.97|4.84|{color:red}3%{color}-{color:green}4%{color}|
|OrHighNotMed|13.21|0.49|13.22|0.50|{color:red}7%{color}-{color:green}7%{color}|
|OrHighMed|13.22|0.73|13.30|0.70|{color:red}9%{color}-{color:green}12%{color}|
|OrHighHigh|7.56|0.43|7.62|0.41|{color:red}9%{color}-{color:green}12%{color}|
|BrowseMonthTaxoFacets|7.96|1.92|8.06|1.78|{color:red}36%{color}-{color:green}63%{color}|
|LowSpanNear|11.84|0.19|11.99|0.21|{color:red}2%{color}-{color:green}4%{color}|
|HighTermDayOfYearSort|20.05|1.40|20.31|2.15|{color:red}15%{color}-{color:green}20%{color}|
|BrowseDayOfYearTaxoFacets|7.96|1.91|8.07|1.85|{color:red}37%{color}-{color:green}64%{color}|
|BrowseMonthSSDVFacets|7.95|1.90|8.07|1.87|{color:red}37%{color}-{color:green}64%{color}|
|BrowseDayOfYearSSDVFacets|7.96|1.93|8.08|1.84|{color:red}36%{color}-{color:green}64%{color}|
|MedSpanNear|10.50|0.18|10.67|0.21|{color:red}2%{color}-{color:green}5%{color}|
|BrowseDateTaxoFacets|7.91|1.81|8.07|1.83|{color:red}35%{color}-{color:green}62%{color}|
|HighSpanNear|8.68|0.19|8.88|0.19|{color:red}2%{color}-{color:green}6%{color}|

> New PostingFormat - UniformSplit
> --------------------------------
>
>                 Key: LUCENE-8753
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8753
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.0
>            Reporter: Bruno Roustant
>            Priority: Major
>         Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4
> objectives:
> - Clear design and simple code.
> - Easily extensible, for both the logic and the index format.
> - Light memory usage with a very compact FST.
> - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the attached pdf visually explains the technique in more detail)
> The principle is to split the list of terms into blocks and use an FST to
> access the block, not as a prefix trie but with a seek-floor pattern.
> For the selection of the blocks, there is a target average block size (number
> of terms), with an allowed delta variation (10%) within which the terms are
> compared to select the one with the minimal distinguishing prefix.
> There are also several optimizations inside the block to make it more
> compact and to speed up loading and scanning.
> The performance obtained with the luceneutil benchmark, comparing
> UniformSplit with BlockTree, is interesting. Find it in the first comment;
> it is also attached for better formatting.
> Although the precise percentages vary between runs, three main points stand
> out:
> - TermQuery and PhraseQuery are improved.
> - PrefixQuery and WildcardQuery are ok.
> - Fuzzy queries are clearly less performant, because BlockTree is so
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as
> BlockTree does.
> This initial version passes all Lucene tests. Use "ant test
> -Dtests.codec=UniformSplitTesting" to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we
> have already exercised this PostingsFormat's extensibility to create a
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
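To make the block-selection and seek-floor idea in the description concrete, here is a minimal sketch. It is not the code from the patch: a `TreeMap` stands in for the FST, terms are plain `String`s rather than bytes, and the class name `UniformSplitSketch` and its methods are hypothetical. It assumes the input terms are sorted and distinct. The example uses a 25% delta instead of 10% only because the sample list is tiny.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; a TreeMap replaces the FST used by the real format.
final class UniformSplitSketch {

    /** Shortest prefix of term that still sorts strictly after previous (requires previous < term). */
    static String minimalDistinguishingPrefix(String previous, String term) {
        int common = 0;
        int max = Math.min(previous.length(), term.length());
        while (common < max && previous.charAt(common) == term.charAt(common)) {
            common++;
        }
        return term.substring(0, common + 1);
    }

    /** Splits sorted, distinct terms into blocks keyed by the chosen distinguishing prefix. */
    static TreeMap<String, List<String>> buildBlocks(List<String> sortedTerms, int targetSize, double delta) {
        TreeMap<String, List<String>> blocks = new TreeMap<>();
        int slack = (int) Math.round(targetSize * delta);
        int start = 0;
        String key = ""; // the first block is reachable from any term
        while (start < sortedTerms.size()) {
            int lo = Math.max(start + targetSize - slack, start + 1);
            int hi = Math.min(start + targetSize + slack, sortedTerms.size() - 1);
            int end = sortedTerms.size(); // default: the last block takes all remaining terms
            String nextKey = null;
            // Among the candidate boundaries, keep the term with the shortest distinguishing prefix.
            for (int i = lo; i <= hi; i++) {
                String mdp = minimalDistinguishingPrefix(sortedTerms.get(i - 1), sortedTerms.get(i));
                if (nextKey == null || mdp.length() < nextKey.length()) {
                    nextKey = mdp;
                    end = i;
                }
            }
            blocks.put(key, sortedTerms.subList(start, end));
            key = nextKey;
            start = end;
        }
        return blocks;
    }

    /** Seek-floor lookup: find the block with the greatest key <= term, then scan that block. */
    static boolean contains(TreeMap<String, List<String>> blocks, String term) {
        Map.Entry<String, List<String>> block = blocks.floorEntry(term);
        return block != null && block.getValue().contains(term);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "apricot", "banana", "blueberry", "cherry",
                "cranberry", "date", "fig", "grape", "kiwi", "lemon", "mango");
        TreeMap<String, List<String>> blocks = buildBlocks(terms, 4, 0.25);
        System.out.println(blocks.keySet());
        System.out.println(contains(blocks, "date"));
        System.out.println(contains(blocks, "cat"));
    }
}
```

Because each block key sorts after the last term of the previous block and no later than the first term of its own block, a floor lookup on the keys always lands in the only block that can hold the term.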
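For reading the Pct diff column of the benchmark table, the sketch below shows one way the two bounds can be derived from the QPS and StdDev columns. The formula is an assumption inferred from the numbers, not taken from the luceneutil source: each QPS is shifted by one standard deviation in the unfavorable (then favorable) direction, and the result is truncated toward zero. Read with red as negative and green as positive, it reproduces the Fuzzy1 and PKLookup rows. The class name `PctDiffRange` is hypothetical.

```java
// Hypothetical helper; the formula is inferred from the table, not copied from luceneutil.
final class PctDiffRange {

    /** Returns {worst, best} percentage change of the candidate (CUS) relative to the baseline (BT). */
    static double[] range(double qpsBase, double sdBase, double qpsCmp, double sdCmp) {
        // Worst case: candidate one stddev slower, baseline one stddev faster.
        double worst = 100.0 * ((qpsCmp - sdCmp) - (qpsBase + sdBase)) / (qpsBase + sdBase);
        // Best case: candidate one stddev faster, baseline one stddev slower.
        double best = 100.0 * ((qpsCmp + sdCmp) - (qpsBase - sdBase)) / (qpsBase - sdBase);
        return new double[] {worst, best};
    }

    public static void main(String[] args) {
        double[] fuzzy1 = range(72.81, 3.11, 21.77, 0.71);
        double[] pkLookup = range(130.83, 3.96, 121.66, 12.37);
        // Casting to int truncates toward zero, matching the whole percentages in the table.
        System.out.println("Fuzzy1:   " + (int) fuzzy1[0] + "% to " + (int) fuzzy1[1] + "%");
        System.out.println("PKLookup: " + (int) pkLookup[0] + "% to " + (int) pkLookup[1] + "%");
    }
}
```

Under this reading, a row whose range spans zero (red lower bound, green upper bound) shows no difference distinguishable from run-to-run noise, which is the case for every task except Fuzzy1, Fuzzy2 and Respell.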