[ https://issues.apache.org/jira/browse/LUCENE-6421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-6421:
--------------------------------
    Attachment: LUCENE-6421_luceneutil.patch
                LUCENE-6421.patch

See the attached patch and the benchmark modifications / tasks file.

* no longer keeps subs "one document ahead"; it's like a normal disjunction
* positions reading/merging is deferred until freq() is called
* general cleanups

The problem with the current code is more than just two-phase iteration. Because it always reads all positions from all subs on nextDoc()/advance(), it slows down even the simplest multiphrase queries, like these added to the tasks file:

{noformat}
MultiPhraseHHH: multiPhrase//(body:in|of the)
MultiPhraseHHM: multiPhrase//(body:in|of your)
MultiPhraseHHL: multiPhrase//(body:in|of harvard)
MultiPhraseMMH: multiPhrase//(body:northern|southern states)
MultiPhraseMMM: multiPhrase//(body:northern|southern usa)
MultiPhraseMML: multiPhrase//(body:northern|southern iraq)
{noformat}

So in the example of northern|southern states, today all positions are read from either or both of 'northern' and 'southern', regardless of whether 'states' is present in the doc at all. Filters will only aggravate the situation.
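To illustrate the idea of deferring position reading/merging until freq() is called, here is a minimal, self-contained sketch. It is not the code from the attached patch and does not use Lucene's classes: SubPostings, LazyUnion, and the positionReads counter are hypothetical names invented for this example, and postings are modeled as plain in-memory arrays.

```java
import java.util.Arrays;

/** Hypothetical stand-in for one term's postings: sorted doc IDs, each with sorted positions. */
final class SubPostings {
    private final int[] docs;
    private final int[][] positions;
    private int idx = -1;
    int positionReads = 0; // how many times positions were actually read (for illustration)

    SubPostings(int[] docs, int[][] positions) {
        this.docs = docs;
        this.positions = positions;
    }

    int docID() {
        if (idx < 0) return -1;
        return idx < docs.length ? docs[idx] : LazyUnion.NO_MORE_DOCS;
    }

    int nextDoc() {
        idx++;
        return docID();
    }

    int[] readPositions() {
        positionReads++;
        return positions[idx];
    }
}

/** Union of several subs that only touches positions when freq() or positions() is called. */
final class LazyUnion {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final SubPostings[] subs;
    private int doc = -1;
    private int[] merged; // null until positions are requested for the current doc

    LazyUnion(SubPostings... subs) {
        this.subs = subs;
    }

    /** Moves to the next doc matched by any sub. No positions are read here. */
    int nextDoc() {
        int min = NO_MORE_DOCS;
        for (SubPostings sub : subs) {
            if (sub.docID() == doc) {
                sub.nextDoc(); // only subs sitting on the current doc need to move
            }
            min = Math.min(min, sub.docID());
        }
        doc = min;
        merged = null; // invalidate any positions cached for the previous doc
        return doc;
    }

    /** Reads and merges positions from the subs on the current doc, at most once per doc. */
    int freq() {
        if (doc == -1 || doc == NO_MORE_DOCS) {
            throw new IllegalStateException("not positioned on a document");
        }
        if (merged == null) {
            int[] buf = new int[0];
            for (SubPostings sub : subs) {
                if (sub.docID() == doc) {
                    int[] p = sub.readPositions();
                    int old = buf.length;
                    buf = Arrays.copyOf(buf, old + p.length);
                    System.arraycopy(p, 0, buf, old, p.length);
                }
            }
            Arrays.sort(buf);
            merged = buf;
        }
        return merged.length;
    }

    int[] positions() {
        freq();
        return merged;
    }
}
```

In this sketch, a caller that steps over a document without asking for freq() (for example because 'states' is absent from it) never triggers a position read on 'northern' or 'southern'; the counter on each sub only increases for documents where positions are actually requested.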
Benchmarking these is super-slow, but after a few iterations it looks like this:

{noformat}
          Task    QPS trunk   StdDev    QPS patch   StdDev    Pct diff
MultiPhraseHHH         0.34   (2.1%)         0.33   (1.4%)    -2.1% (  -5% -   1%)
MultiPhraseHHL        17.26   (0.7%)        17.67   (0.5%)     2.3% (   1% -   3%)
MultiPhraseHHM         5.13   (1.6%)         5.34   (0.3%)     4.1% (   2% -   6%)
MultiPhraseMMH        33.99   (1.3%)        39.19   (0.7%)    15.3% (  13% -  17%)
MultiPhraseMML       160.11   (0.2%)       202.29   (0.6%)    26.3% (  25% -  27%)
MultiPhraseMMM        72.20   (1.7%)        95.66   (2.0%)    32.5% (  28% -  36%)
{noformat}

> Add two-phase support to MultiPhraseQuery
> -----------------------------------------
>
>                 Key: LUCENE-6421
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6421
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-6421.patch, LUCENE-6421_luceneutil.patch
>
>
> Two-phase support currently works for both sloppy and exact Scorers but it
> does not work if you have multiple terms at the same position
> (MultiPhraseQuery).
> This is because UnionPostingsEnum.nextDoc() aggressively reads and merges all
> the positions. Even making this initialization lazy might just be enough?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)