[ https://issues.apache.org/jira/browse/LUCENE-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-2410: --------------------------------------- Attachment: LUCENE-2410.patch Attached initial rough patch, doing the 1st and 3rd bullets above. Still many nocommits, but all tests pass. I only did this for the exact case (I don't understand the sloppy case!), so I modified ExactPhraseScorer to no longer subclass PhraseScorer and instead do everything on its own. I tested on a 20M doc Wikipedia index, best of 10 runs: ||Query||No. hits||Trunk QPS||Patch QPS||Speedup|| |United States|314K|4.29|11.04|2.6X faster| |United Kingdom Parliament|7K|20.33|58.57|2.9X faster| The speedup is great :) However, there's one problem w/ the patch that I must fix (and will bring these gains down), which is it requires 2 int arrays sized to the max position encountered during the search (which for a large doc could be very large). I think to make this committable I'd have to switch to processing the positions in chunks (like BooleanScorer). > Optimize PhraseQuery > -------------------- > > Key: LUCENE-2410 > URL: https://issues.apache.org/jira/browse/LUCENE-2410 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Reporter: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-2410.patch, LUCENE-2410_rewrite.patch > > > Looking the scorers for PhraseQuery, I think there are some speedups > we could do: > * The AND part of the scorer (which advances to the next doc that > has all the terms), in PhraseScorer.doNext, should do the same > optimizing as BooleanQuery's ConjunctionScorer, ie sort terms from > rarest to most frequent. I don't think it should use a linked > list/firstToLast() that it does today. > * We do way too much work now when .score() is not called, because > we go and find all occurrences of the phrase in the doc, whereas > we should stop only after finding the first and then go and count > the rest if .score() is called. > * For the exact case, I think we can use two int arrays to find the > matches. The first array holds the count of how many times a term > in the phrase "matched" a phrase starting at that position. When > that count == the number of terms in the phrase, it's a match. > The 2nd is a "gen" array (holds docID when that count was last > touched), to avoid clearing. Ie when incrementing the count, if > the docID != gen, we reset count to 0. I think this'd be faster > than the PQ we now use. Downside of this is if you have immense > docs (position gets very large) we'd need 2 immense arrays. > It'd be great to do LUCENE-1252 along with this, ie factor > PhraseScorer into two AND'd sub-scorers (LUCENE-1252 is open for > this). The first one should be ConjunctionScorer, and the 2nd one > checks the positions (ie, either the exact or sloppy scorers). This > would mean if the PhraseQuery is AND'd w/ other clauses (or, a filter > is applied) we would save CPU by not checking the positions for a doc > unless all other AND'd clauses accepted the doc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org