some mistakes of the example: after first call advance(5) currentDoc=6 first scorer's nextDoc is called to in advance, the heap is empty now. then call advance(6) because scorerDocQueue.size() < minimumNrMatchers, it just return NO_MORE_DOCS
On Tue, Apr 17, 2012 at 6:37 PM, Li Li <fancye...@gmail.com> wrote: > hi all, > I am now hacking the BooleanScorer2 to let it keep the docID() of the > leaf scorer(mostly possible TermScorer) the same as the top-level Scorer. > Why I want to do this is: When I Collect a doc, I want to know which term > is matched(especially for BooleanClause whose Occur is SHOULD). we have > discussed some solutions, such as adding bit masks in disjunction scorers. > with this method, when we finds a matched doc, we can recursively find > which leaf scorer is matched. But we think it's not very efficient and not > convenient to use(this is my proposal but not agreed by others in our > team). and then we came up with another one: Modifying DisjunctionSumScorer. > we analysed the codes and found that the only Scorers used by > BooleanScorer2 that will make the children scorers' docID() not equal to > parent is an anonymous class inherited from DisjunctionSumScorer. All other > ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous), > dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need. > The implementation algorithm of DisjunctionSumScorer use a heap to find > the smallest doc. after finding a matched doc, the currentDoc is the > matched doc and all the scorers in the heap will call nextDoc() so all of > the scorers' current docID the nextDoc of currentDoc. if there are N level > DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId > of the root of the scorer tree. > So we modify the DisjuctionSumScorer and let it behavior as we > expected. And then I wrote some TestCase and it works well. And also I > wrote some random generated TermScorer and compared the nextDoc(),score() > and advance(int) method of original DisjunctionSumScorer and modified one. > nextDoc() and score() and exactly the same. But for advance(int target), we > found some interesting and strange things. > at the beginning, I think if target is less than current docID, it will > just return current docID and do nothing. this assumption let my algorithm > go wrong. Then I read the codes of TermScorer and found each call of > advance(int) of TermScorer will call nextDoc() no matter whether current > docID is larger than target or not. > So I am confused and then read the javadoc of DocIdSetIterator: > ----------------- javadoc of DocIdSetIterator.advance(int > target)------------- > > int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws > IOException > > Advances to the first beyond (see NOTE below) the current whose document > number is greater than or equal > to target. Returns the current document number or NO_MORE_DOCS if there > are no more docs in the set. > Behaves as if written: > int advance(int target) { > int doc; > while ((doc = nextDoc()) < target) { > } > return doc; > } > Some implementations are considerably more efficient than that. > NOTE: when target < current implementations may opt not to advance beyond > their current docID(). > NOTE: this method may be called with NO_MORE_DOCS for efficiency by some > Scorers. If your > implementation cannot efficiently determine that it should exhaust, it is > recommended that you check for > that value in each call to this method. > NOTE: after the iterator has exhausted you should not call this method, as > it may result in unpredicted > behavior. > -------------------------------------- > Then I modified my algorithm again and found that > DisjunctionSumScorer.advance(int target) has some strange behavior. most of > the cases, it will return currentDoc if target < currentDoc. but in some > boundary condition, it will not. > it's not a bug but let me sad. I thought my algorithm has some bug because > it's advance method is not exactly the same as original > DisjunctionSumScorer's. > ----codes of DisjunctionSumScorer--- > @Override > public int advance(int target) throws IOException { > if (scorerDocQueue.size() < minimumNrMatchers) { > return currentDoc = NO_MORE_DOCS; > } > if (target <= currentDoc) { > return currentDoc; > } > .... > ------------------- > for most case if (target <= currentDoc) it will return currentDoc; > but if previous advance will make sub scorers exhausted, then if may > return NO_MORE_DOCS > an example is: > currentDoc=-1 > minimumNrMatchers=1 > subScorers: > TermScorer: docIds: [1, 2, 6] > TermScorer: docIds: [2, 4] > after first call advance(5) > currentDoc=6 > only first scorer is now in the heap, scorerDocQueue.size()==1 > then call advance(6) > because scorerDocQueue.size() < minimumNrMatchers, it just return > NO_MORE_DOCS > > My question is why the advance(int target) method is defined like this? > for the reason of efficient or any other reasons? > >