To partially answer my own question, one key difference is in DefaultBulkScorer:

Lucene 4.10

  public boolean score(Collector collector, int max) throws IOException {
    ...
    if (max == DocIdSetIterator.NO_MORE_DOCS) {
      scoreAll(collector, scorer);
      return false;
    }
    ...
  }

Lucene 5.4

  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
    ...
    if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
      scoreAll(collector, scorer, twoPhase, acceptDocs);
      return DocIdSetIterator.NO_MORE_DOCS;
    }
    ...
  }

The condition to execute a scoreAll versus a scoreRange is more stringent in
Lucene 5.4. In my case, scorer.docID() is not equal to -1 at the start, so the
code path executes a scoreRange instead, which does not behave the same way as
scoreAll. Should a scorer now return -1 while it is in its initial (not yet
advanced) state?

Ivan

On Wed, Dec 30, 2015 at 6:17 PM, Ivan Brusic <i...@brusic.com> wrote:

> I potentially found the issue, but I am wondering why the code worked in
> the first place. Did the contract for the scorer change with Lucene 5?
>
> The issue was that underneath, each sub scorer had a posting enum and the
> initial document was not consumed on the first pass. Inside the
> DefaultBulkScorer, you have:
>
>   int doc = scorer.docID();
>   ...
>   return scoreRange(collector, scorer, twoPhase, acceptDocs, doc, max);
>
> So the first document is retrieved outside of the custom scorer. Inside
> the custom scorer base class, I had to add something like the code below to
> consume that first document:
>
>   if (firstTime) {
>     ...
>     // new code
>     for (Scorer scorer : subScorers) {
>       if (scorer.docID() == initialDoc) {
>         scorer.nextDoc();
>       }
>     }
>     ...
>   }
>
> I never wrote a custom scorer before (now that I see the power, I want to
> write my own!), so I am not sure how the existing code worked in Lucene 4.
> What I am confused about is why each subscorer needs to consume its first
> document before being used:
>
>   public void add(Scorer scorer) throws IOException {
>     // Initialize and retain only if it produces docs
>     if (scorer.nextDoc() != NO_MORE_DOCS) {
>       subScorers.add(scorer);
>     }
>   }
>
> The nextDoc() call advances the docBufferUpto pointer in the posting enum
> to the second document. The code does not work without calling nextDoc()
> initially on each subscorer. Very confusing.
>
> Although the existing unit test cases pass, I am still not confident about
> the code. I will write a few more test cases, but ultimately I want to
> understand why the code exists in the first place and potentially replace
> it with base classes.
>
> Ivan
>
> On Tue, Dec 29, 2015 at 7:01 AM, Ivan Brusic <i...@brusic.com> wrote:
>
>> Thanks Adrien. I added the BaseScorer to the gist, but what I was hoping
>> to learn was which direction I should go in to debug this issue. I was not
>> focusing on the scorers since I did not need to upgrade them and I actually
>> do not think I ever wrote my own Scorer in Lucene. Taking the next few days
>> off, so I will get around to looking back into it soon.
>>
>> Ivan
>>
>> On Mon, Dec 28, 2015 at 5:41 PM, Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> Ivan, I can't find the BaseScorer class in the gist. Maybe you forgot to
>>> git add it?
>>>
>>> On Mon, Dec 28, 2015 at 11:07 PM, Ivan Brusic <i...@brusic.com> wrote:
>>>
>>> > Here is the complete code:
>>> > https://gist.github.com/brusic/e3018a2e403f5707fa3e
>>> >
>>> > The code is not originally mine, so I do not take responsibility. Once I
>>> > get things to perform correctly, I will do another pass with improvements.
>>> > Much of the custom code needs to be re-thought.
>>> >
>>> > The scorer is one class that I did not need to update, so I did not focus
>>> > on it. Will do so now.
>>> >
>>> > Ivan
>>> >
>>> > On Mon, Dec 28, 2015 at 4:58 PM, Adrien Grand <jpou...@gmail.com> wrote:
>>> >
>>> > > Hi Ivan,
>>> > >
>>> > > It looks like your scorer is emitting the same document twice. Maybe you
>>> > > could try to use AssertingIndexSearcher in your test case; this is the
>>> > > kind of thing that it should catch.
>>> > >
>>> > > The only related Lucene 5 change that I can think of is that Lucene now
>>> > > requires docs to be collected in order. Did this scorer use to collect
>>> > > docs out of order in Lucene 4?
>>> > >
>>> > > If that still doesn't help and if you can share the code of your scorer,
>>> > > I could give it a quick look.
>>> > >
>>> > > On Mon, Dec 28, 2015 at 10:18 PM, Ivan Brusic <i...@brusic.com> wrote:
>>> > >
>>> > > > I just migrated a ton of code from Lucene 4.10 to 5.4. Lots of custom
>>> > > > collectors, analyzers, queries, etc. I have migrated other code bases
>>> > > > between Lucene versions before (2->3, 3->4) and I always had one issue
>>> > > > I could not eyeball!
>>> > > >
>>> > > > When using a custom query, I get the same document twice in the result
>>> > > > set. The changes I made for the upgrade had to do with the query/weight
>>> > > > API change.
>>> > > >
>>> > > > Without getting into the custom code, here is the simple test case:
>>> > > >
>>> > > > @BeforeClass
>>> > > > public static void buildIndex() throws IOException {
>>> > > >   ANALYZER = new StandardAnalyzer();
>>> > > >   IndexWriterConfig config = new IndexWriterConfig(ANALYZER);
>>> > > >   DIRECTORY = new RAMDirectory();
>>> > > >   try (IndexWriter writer = new IndexWriter(DIRECTORY, config)) {
>>> > > >     // removed for brevity
>>> > > >     // repeated five times with different values
>>> > > >     Document doc = new Document();
>>> > > >     doc.add(...);
>>> > > >     writer.addDocument(doc);
>>> > > >   }
>>> > > > }
>>> > > >
>>> > > > @Test
>>> > > > public void testQuery() throws IOException {
>>> > > >   try (IndexReader reader = DirectoryReader.open(DIRECTORY)) {
>>> > > >     IndexSearcher searcher = new IndexSearcher(reader);
>>> > > >
>>> > > >     PriorityQuery query = new PriorityQuery();
>>> > > >     query.add(new TermQuery(new Term("foo", "xyz")));
>>> > > >     query.add(new TermQuery(new Term("bar", "xyz")));
>>> > > >     query.add(new TermQuery(new Term("baz", "xyz")));
>>> > > >
>>> > > >     CheckHits.checkDocIds("Invalid docs", new int[] {4, 2, 0, 3},
>>> > > >         result.scoreDocs);
>>> > > >
>>> > > >   }
>>> > > > }
>>> > > >
>>> > > > There should be four unique results out of five since the second
>>> > > > document (docId 1) does not contain the term xyz. The results instead
>>> > > > contain 5 documents, with the first one appearing twice at the start:
>>> > > >
>>> > > > [doc=4 score=1.1976817 shardIndex=0, doc=4 score=1.1976817
>>> > > > shardIndex=0, doc=2 score=0.63170385 shardIndex=0, doc=0
>>> > > > score=0.37223506 shardIndex=0, doc=3 score=0.34156355 shardIndex=0]
>>> > > >
>>> > > > When using a BooleanQuery, the results are correct, so obviously the
>>> > > > custom Query is failing somehow. In all my years of Lucene, I never
>>> > > > had the same document twice. :) Without boring everyone with the
>>> > > > custom code, what should I be looking for?
>>> > > > Just cannot quite spot it.
>>> > > >
>>> > > > Cheers,
>>> > > >
>>> > > > Ivan
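
P.S. For anyone following the thread, the difference between the scoreAll and scoreRange paths described at the top of this message can be reproduced without Lucene. What follows is only a toy model, assuming the range path collects the scorer's current document before calling nextDoc(), as the quoted DefaultBulkScorer lines suggest. ToyScorer, PrePositionedScorer, and score() are hypothetical names made up for the illustration; they are not Lucene classes and not the code from the gist.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the iterator contract discussed in this thread; none of this is Lucene code. */
class ScorePathSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a scorer over a sorted array of doc ids. */
    static class ToyScorer {
        final int[] docs;
        int idx = -1; // -1 means "not positioned yet", so docID() returns -1

        ToyScorer(int... docs) { this.docs = docs; }

        int docID() {
            if (idx < 0) return -1;
            return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        }

        int nextDoc() {
            if (idx < docs.length) idx++;
            return docID();
        }
    }

    /**
     * Hypothetical scorer that is already positioned on its first doc when it is
     * handed to the bulk scorer, and whose first nextDoc() call returns that same
     * doc again unless consumeInitialDoc is set (a simplified model of the workaround).
     */
    static class PrePositionedScorer extends ToyScorer {
        final boolean consumeInitialDoc;
        boolean firstTime = true;

        PrePositionedScorer(boolean consumeInitialDoc, int... docs) {
            super(docs);
            this.consumeInitialDoc = consumeInitialDoc;
            this.idx = 0; // already sitting on the first doc
        }

        @Override
        int nextDoc() {
            if (firstTime) {
                firstTime = false;
                if (!consumeInitialDoc) return docID(); // re-emits the initial doc
            }
            return super.nextDoc();
        }
    }

    /** Collects doc ids using the Lucene 5.4 style dispatch quoted in this thread. */
    static List<Integer> score(ToyScorer scorer, int min, int max) {
        List<Integer> hits = new ArrayList<>();
        if (scorer.docID() == -1 && min == 0 && max == NO_MORE_DOCS) {
            // scoreAll-style loop: advance first, then collect
            for (int doc = scorer.nextDoc(); doc != NO_MORE_DOCS; doc = scorer.nextDoc()) {
                hits.add(doc);
            }
        } else {
            // scoreRange-style loop (assumed shape): collect the current doc first
            int doc = scorer.docID();
            while (doc < min) doc = scorer.nextDoc(); // linear stand-in for advance(min)
            while (doc < max) {
                hits.add(doc);
                doc = scorer.nextDoc();
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(score(new ToyScorer(0, 2, 3, 4), 0, NO_MORE_DOCS));
        System.out.println(score(new PrePositionedScorer(false, 0, 2, 3, 4), 0, NO_MORE_DOCS));
        System.out.println(score(new PrePositionedScorer(true, 0, 2, 3, 4), 0, NO_MORE_DOCS));
    }
}
```

In this model the unpositioned scorer takes the scoreAll path and prints [0, 2, 3, 4]. The pre-positioned scorer takes the range path and prints [0, 0, 2, 3, 4], the same kind of duplicate seen in the test. With the initial document consumed on the first call it prints [0, 2, 3, 4] again.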