Re: Duplicate values in search

Ivan Brusic Wed, 30 Dec 2015 19:22:25 -0800

To partially answer my own question, one key difference is in DefaultBulkScorer:


Lucene 4.10

    public boolean score(Collector collector, int max) throws IOException {
      ...
      if (max == DocIdSetIterator.NO_MORE_DOCS) {
        scoreAll(collector, scorer);
        return false;
      }
      ...
    }

Lucene 5.4

    public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
      ...
      if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
        scoreAll(collector, scorer, twoPhase, acceptDocs);
        return DocIdSetIterator.NO_MORE_DOCS;
      }
      ...
    }

The condition for executing scoreAll versus scoreRange is more stringent
in Lucene 5.4.  In my case, scorer.docID() is not equal to -1 at the start,
so the code path executes scoreRange instead, which does not behave the
same way as scoreAll.
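
To make the difference concrete, here is a minimal sketch of the two code
paths as I understand them. It is a simplified model, not Lucene code:
DocIterator, PrePositionedIterator, and score54 are hypothetical names, and
acceptDocs, twoPhase, and the min handling are left out. The iterator mimics
a scorer that is already positioned on its first document and whose first
nextDoc() call returns that document without advancing.

```java
import java.util.ArrayList;
import java.util.List;

public class BulkScorerDispatchSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a scorer; not Lucene's Scorer class. */
    interface DocIterator {
        int docID();
        int nextDoc();
    }

    /** Models a scorer that sits on its first doc before scoring starts. */
    static class PrePositionedIterator implements DocIterator {
        private final int[] docs;
        private int idx = 0;              // already on docs[0], so docID() != -1
        private boolean firstTime = true;

        PrePositionedIterator(int... docs) { this.docs = docs; }

        public int docID() { return idx < docs.length ? docs[idx] : NO_MORE_DOCS; }

        public int nextDoc() {
            if (firstTime) {              // first call returns the current doc unchanged
                firstTime = false;
                return docID();
            }
            idx++;
            return docID();
        }
    }

    /** 4.10-style path: every doc comes from a nextDoc() call. */
    static List<Integer> scoreAll(DocIterator it) {
        List<Integer> collected = new ArrayList<>();
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            collected.add(doc);
        }
        return collected;
    }

    /** 5.4-style path: collect the doc the scorer is on, then call nextDoc(). */
    static List<Integer> scoreRange(DocIterator it, int doc, int max) {
        List<Integer> collected = new ArrayList<>();
        while (doc < max) {
            collected.add(doc);
            doc = it.nextDoc();
        }
        return collected;
    }

    /** 5.4-style dispatch: scoreAll only when the scorer is unpositioned. */
    static List<Integer> score54(DocIterator it) {
        if (it.docID() == -1) {
            return scoreAll(it);
        }
        return scoreRange(it, it.docID(), NO_MORE_DOCS);
    }

    public static void main(String[] args) {
        System.out.println(scoreAll(new PrePositionedIterator(0, 2, 3, 4)));
        System.out.println(score54(new PrePositionedIterator(0, 2, 3, 4)));
    }
}
```

In this model the 4.10-style scoreAll collects the first document once, while
the 5.4-style dispatch collects it twice, which would be consistent with the
duplicate I am seeing.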

Should a scorer's docID() now return -1 while the scorer is still in its
initial, unpositioned state?
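
As far as I can tell from the DocIdSetIterator javadoc, docID() is expected
to return -1 until nextDoc() or advance() has been called, so the answer is
probably yes. Below is a minimal sketch of a disjunction that follows that
contract, assuming the sub scorers are positioned lazily on the first
nextDoc() call rather than in add(). SubIterator and LazyDisjunction are
hypothetical stand-ins, not Lucene classes and not the classes from my code.

```java
import java.util.ArrayList;
import java.util.List;

public class LazyDisjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical sub scorer over a sorted array of doc IDs. */
    static class SubIterator {
        private final int[] docs;
        private int idx = -1;                        // unpositioned

        SubIterator(int... docs) { this.docs = docs; }

        int docID() {
            if (idx < 0) return -1;
            return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        }

        int nextDoc() {
            idx++;
            return docID();
        }
    }

    /** Disjunction whose docID() stays at -1 until the first nextDoc(). */
    static class LazyDisjunction {
        private final List<SubIterator> subs = new ArrayList<>();
        private int current = -1;

        void add(SubIterator sub) { subs.add(sub); } // no nextDoc() call here

        int docID() { return current; }

        int nextDoc() {
            int min = NO_MORE_DOCS;
            for (SubIterator sub : subs) {
                // Advance every sub scorer at or behind the current doc.
                // On the first call current is -1, so all of them get positioned.
                if (sub.docID() <= current) {
                    sub.nextDoc();
                }
                min = Math.min(min, sub.docID());
            }
            current = min;
            return current;
        }
    }

    /** Drains the disjunction through nextDoc() only. */
    static List<Integer> collectAll(LazyDisjunction d) {
        List<Integer> collected = new ArrayList<>();
        for (int doc = d.nextDoc(); doc != NO_MORE_DOCS; doc = d.nextDoc()) {
            collected.add(doc);
        }
        return collected;
    }

    public static void main(String[] args) {
        LazyDisjunction d = new LazyDisjunction();
        d.add(new SubIterator(0, 4));
        d.add(new SubIterator(2, 4));
        d.add(new SubIterator(3, 4));
        System.out.println(d.docID());      // -1 before iteration starts
        System.out.println(collectAll(d));  // each matching doc once
    }
}
```

Because docID() is -1 before the first nextDoc(), a caller that checks for -1
would take the scoreAll-style path, and each document is emitted only once
even when several sub scorers match it.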

Ivan

On Wed, Dec 30, 2015 at 6:17 PM, Ivan Brusic <i...@brusic.com> wrote:

> I potentially found the issue, but I am wondering why the code worked in
> the first place. Did the contract for the scorer change with Lucene 5?
>
> The issue was that underneath, each sub scorer had a posting enum and the
> initial document was not consumed on the first pass.  Inside the
> DefaultBulkScorer, you have:
>
> int doc = scorer.docID();
> ...
> return scoreRange(collector, scorer, twoPhase, acceptDocs, doc, max);
>
> So the first document is retrieved outside of the custom scorer. Inside
> the custom scorer base class, I had to add something like the code below to
> consume that first document:
>
> if (firstTime) {
>     ...
>     // new code
>     for (Scorer scorer: subScorers) {
>         if (scorer.docID() == initialDoc) {
>             scorer.nextDoc();
>         }
>     }
>     ...
> }
>
> I have never written a custom scorer before (now that I see the power, I
> want to write my own!), so I am not sure how the existing code worked in
> Lucene 4. What confuses me is why each subscorer needs to consume its
> first document before being used:
>
> public void add(Scorer scorer) throws IOException {
>     // Initialize and retain only if it produces docs
>     if (scorer.nextDoc() != NO_MORE_DOCS) {
>         subScorers.add(scorer);
>     }
> }
>
> The nextDoc() call advances the docBufferUpto pointer in the posting enum
> to the second document. The code does not work without calling nextDoc()
> initially on each subscorer. Very confusing.
>
> Although the existing unit test cases pass, I am still not confident about
> the code. I will write a few more test cases, but ultimately I want to
> understand why the code exists in the first place and potentially replace
> it with base classes.
>
> Ivan
>
> On Tue, Dec 29, 2015 at 7:01 AM, Ivan Brusic <i...@brusic.com> wrote:
>
>> Thanks Adrien. I added the BaseScorer to the gist, but what I was hoping
>> to get was a sense of which direction to go in to debug this issue. I was
>> not focusing on the scorers since I did not need to upgrade them, and I
>> actually do not think I ever wrote my own Scorer in Lucene. Taking the
>> next few days off, so I will get around to looking back into it soon.
>>
>> Ivan
>>
>> On Mon, Dec 28, 2015 at 5:41 PM, Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> Ivan, I can't find the BaseScorer class in the gist. Maybe you forgot to
>>> git add it?
>>>
>>> On Mon, Dec 28, 2015 at 11:07 PM, Ivan Brusic <i...@brusic.com> wrote:
>>>
>>> > Here is the complete code:
>>> > https://gist.github.com/brusic/e3018a2e403f5707fa3e
>>> >
>>> > The code is not originally mine, so I do not take responsibility. Once I
>>> > get things to perform correctly, I will do another pass with
>>> > improvements. Much of the custom code needs to be re-thought.
>>> >
>>> > The scorer is one class that I did not need to update, so I did not
>>> > focus on it. Will do so now.
>>> >
>>> > Ivan
>>> >
>>> > On Mon, Dec 28, 2015 at 4:58 PM, Adrien Grand <jpou...@gmail.com> wrote:
>>> >
>>> > > Hi Ivan,
>>> > >
>>> > > It looks like your scorer is emitting the same document twice. Maybe
>>> > > you could try to use AssertingIndexSearcher in your test case; this
>>> > > is the kind of thing that it should catch.
>>> > >
>>> > > The only related Lucene 5 change that I can think of is that Lucene
>>> > > now requires docs to be collected in order. Did this scorer use to
>>> > > collect docs out of order in Lucene 4?
>>> > >
>>> > > If that still doesn't help and if you can share the code of your
>>> > > scorer, I could give it a quick look.
>>> > >
>>> > > On Mon, Dec 28, 2015 at 10:18 PM, Ivan Brusic <i...@brusic.com> wrote:
>>> > >
>>> > > > I just migrated a ton of code from Lucene 4.10 to 5.4. Lots of
>>> > > > custom collectors, analyzers, queries, etc. I have migrated other
>>> > > > code bases between Lucene versions before (2->3, 3->4) and I always
>>> > > > had one issue I could not eyeball!
>>> > > >
>>> > > > When using a custom query, I get the same document twice in the
>>> > > > result set. The changes I made for the upgrade had to do with the
>>> > > > query/weight API change.
>>> > > >
>>> > > > Without getting into the custom code, here is the simple test case:
>>> > > >
>>> > > > @BeforeClass
>>> > > > public static void buildIndex() throws IOException {
>>> > > >     ANALYZER = new StandardAnalyzer();
>>> > > >     IndexWriterConfig config = new IndexWriterConfig(ANALYZER);
>>> > > >     DIRECTORY = new RAMDirectory();
>>> > > >     try (IndexWriter writer = new IndexWriter(DIRECTORY, config)) {
>>> > > >         // removed for brevity
>>> > > >         // repeated five times with different values
>>> > > >         Document doc = new Document();
>>> > > >         doc.add(...);
>>> > > >         writer.addDocument(doc);
>>> > > >     }
>>> > > > }
>>> > > >
>>> > > > @Test
>>> > > > public void testQuery() throws IOException {
>>> > > >     try (IndexReader reader = DirectoryReader.open(DIRECTORY)) {
>>> > > >         IndexSearcher searcher = new IndexSearcher(reader);
>>> > > >
>>> > > >         PriorityQuery query = new PriorityQuery();
>>> > > >         query.add(new TermQuery(new Term("foo", "xyz")));
>>> > > >         query.add(new TermQuery(new Term("bar", "xyz")));
>>> > > >         query.add(new TermQuery(new Term("baz", "xyz")));
>>> > > >
>>> > > >         // assumed: run the query and keep the top 10 hits
>>> > > >         TopDocs result = searcher.search(query, 10);
>>> > > >         CheckHits.checkDocIds("Invalid docs", new int[] {4, 2, 0, 3},
>>> > > >             result.scoreDocs);
>>> > > >     }
>>> > > > }
>>> > > >
>>> > > > There should be four unique results out of five since the second
>>> > > > document (docId 1) does not contain the term xyz. The results
>>> > > > instead contain 5 documents, with the first one repeated twice at
>>> > > > the start:
>>> > > > [doc=4 score=1.1976817 shardIndex=0, doc=4 score=1.1976817
>>> > > > shardIndex=0, doc=2 score=0.63170385 shardIndex=0, doc=0
>>> > > > score=0.37223506 shardIndex=0, doc=3 score=0.34156355 shardIndex=0]
>>> > > >
>>> > > > When using a BooleanQuery, the results are correct, so obviously
>>> > > > the custom Query is failing somehow. In all my years of Lucene, I
>>> > > > never had the same document twice. :) Without boring everyone with
>>> > > > the custom code, what should I be looking for? Just cannot quite
>>> > > > spot it.
>>> > > >
>>> > > > Cheers,
>>> > > >
>>> > > > Ivan
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>
