RE: Duplicate values in search

Uwe Schindler Thu, 31 Dec 2015 01:08:38 -0800

Hi,

from the comments around your original code I have seen that there were 
instructions "how to update the code" (instructions what you need to copy from 
Lucene and what to change to create Base* classes). Maybe you missed to 
copypaste all files. In Lucene 5 there was a major rewrite also of 
Dismax/Dissum scorers.


In older Lucene versions there was an early exit condition outside of the 
scorer in the weight while creating the disjunction. Before initializing the 
Disjunction scorer it asked every sub scorer for the first doc - causing it to 
sit on first doc already. Every scorer that had no first doc (empty one) was 
not added to the main disjunction scorer anymore. Because of this the logic in 
the main disjunction scorer was a bit  non-natural, because the subscoreres 
were all initialized - so Dis*Scorer did not call next() on the first round.

This changed in 5.x and trunk during the code rewrite. The check for empty 
scorers is done in a different way now (as you said in your previous mail), so 
be sure to copy paste really all classes. Because the scorer and weights are 
not a public classes in Lucene it is free to change its internal behaviour.

> The condition to execute a scoreAll verses a scoreRange is more stringent
> in Lucene 5.4.  In my case, scorer.docID() is not equal to -1 at the start,
> so the code path will execute a scoreRange instead, which does not provide
> the same behavior as scoreAll.

It was always required to be -1 at start before the first call to 
next()/advance() - see documentation of DocIdSetIterator/Scorer. But in the 
older Disjunction code never checked for this. AssertingScorer from test 
framework checks all these conditions. The new logic about the pre-check for 
short circuiting the disjunction creating is now different and relies on 
"correct" behaviour of all (sub-)scorers.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Ivan Brusic [mailto:i...@brusic.com]
> Sent: Thursday, December 31, 2015 3:18 AM
> To: java-user@lucene.apache.org
> Subject: Re: Duplicate values in search
> 
> I potentially found the issue, but I am wondering why the code worked in
> the first place. Did the contract for the scorer change with Lucene 5?
> 
> The issue was that underneath, each sub scorer had a posting enum and the
> initial document was not consumed on the first pass.  Inside the
> DefaultBulkScorer, you have:
> 
> int doc = scorer.docID();
> ...
> return scoreRange(collector, scorer, twoPhase, acceptDocs, doc, max);
> 
> So the first document is retrieved outside of the custom scorer. Inside the
> custom scorer base class, I had to add something like the code below to
> consume that first document:
> 
> if (firstTime) {
>     ...
>     // new code
>     for (Scorer scorer: subScorers) {
>         if (scorer.docID() == initialDoc) {
>             scorer.nextDoc();
>         }
>     }
>     ...
> }
> 
> I never wrote a custom scorer before (now that I see the power, I want to
> write my own!), so I am not sure how the existing code worked in Lucene 4.
> What I am confused is why does each subscorer need to consume their first
> document before being used:
> 
> public void add(Scorer scorer) throws IOException {
>     if (scorer.nextDoc() != NO_MORE_DOCS) { // Initialize and retain only
> if it produces docs
>         subScorers.add(scorer);
>     }
> }
> 
> The nextDoc() call advances the docBufferUpto pointer in the posting enum
> to the second document. Code does not work without call nextDoc() initially
> on each subscorer. Very confusing.
> 
> Although the existing unit test cases pass, I am still not confident about
> the code. Will write a few more test cases, but ultimately why the code
> exists in the first place and potentially replace it with base classes.
> 
> Ivan
> 
> On Tue, Dec 29, 2015 at 7:01 AM, Ivan Brusic <i...@brusic.com> wrote:
> 
> > Thanks Adrien. I added the BaseScorer to the gist, but I was hoping to
> > achieve was which direction I should go into to debug this issue. I was not
> > focusing on the scorers since I did not need to upgrade them and I actually
> > do not think I ever wrote my one Scorer in Lucene. Taking the next few
> days
> > off, so I will get around to looking back into it soon.
> >
> > Ivan
> >
> > On Mon, Dec 28, 2015 at 5:41 PM, Adrien Grand <jpou...@gmail.com>
> wrote:
> >
> >> Ivan, I can't find the BaseScorer class in the gist. Maybe you forgot to
> >> git add it?
> >>
> >> Le lun. 28 déc. 2015 à 23:07, Ivan Brusic <i...@brusic.com> a écrit :
> >>
> >> > Here is the complete code:
> >> > https://gist.github.com/brusic/e3018a2e403f5707fa3e
> >> >
> >> > The code is not originally mine, so I do not take responsibility. Once I
> >> > get things to perform correctly, I will do another pass with
> >> improvements.
> >> > Much of the custom code needs to be re-thought.
> >> >
> >> > The scorer is one class that I did not need to update, so I did not
> >> focus
> >> > on it. Will do so now.
> >> >
> >> > Ivan
> >> >
> >> > On Mon, Dec 28, 2015 at 4:58 PM, Adrien Grand <jpou...@gmail.com>
> >> wrote:
> >> >
> >> > > Hi Ivan,
> >> > >
> >> > > It looks like your scorer is emitting the same document twice. Maybe
> >> you
> >> > > could try to use AssertingIndexSearcher in your test case, this is the
> >> > kind
> >> > > of things that it should catch.
> >> > >
> >> > > The only related Lucene 5 change that I can think of is that Lucene
> >> now
> >> > > requires docs to be collected in order, did this scorer use to collect
> >> > docs
> >> > > out of order in Lucene 4?
> >> > >
> >> > > If that still doesn't help and if you can share the code of your
> >> scorer,
> >> > I
> >> > > could give it a quick look.
> >> > >
> >> > > Le lun. 28 déc. 2015 à 22:18, Ivan Brusic <i...@brusic.com> a écrit :
> >> > >
> >> > > > I just migrated on ton of code from Lucene 4.10 to 5.4. Lots of
> >> custom
> >> > > > collectors, analyzers, queries, etc.. I have migrated other code
> >> bases
> >> > > from
> >> > > > Lucene before (2->3, 3->4) and I always had one issue I could not
> >> > > eyeball!
> >> > > >
> >> > > > When using a custom query, I get the same document twice in the
> >> result
> >> > > set.
> >> > > > The changes I made for the upgrade had to do with the
> query/weight
> >> API
> >> > > > change.
> >> > > >
> >> > > > Without getting in the custom code, here is the simple test case:
> >> > > >
> >> > > > @BeforeClass
> >> > > > public static void buildIndex() throws IOException {
> >> > > >     ANALYZER = new StandardAnalyzer();
> >> > > >     IndexWriterConfig config = new IndexWriterConfig(ANALYZER);
> >> > > >     DIRECTORY = new RAMDirectory();
> >> > > >     try (IndexWriter writer = new IndexWriter(DIRECTORY, config)) {
> >> > > >         // removed for brevity
> >> > > >         // repeated five times with different values
> >> > > >         Document doc = new Document();
> >> > > >         doc.add(...);
> >> > > >         writer.addDocument(doc);
> >> > > >     }
> >> > > > }
> >> > > >
> >> > > > @Test
> >> > > > public void testQuery() throws IOException {
> >> > > >     try (IndexReader reader = DirectoryReader.open(DIRECTORY)) {
> >> > > >         IndexSearcher searcher = new IndexSearcher(reader);
> >> > > >
> >> > > >         PriorityQuery query = new PriorityQuery();
> >> > > >         query.add(new TermQuery(new Term("foo", "xyz")));
> >> > > >         query.add(new TermQuery(new Term("bar", "xyz")));
> >> > > >         query.add(new TermQuery(new Term("baz", "xyz")));
> >> > > >
> >> > > >         CheckHits.checkDocIds("Invalid docs", new int[] {4, 2, 0,
> >> 3},
> >> > > > result.scoreDocs);
> >> > > >
> >> > > > }
> >> > > >
> >> > > > There should be four unique results out of five since the second
> >> > > > document (docId 1) does not contain the term xyz. The results
> >> instead
> >> > > > contain 5 documents, with the first one repeated twice at the start:
> >> > > >
> >> > > > [doc=4 score=1.1976817 shardIndex=0, doc=4 score=1.1976817
> >> > > > shardIndex=0, doc=2 score=0.63170385 shardIndex=0, doc=0
> >> > > > score=0.37223506 shardIndex=0, doc=3 score=0.34156355
> shardIndex=0]
> >> > > >
> >> > > > When using a BooleanQuery, the results are correct, so obviously
> the
> >> > > > custom Query is failing somehow. In all my years of Lucene, I never
> >> > > > had the same document twice. :) Without boring everyone with the
> >> > > > custom code, what should I be looking for? Just cannot quite spot
> >> it.
> >> > > >
> >> > > > Cheers,
> >> > > >
> >> > > > Ivan
> >> > > >
> >> > >
> >> >
> >>
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

RE: Duplicate values in search

Reply via email to