[ http://issues.apache.org/jira/browse/LUCENE-697?page=all ]

Doron Cohen updated LUCENE-697:
-------------------------------

    Attachment: sloppy_phrase_skipTo.patch

This was tricky, for me anyhow, but I think I found it.

The difference in scoring between using next() and using skipTo() (or a 
combination of the two) is caused by two (valid) orders of the sorted 
PhrasePositions. 

Currently PhrasePositions sorting is defined by doc and position, where 
position already considers the offset of the term within the (phrase) query. 

If, however, two PhrasePositions have the same doc and the same position, the 
sort makes no decision, so either order is a valid sort (by the current sort 
definition). The difference between next() and skipTo() in this regard is that 
skipTo() always calls sort(), sorting the entire set, while next() only calls 
sort() at initialization and then maintains the sorting as part of the scoring 
process. 

This would be clearer with the following example - taken from Yonik's test case 
that is failing now:
   - Doc1:     w1 w3 w2 w3 zz
   - Query:   "w3 w2"~2
When starting scoring in this doc, both PhrasePositions pp(w3) and pp(w2) have 
doc(w2)=doc(w3)=1.
Note that for the second w3 that matches we would have pos(w2)=2+1=3 and 
pos(w3)=3+0=3. 

So, after scoring doc1("w3 w2"), if the sort places pp(w2) at the top, we 
would also score the second match for doc1("w3 w2"). However, if the sort 
places pp(w3) at the top (== smallest), we would not score that second match. 

Current behavior is inconsistent: skipTo() would count both matches while 
next() would not, and I think it is possible to create a case where it would 
be the other way around. So the behavior should definitely be made consistent. 

The next question is: do we want to sum (or max) the frequency over both (or 
more) cases? I think yes, sum. 

To fix this I am changing the PhrasePositions comparison so that, when 
positions are equal, the actual doc position (ignoring the offset in the query 
phrase) is considered. 
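
A minimal sketch of the tie-break idea, not the actual patch: the class 
PhrasePos, its field names, and both comparators are hypothetical stand-ins 
for PhrasePositions and its queue ordering. The values are the ones from the 
example above, and the sketch assumes the smaller actual doc position should 
sort first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for PhrasePositions; field names are assumptions.
final class PhrasePos {
    final String term;
    final int doc;          // current document
    final int position;     // position that already accounts for the query offset
    final int docPosition;  // actual position of the term in the document

    PhrasePos(String term, int doc, int position, int docPosition) {
        this.term = term;
        this.doc = doc;
        this.position = position;
        this.docPosition = docPosition;
    }
}

final class TieBreakSketch {
    // Ordering as described for the current code: doc, then position.
    // Entries with equal doc and position compare as equal (undecided).
    static final Comparator<PhrasePos> BY_DOC_AND_POSITION =
            Comparator.<PhrasePos>comparingInt(p -> p.doc)
                      .thenComparingInt(p -> p.position);

    // Proposed ordering: equal positions are resolved by the actual doc position.
    static final Comparator<PhrasePos> WITH_TIE_BREAK =
            BY_DOC_AND_POSITION.thenComparingInt(p -> p.docPosition);

    // Returns the term that ends up first (smallest) for the given input order.
    static String top(Comparator<PhrasePos> order, PhrasePos... entries) {
        List<PhrasePos> sorted = new ArrayList<>(Arrays.asList(entries));
        sorted.sort(order); // stable sort: tied entries keep their input order
        return sorted.get(0).term;
    }

    public static void main(String[] args) {
        PhrasePos w2 = new PhrasePos("w2", 1, 3, 2);
        PhrasePos w3 = new PhrasePos("w3", 1, 3, 3);
        // Without the tie-break the result depends on the input order.
        System.out.println(top(BY_DOC_AND_POSITION, w2, w3)); // w2
        System.out.println(top(BY_DOC_AND_POSITION, w3, w2)); // w3
        // With the tie-break the result is the same for both input orders.
        System.out.println(top(WITH_TIE_BREAK, w2, w3));      // w2
        System.out.println(top(WITH_TIE_BREAK, w3, w2));      // w2
    }
}
```

With a total order like this, a full sort() and an incrementally maintained 
queue have only one valid arrangement to agree on.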

In addition, I added missing calls to clear the priority queue before starting 
to sort and to mark that no more initialization is required when skipTo() is 
called. 

I tested with the sequence that Yonik added:
    - skip skip next next skip skip 
And also with the sequences:
    - skip skip skip skip skip skip
    - next next next next next next 
    - skip next skip next skip next 
    - next skip next skip next skip
    - next next skip skip next next
The latter five sequences are now commented out; the first one is in effect.
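
The kind of check these sequences perform can be sketched roughly as follows. 
This is not the test code from the patch: ToyScorer and SequenceCheck are 
hypothetical names, and the toy scorer only mirrors the general 
next()/skipTo()/doc()/score() shape over a fixed doc list. The idea is that 
every doc reached through any mix of skip and next steps should report the 
same score as a next-only pass.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy scorer over a fixed, sorted list of docs with fixed scores.
final class ToyScorer {
    private final int[] docs;
    private final float[] scores;
    private int idx = -1;

    ToyScorer(int[] docs, float[] scores) {
        this.docs = docs;
        this.scores = scores;
    }

    // Advances to the next doc; returns false when exhausted.
    boolean next() {
        idx++;
        return idx < docs.length;
    }

    // Advances at least once, then until the current doc is >= target.
    boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (docs[idx] < target);
        return true;
    }

    int doc() { return docs[idx]; }

    float score() { return scores[idx]; }
}

final class SequenceCheck {
    // Repeats a pattern such as "skip skip next next skip skip" until the
    // scorer is exhausted, recording the score seen for each visited doc.
    static Map<Integer, Float> run(ToyScorer scorer, String pattern) {
        String[] steps = pattern.trim().split("\\s+");
        Map<Integer, Float> scoresByDoc = new LinkedHashMap<>();
        int lastDoc = -1;
        for (int i = 0; ; i++) {
            String step = steps[i % steps.length];
            boolean more = step.equals("skip")
                    ? scorer.skipTo(lastDoc + 1)
                    : scorer.next();
            if (!more) break;
            lastDoc = scorer.doc();
            scoresByDoc.put(lastDoc, scorer.score());
        }
        return scoresByDoc;
    }

    public static void main(String[] args) {
        int[] docs = {1, 4, 5, 9};
        float[] scores = {0.5f, 1.0f, 0.25f, 2.0f};
        Map<Integer, Float> baseline = run(new ToyScorer(docs, scores), "next");
        String[] patterns = {
            "skip skip next next skip skip",
            "skip skip skip skip skip skip",
            "skip next skip next skip next",
            "next skip next skip next skip",
            "next next skip skip next next",
        };
        for (String pattern : patterns) {
            boolean same = run(new ToyScorer(docs, scores), pattern).equals(baseline);
            System.out.println(pattern + " -> matches next-only pass: " + same);
        }
    }
}
```

In the toy the scores are fixed, so every pattern trivially matches; against 
a real sloppy phrase scorer the same comparison is what would expose a 
next()/skipTo() discrepancy.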

This scoring code is still not feeling natural to me, so (actually as always) 
comments will be appreciated.

- Doron

> Scorer.skipTo affects sloppyPhrase scoring
> ------------------------------------------
>
>                 Key: LUCENE-697
>                 URL: http://issues.apache.org/jira/browse/LUCENE-697
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.0.0
>            Reporter: Yonik Seeley
>         Assigned To: Doron Cohen
>         Attachments: sloppy_phrase_skipTo.patch
>
>
> If you mix skipTo() and next(), you get different scores than what is 
> returned to a hit collector.


        
