[ http://issues.apache.org/jira/browse/LUCENE-736?page=all ]

Doron Cohen updated LUCENE-736:
-------------------------------

    Attachment: sloppy_phrase.patch2.txt
                res-search-orig2.log
                res-search-new2.log

The change that fixes case 2 was not the main cause of the performance 
degradation.

I agree with Yonik that case 2 is much more important than case 1, so I 
modified the patch to fix case 2 but not case 1. 
I also extended the perf test to create the "reversed" form of each sloppy 
phrase (slop was increased for the reversed cases so that the queries would 
still match docs).

The cost of this fix dropped from 15% more CPU time to about 3%. 
I feel OK with this.

Operation                 runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
Orig  SearchSameRdr_6000       4        6000  194.2      123.59   8,032,732   11,333,632
New   SearchSameRdr_6000       4        6000  187.5      128.02   8,172,258   11,333,632
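
For anyone who wants to check the "about 3%" figure against the two rows 
above, here is a minimal Java sketch of the arithmetic. The class and method 
names are made up for illustration; only the four numbers come from the table.

```java
class PerfOverhead {
    // Extra elapsed time of the patched run relative to the original, in percent.
    static double extraTimePercent(double origSec, double newSec) {
        return (newSec - origSec) / origSec * 100.0;
    }

    // Throughput lost by the patched run relative to the original, in percent.
    static double throughputLossPercent(double origRate, double newRate) {
        return (origRate - newRate) / origRate * 100.0;
    }

    public static void main(String[] args) {
        // elapsedSec and rec/s columns of the Orig and New rows.
        System.out.printf("extra elapsed time: %.2f%%%n", extraTimePercent(123.59, 128.02));
        System.out.printf("throughput loss:    %.2f%%%n", throughputLossPercent(194.2, 187.5));
    }
}
```

Both figures come out between 3% and 4%, which is consistent with the rough 
number quoted above.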

The attached sloppy_phrase.patch2.txt has the updated fix, including both the 
Java and test parts. Some of the asserts in the new tests are commented out 
because the patch deliberately does not fix case 1 from the issue description.

Also attaching the updated perf test logs: res-search-orig2.log and 
res-search-new2.log.

I did not compare scoring of similar cases between sloppy phrase and near 
spans, as Paul suggested. Perhaps next week; I am not sure this should hold up 
progress on this issue.

> Sloppy Phrase Scoring Misbehavior
> ---------------------------------
>
>                 Key: LUCENE-736
>                 URL: http://issues.apache.org/jira/browse/LUCENE-736
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>            Reporter: Doron Cohen
>         Assigned To: Doron Cohen
>            Priority: Minor
>         Attachments: perf-search-new.log, perf-search-orig.log, 
> res-search-new2.log, res-search-orig2.log, sloppy_phrase.patch2.txt, 
> sloppy_phrase_java.patch.txt, sloppy_phrase_tests.patch.txt
>
>
> This is an extension of https://issues.apache.org/jira/browse/LUCENE-697
> In addition to the abnormalities Yonik pointed out in 697, there seem to be 
> other issues with sloppy phrase search and scoring.
> 1) A phrase with a repeated word would be detected in a document although it 
> is not there.
> I.e. document = A B D C E , query = "B C B" would not find this document (as 
> expected), but query "B C B"~2 would find it. 
> I think that no matter how large the slop is, this document should not be a 
> match.
> 2) A document containing both orders of a query, symmetrically, would score 
> differently for the query and for its reversed form.
> I.e. document = A B C B A would score differently for queries "B C"~2 and "C 
> B"~2, although it is symmetric to both.
> I will attach test cases that show both these problems and the one reported 
> by Yonik in 697. 
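
The expected behavior in the two cases quoted above can be sketched with a 
small brute-force reference in Java. This is not Lucene's scorer and not the 
attached patch; it is an illustration under stated assumptions: every query 
term must be assigned to its own distinct document position, the match length 
is max(pos - offset) - min(pos - offset) with offset being the term's index in 
the query, and each match within the slop contributes 1/(length + 1), which is 
modeled on the sloppy frequency weighting of Lucene's default similarity. All 
class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SloppyPhraseSketch {

    // Match length of every assignment of query terms to DISTINCT doc positions.
    static List<Integer> matchLengths(String[] doc, String[] query) {
        List<Integer> out = new ArrayList<>();
        if (query.length == 0) {
            return out; // nothing to match
        }
        assign(doc, query, 0, new boolean[doc.length],
               Integer.MAX_VALUE, Integer.MIN_VALUE, out);
        return out;
    }

    // Recursively picks a doc position for query term i, never reusing a position.
    private static void assign(String[] doc, String[] query, int i, boolean[] used,
                               int min, int max, List<Integer> out) {
        if (i == query.length) {
            out.add(max - min);
            return;
        }
        for (int p = 0; p < doc.length; p++) {
            if (used[p] || !doc[p].equals(query[i])) {
                continue;
            }
            used[p] = true;
            int shifted = p - i; // position relative to the term's offset in the query
            assign(doc, query, i + 1, used,
                   Math.min(min, shifted), Math.max(max, shifted), out);
            used[p] = false;
        }
    }

    // Sum of 1/(length + 1) over all assignments whose length is within the slop.
    static double sloppyFreq(String[] doc, String[] query, int slop) {
        double freq = 0.0;
        for (int len : matchLengths(doc, query)) {
            if (len <= slop) {
                freq += 1.0 / (len + 1);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        String[] doc1 = "A B D C E".split(" ");
        String[] doc2 = "A B C B A".split(" ");
        // Case 1: only one B in the document, so no slop can produce a match.
        System.out.println(sloppyFreq(doc1, new String[] {"B", "C", "B"}, 100)); // prints 0.0
        // Case 2: the symmetric document gives the same value for both orders.
        System.out.println(sloppyFreq(doc2, new String[] {"B", "C"}, 2));
        System.out.println(sloppyFreq(doc2, new String[] {"C", "B"}, 2));
    }
}
```

Under these assumptions, "B C B" never matches A B D C E because the single B 
cannot serve both B terms, and "B C"~2 and "C B"~2 each find one match of 
length 0 and one of length 2 in A B C B A, so they get the same frequency.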


        
