[ http://issues.apache.org/jira/browse/LUCENE-323?page=all ]

Chuck Williams updated LUCENE-323:
----------------------------------

    Attachment: dms.tar.gz

The attached archive contains a revised DisjunctionMaxScorer that maintains the 
disjunct scorers as a min heap instead of a sorted list.  This reduces the time 
per next() to O(k*log(n)) instead of O(k*n) per Paul's earlier comment.  Most 
of the class changed, so I included both a patch and the new class.  This is 
only lightly tested; the junit test passes, along with the entire Lucene test 
suite.  I am no longer working on the project that led to the original class, 
so I have not tested it there.  I am working on a new project that will use 
this, so it will get thoroughly tested, but that project is not yet at an 
appropriate point.  I thought it best to post the patch now, as I believe it 
is correct and the unit test passes.  Perhaps others would like to try it 
out; e.g., it would be interesting to run the performance test that Paul 
mentions.
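
For readers without the archive at hand, here is a minimal sketch of the
heap idea.  It is not the code in dms.tar.gz: HeapDisMaxSketch, SubScorer and
their methods are hypothetical names, the sub-scorers are plain arrays, and
scores are assumed to be non-negative.  Each scorer positioned on the current
document costs one poll and at most one add on the heap, which is presumably
where the log(n) factor comes from.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch; names and structure are not taken from dms.tar.gz. */
class HeapDisMaxSketch {

    /** Stand-in for a disjunct scorer: strictly increasing doc ids, one score each. */
    static final class SubScorer {
        private final int[] docs;
        private final float[] scores;
        private int pos = -1;

        SubScorer(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
        }

        boolean next() { return ++pos < docs.length; }
        int doc() { return docs[pos]; }
        float score() { return scores[pos]; }
    }

    // Min heap ordered by each sub-scorer's current doc id.
    private final PriorityQueue<SubScorer> heap =
        new PriorityQueue<SubScorer>(Comparator.comparingInt(SubScorer::doc));
    private final float tieBreaker;
    private int doc = -1;
    private float score;

    HeapDisMaxSketch(List<SubScorer> subs, float tieBreaker) {
        this.tieBreaker = tieBreaker;
        for (SubScorer s : subs) {
            if (s.next()) heap.add(s); // only scorers with a first doc enter the heap
        }
    }

    /** Moves to the next matching doc; returns false once every scorer is exhausted. */
    boolean next() {
        if (heap.isEmpty()) return false;
        doc = heap.peek().doc();
        float max = 0f;
        float sum = 0f;
        // Pop every scorer sitting on the smallest doc id, then re-insert it
        // at its next position if it has one.
        while (!heap.isEmpty() && heap.peek().doc() == doc) {
            SubScorer s = heap.poll();
            sum += s.score();
            max = Math.max(max, s.score());
            if (s.next()) heap.add(s);
        }
        // Best disjunct plus tieBreaker times the sum of the others.
        score = max + tieBreaker * (sum - max);
        return true;
    }

    int doc() { return doc; }
    float score() { return score; }

    public static void main(String[] args) {
        HeapDisMaxSketch scorer = new HeapDisMaxSketch(Arrays.asList(
            new SubScorer(new int[] {1, 3}, new float[] {1.0f, 2.0f}),
            new SubScorer(new int[] {3, 7}, new float[] {0.5f, 1.0f})), 0.1f);
        while (scorer.next()) {
            System.out.println(scorer.doc() + " " + scorer.score());
        }
    }
}
```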

Also, I found and fixed another bug while updating the class.  In the current 
committed version, there is a problem if skipTo() exhausts all the scorers.  It 
does not set more to false, so a subsequent call to next() attempts to 
access the non-existent first scorer.
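
To illustrate the failure mode, the following is a small sketch of the guard,
not the committed class or the patch.  ExhaustionGuardSketch and Cursor are
hypothetical names, and the scorers are kept in a plain list for brevity.
The point is only that skipTo() has to record exhaustion in the more flag so
that a later next() returns false instead of reading a scorer that is no
longer there.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the "set more to false" guard; not Lucene code. */
class ExhaustionGuardSketch {

    /** Stand-in for a sub-scorer: a cursor over strictly increasing doc ids. */
    static final class Cursor {
        private final int[] docs;
        private int pos = 0;

        Cursor(int... docs) { this.docs = docs; }

        int doc() { return docs[pos]; }

        /** Advances to the first doc >= target; false if there is none. */
        boolean advanceTo(int target) {
            while (pos < docs.length && docs[pos] < target) pos++;
            return pos < docs.length;
        }
    }

    private final List<Cursor> live = new ArrayList<>();
    private boolean more;
    private int current = -1;

    ExhaustionGuardSketch(List<Cursor> cursors) {
        for (Cursor c : cursors) {
            if (c.docs.length > 0) live.add(c);
        }
        more = !live.isEmpty();
    }

    boolean skipTo(int target) {
        if (!more) return false;
        // Drop every cursor that runs out while advancing.
        live.removeIf(c -> !c.advanceTo(target));
        // The assignment the message describes as missing: record exhaustion.
        more = !live.isEmpty();
        if (more) {
            int min = live.get(0).doc();
            for (Cursor c : live) min = Math.min(min, c.doc());
            current = min;
        }
        return more;
    }

    /** Safe after exhaustion: never touches the (now empty) list. */
    boolean next() { return more && skipTo(current + 1); }

    int doc() { return current; }

    public static void main(String[] args) {
        List<Cursor> cursors = new ArrayList<>();
        cursors.add(new Cursor(1, 5));
        cursors.add(new Cursor(3));
        ExhaustionGuardSketch s = new ExhaustionGuardSketch(cursors);
        System.out.println(s.skipTo(4) + " " + s.doc());
        System.out.println(s.skipTo(6)); // exhausts every cursor
        System.out.println(s.next());    // stays false, no exception
    }
}
```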

It would be nice to get some form of DistributingMultiFieldQueryParser in so 
that this is easy to use.

Thanks to Yonik for committing this functionality!

Chuck


> [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate 
> support for queries across multiple fields
> -----------------------------------------------------------------------------------------------------------------
>
>          Key: LUCENE-323
>          URL: http://issues.apache.org/jira/browse/LUCENE-323
>      Project: Lucene - Java
>         Type: Bug
>   Components: QueryParser
>     Versions: 1.4
>  Environment: Operating System: Windows XP
> Platform: PC
>     Reporter: Chuck Williams
>     Assignee: Lucene Developers
>  Attachments: DisjunctionMaxQuery.java, DisjunctionMaxScorer.java, 
> TestDisjunctionMaxQuery.java, TestMaxDisjunctionQuery.java, TestRanking.zip, 
> TestRanking.zip, TestRanking.zip, WikipediaSimilarity.java, 
> WikipediaSimilarity.java, WikipediaSimilarity.java, dms.tar.gz
>
> The attached test case demonstrates this problem and provides a fix:
>   1.  Use a custom similarity to eliminate all tf and idf effects, just to 
> isolate what is being tested.
>   2.  Create two documents doc1 and doc2, each with two fields title and 
> description.  doc1 has "elephant" in title and "elephant" in description.  
> doc2 has "elephant" in title and "albino" in description.
>   3.  Express query for "albino elephant" against both fields.
> Problems:
>       a.  MultiFieldQueryParser won't recognize either document as containing 
> both terms, due to the way it expands the query across fields.
>       b.  Expressing query as "title:albino description:albino title:elephant 
> description:elephant" will score both documents equivalently, since each 
> matches two query terms.
>   4.  Comparison to MaxDisjunctionQuery and my method for expanding queries 
> across fields.  Using notation that () represents a BooleanQuery and ( | ) 
> represents a MaxDisjunctionQuery, "albino elephant" expands to:
>         ( (title:albino | description:albino)
>           (title:elephant | description:elephant) )
> This will recognize that doc2 has both terms matched while doc1 only has 1 
> term matched, and so scores doc2 over doc1.
> Refinement note:  the actual expansion for "albino elephant" that I use is:
>         ( (title:albino | description:albino)~0.1
>           (title:elephant | description:elephant)~0.1 )
> This causes the score of each MaxDisjunctionQuery to be the score of the 
> highest-scoring MDQ subclause plus 0.1 times the sum of the scores of the 
> other MDQ subclauses.  Thus, doc1 gets some credit for also having 
> "elephant" in the description but only 1/10 as much as doc2 gets for 
> covering another query term in its description.  If doc3 has "elephant" in 
> title and both "albino" and "elephant" in the description, then with the 
> actual refined expansion, it gets the highest score of all (whereas with 
> pure max, without the 0.1, it would get the same score as doc2).
> In real apps, tf's and idf's also come into play of course, but can affect 
> these either way (i.e., mitigate this fundamental problem or exacerbate it).
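
The arithmetic in the quoted description can be checked with a small sketch.
It is not Lucene code: DisMaxScoreSketch and its methods are hypothetical, and
it assumes every matching field clause scores exactly 1.0 (tf and idf
eliminated, as in step 1) and that the enclosing BooleanQuery simply sums its
clauses, ignoring coord and norms.

```java
/** Hypothetical worked example for the scoring rule quoted above; not Lucene code. */
class DisMaxScoreSketch {

    /** Highest clause score plus tieBreaker times the sum of the other clause scores. */
    static double disMax(double tieBreaker, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    // Assumption: a matching clause scores 1.0 and a non-matching clause 0.0.
    private static double m(boolean matched) { return matched ? 1.0 : 0.0; }

    /** ( (title:albino | description:albino)~tie (title:elephant | description:elephant)~tie ) */
    static double expanded(double tie, boolean titleAlbino, boolean descAlbino,
                           boolean titleElephant, boolean descElephant) {
        return disMax(tie, m(titleAlbino), m(descAlbino))
             + disMax(tie, m(titleElephant), m(descElephant));
    }

    /** Problem (b): the flat four-clause BooleanQuery just sums every match. */
    static double flat(boolean titleAlbino, boolean descAlbino,
                       boolean titleElephant, boolean descElephant) {
        return m(titleAlbino) + m(descAlbino) + m(titleElephant) + m(descElephant);
    }

    public static void main(String[] args) {
        // doc1: "elephant" in title and description
        // doc2: "elephant" in title, "albino" in description
        // doc3: "elephant" in title, "albino" and "elephant" in description
        boolean[][] docs = {
            {false, false, true, true},
            {false, true, true, false},
            {false, true, true, true},
        };
        for (int i = 0; i < docs.length; i++) {
            boolean[] d = docs[i];
            System.out.println("doc" + (i + 1)
                + " flat=" + flat(d[0], d[1], d[2], d[3])
                + " max=" + expanded(0.0, d[0], d[1], d[2], d[3])
                + " max~0.1=" + expanded(0.1, d[0], d[1], d[2], d[3]));
        }
    }
}
```

Under those assumptions the flat query gives doc1 and doc2 the same score
(2.0 each), pure max gives 1.0, 2.0 and 2.0 for doc1, doc2 and doc3, and the
0.1 tie-breaker gives roughly 1.1, 2.0 and 2.1, which matches the ordering
described in the quoted text.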

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

