I have been trying to understand all of this better myself, so while I
am no expert, here is my take:
Lucene is really a combined Vector Space / Boolean Model search engine.
At its core, Lucene is essentially a Vector Space Model search engine:
scoring is done by comparing a query term vector to each of the document
term vectors. However, on top of this, Lucene allows a Boolean Model by
constraining results using a BooleanQuery.
So when Lucene computes the score for "mark OR mandy", the idea is the same
as for "mark AND mandy". The difference is that the BooleanQuery
treats Must and Should clauses differently: if a term is labeled Must
but is not in the document, the document won't match. If a Should term
is not in the document, the BooleanQuery does not exclude the document on
that account, but the term contributes 0 towards the similarity
score. The BooleanQuery effectively clamps down on top of the Vector Space
term vector similarity scoring, allowing for a hybrid system.
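To make the Must/Should difference concrete, here is a rough sketch in plain Python. It is not Lucene code: the Clause class, the boolean_score function, and the per-term scores are all made up for illustration, and the real scoring formula has more factors than a plain sum.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    term: str
    required: bool  # True models a Must clause, False a Should clause


def boolean_score(clauses, term_scores):
    """Return the summed score, or None if the document does not match.

    term_scores is a hypothetical mapping from each term present in the
    document to that term's similarity contribution.
    """
    total = 0.0
    matched = 0
    for clause in clauses:
        if clause.term in term_scores:
            total += term_scores[clause.term]
            matched += 1
        elif clause.required:
            return None  # a missing Must term rules the document out
        # a missing Should term just adds nothing to the total
    if matched == 0:
        return None  # no clause matched at all
    return total


doc = {"mark": 0.5}  # the document contains "mark" but not "mandy"
or_query = [Clause("mark", False), Clause("mandy", False)]
and_query = [Clause("mark", True), Clause("mandy", True)]
print(boolean_score(or_query, doc))   # matches on "mark" alone
print(boolean_score(and_query, doc))  # no match, "mandy" is missing
```

The scoring loop is identical for both queries; only the handling of a missing term differs.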
The coord factor essentially boosts the term vector similarity score
based on how many query terms are in the document. Term overlap is
already taken into account during the term vector similarity part, but
apparently users don't like how that ranks on its own: for example, users
intuitively think that sharing more terms between document and query is
more important than sharing fewer, very highly weighted terms. So
basically, coord just tries to reorder things a bit based on reported
user expectations.
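Here is a small sketch of that reordering, assuming the default coord is simply overlap / maxOverlap as described in the DefaultSimilarity javadoc. The score function and the per-term numbers are invented for the example, and the other factors in the real formula are left out.

```python
def coord(overlap, max_overlap):
    # Fraction of the query terms that were found in the document.
    return overlap / max_overlap


def score(query_terms, term_scores, use_coord=True):
    # term_scores: hypothetical per-term contributions for one document.
    matched = [t for t in query_terms if t in term_scores]
    raw = sum(term_scores[t] for t in matched)
    if use_coord:
        return raw * coord(len(matched), len(query_terms))
    return raw


query = ["life", "place", "time"]            # life OR place OR time
doc_a = {"life": 0.9}                        # one highly weighted term
doc_b = {"life": 0.25, "place": 0.25, "time": 0.25}  # all three terms

for name, doc in (("A", doc_a), ("B", doc_b)):
    print(name, score(query, doc, use_coord=False), score(query, doc))
```

Without coord, document A ranks first on the strength of its single heavy term. With coord, A's sum is scaled by 1/3 while B's is scaled by 3/3, so B moves ahead. This is also why coord is relevant to an OR query: the number of matching operands can vary from document to document.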
- Mark
Ghinwa Choueiter wrote:
But shouldn't the coord factor kick in with AND instead of OR? I
understand why you would want to use coord in the case of AND, where
you give a larger reward to documents that contain most of the terms in
the query. However, in the case of OR, it should not matter whether all
the OR operands are in the document?
-Ghinwa
----- Original Message ----- From: "Erik Hatcher"
<[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Sunday, March 09, 2008 1:22 PM
Subject: Re: Scoring a query with OR's
On Mar 9, 2008, at 12:39 PM, Ghinwa Choueiter wrote:
But what exactly happens when there are ORs, e.g. (life OR
place OR time)?
The scoring equation can get a score for life, place, time
separately, but what does it do with them then? Does it also add them?
The coord factor kicks in then:
<http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/DefaultSimilarity.html#coord(int,%20int)>
the formula listed here should help too:
<http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html>
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]