Hi,
I emailed a question earlier about the difference between OR and AND in a
Boolean query. For what I am trying to do, I need AND to behave like an
OR (what I like to call a "soft AND"), and I need OR to behave like a
logical OR, meaning that I don't want to reward documents that contain more
of the OR operands. It is easy for me to fix the AND, but is there a
straightforward way of fixing the OR?
Many thanks!
-Ghinwa
On Sun, 9 Mar 2008, Mark Miller wrote:
I have been trying to understand all of this better myself, so while I am no
expert, here is my take:
Lucene is really a combined Vector Space / Boolean Model search engine.
At its core, Lucene is essentially a Vector Space Model search engine:
scoring is done by comparing a query term vector to each of the document term
vectors. However, on top of this, Lucene allows a Boolean Model by
constraining results using a BooleanQuery.
So when Lucene computes the score for "mark OR mandy", the idea is the same as
for "mark AND mandy". The difference is that the BooleanQuery treats the
Must and Should clauses differently: if a term is labeled Must but is not in
the document, the document won't match. If a Should term is not in the
document, the BooleanQuery does not exclude the document on that account;
the term simply contributes 0 towards the similarity score. The BooleanQuery
kind of clamps down on top of the Vector Space term vector similarity scoring,
allowing for a hybrid system.
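
To make the Must/Should distinction concrete, here is a toy sketch in Python. It is not Lucene code: the function `boolean_score` and the per-term scores are made up for illustration, and it assumes that a query with only Should clauses needs at least one of them to match.

```python
# Toy model of Must/Should handling; not Lucene's implementation.
def boolean_score(doc_terms, must, should, term_score):
    """Return None if the document does not match, else the summed score."""
    present = set(doc_terms)
    # A missing Must term excludes the document outright.
    if any(t not in present for t in must):
        return None
    # Assumption: with no Must clauses, at least one Should term must be present.
    if not must and not any(t in present for t in should):
        return None
    # Present terms add their score; an absent Should term adds 0.
    return sum(term_score[t] for t in list(must) + list(should) if t in present)

term_score = {"mark": 0.5, "mandy": 0.25}  # hypothetical per-term scores
doc = ["mark", "miller"]

# "mark AND mandy": both terms are Must, so this document is excluded.
and_result = boolean_score(doc, must=["mark", "mandy"], should=[], term_score=term_score)
# "mark OR mandy": both terms are Should, so the document matches on "mark" alone.
or_result = boolean_score(doc, must=[], should=["mark", "mandy"], term_score=term_score)
print(and_result, or_result)
```

The scoring step is identical in both calls; only the matching rule differs.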
The coord factor essentially juices the term vector similarity score based on
how many query terms are in the document. Term overlap is already taken into
account during the term vector similarity part, but apparently users don't
like how that ranks: users intuitively think that sharing more terms
between document and query is more important than sharing fewer, very highly
weighted terms. So basically, coord is just trying to reorder things a bit
based on reported user expectations.
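
A small Python sketch of that reordering, assuming coord is the fraction of query terms found in the document (overlap / maxOverlap, which is how I read the DefaultSimilarity javadoc). The per-term scores and the `score` helper are invented for illustration.

```python
def coord(overlap, max_overlap):
    # Fraction of the query's terms that appear in the document.
    return overlap / max_overlap

def score(matched_term_scores, num_query_terms, use_coord=True):
    # Sum the scores of the matching terms, then optionally scale by coord.
    raw = sum(matched_term_scores)
    if not use_coord:
        return raw
    return raw * coord(len(matched_term_scores), num_query_terms)

# Hypothetical three-term query.
doc_a = [0.3, 0.3]  # matches two terms, each with a modest weight
doc_b = [0.7]       # matches one highly weighted term

print(score(doc_a, 3, use_coord=False), score(doc_b, 3, use_coord=False))
print(score(doc_a, 3), score(doc_b, 3))
```

Without coord, doc_b ranks above doc_a; with coord the order flips, because doc_a covers more of the query.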
- Mark
Ghinwa Choueiter wrote:
But shouldn't the coord factor kick in with AND instead of OR? I understand
why you would want to use coord in the case of AND, where you give a higher
reward to documents that contain most of the terms in the query. However, in
the case of OR, it should not matter whether all the OR operands are in the
document, should it?
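
To spell out the difference, here is a toy Python comparison for (life OR place OR time). Both functions are hypothetical illustrations, not Lucene APIs: `sum_with_coord` assumes the sum-times-coord behaviour discussed in this thread, and `logical_or` is one possible reading of a "logical OR", where only the best matching operand counts.

```python
def sum_with_coord(matched, num_clauses):
    # Sum of the matching operands' scores, scaled by the fraction matched.
    return sum(matched) * len(matched) / num_clauses

def logical_or(matched):
    # Hypothetical "logical OR": extra matching operands earn no reward.
    return max(matched)

one_term = [0.4]             # document contains only "life"
all_terms = [0.4, 0.4, 0.4]  # document contains "life", "place" and "time"

print(sum_with_coord(one_term, 3), sum_with_coord(all_terms, 3))
print(logical_or(one_term), logical_or(all_terms))
```

Under the first scheme the document with all three terms scores much higher; under the second both documents score the same.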
-Ghinwa
----- Original Message -----
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: <java-user@lucene.apache.org>
Sent: Sunday, March 09, 2008 1:22 PM
Subject: Re: Scoring a query with OR's
On Mar 9, 2008, at 12:39 PM, Ghinwa Choueiter wrote:
But what exactly happens when there are ORs, e.g. (life OR place OR
time)?
The scoring equation can get a score for life, place, and time separately,
but what does it do with them then? Does it also add them?
The coord factor kicks in then:
<http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/DefaultSimilarity.html#coord(int,%20int)>
the formula listed here should help too:
<http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html>
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]