On Jul 23, 2005, at 4:45 AM, Ahmed El-dawy wrote:
Only terms returned from the Analyzer are considered, so if a stop
word is removed it does not count for tf or idf.
But I need to compare according to non-indexed words as well. By the way,
Google does this.
Please provide an example or reference to support this claim.
Perhaps Google is doing something like what Nutch does by default:
a bi-gram technique that joins each common term with the term that
follows it and indexes the joined token at the same position as the
original (overlapping it position-increment-wise). This technique
keeps searches fast when stop words need to be considered, while
still avoiding searches on stop words when the query is not a
phrase query.
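
A rough illustration of the bi-gram idea described above, in plain Java. This is a hypothetical sketch and not Nutch's actual code: the class name, the Token type, and the "_" separator are assumptions made only for illustration. Each common term is kept, and an extra joined token is emitted with a position increment of 0 so that it overlaps the common term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of common-term bi-grams; not Nutch's real implementation. */
final class CommonGramsSketch {

    /** A token plus its position increment relative to the previous token. */
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    /**
     * Emits every word as a normal token. When a word is a common term and
     * has a successor, also emits "word_successor" at the same position
     * (position increment 0). The "_" separator is an assumption.
     */
    static List<Token> analyze(List<String> words, Set<String> commonTerms) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            out.add(new Token(word, 1));
            if (commonTerms.contains(word) && i + 1 < words.size()) {
                out.add(new Token(word + "_" + words.get(i + 1), 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the", "a", "of"));
        System.out.println(analyze(Arrays.asList("the", "quick", "fox"), common));
    }
}
```

With tokens like these, a phrase query containing a stop word can look up the single joined token instead of walking the very long postings list of the stop word itself.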
This will happen automatically with PhraseQuery with a slop factor.
The closer the words, the better the score. However, with a pure
boolean query, proximity is not considered at all (nor should it
be). You can use a large slop factor for phrases such as "quick
fox"~100 and see how the scores work then.
This means that all words must be in the result. This is not always
the case in my application. If I am searching for "quick brown fox",
"quick fox" is an acceptable result.
In the case of single-term queries boolean OR'd together, Similarity's
coord factor boosts results that match more of the clauses. This
does not take proximity of the words into consideration.
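
To make the coord factor concrete, here is a small plain-Java sketch. It assumes the default Similarity computes coord as overlap / maxOverlap, which is my understanding of the default behaviour; the class and method names here are hypothetical and this is not Lucene's own code. The real score also multiplies in tf, idf, and norms, which are left out.

```java
/** Hypothetical sketch of a coord-style factor; assumes overlap / maxOverlap. */
final class CoordSketch {

    /**
     * @param overlap    number of query clauses the document matched
     * @param maxOverlap total number of clauses in the query
     */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // Query: quick OR brown OR fox (three clauses).
        System.out.println("matches all three:  " + coord(3, 3));
        System.out.println("matches two of them: " + coord(2, 3));
    }
}
```

A document containing only "quick fox" still matches, but with a smaller factor than one containing all three terms. Nothing in this factor looks at where the terms occur.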
I just need to know whether I need to re-sort the search results
according to my criteria, or whether there are methods I can override
that will return results already sorted.
It seems like you're asking for a type of Query that does not
currently exist: one that does a boolean OR but scores based on the
proximity of the matching terms. Without looking it up, perhaps
SpanOrQuery already does this sort of thing, though I don't think so.
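
If re-sorting after the search turns out to be the route taken, one possible approach is sketched below in plain Java. It is only an illustration under stated assumptions: hits are represented as already-tokenized word lists, and the ranking rule (more distinct query terms first, then the smallest window containing those terms) is made up for this example. None of the names are Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical post-search re-sort: OR semantics, ranked by proximity. */
final class ProximityResortSketch {

    /** Number of distinct query terms that occur in the document. */
    static int matchedTerms(List<String> tokens, Set<String> queryTerms) {
        Set<String> seen = new HashSet<>(tokens);
        seen.retainAll(queryTerms);
        return seen.size();
    }

    /**
     * Length in tokens of the smallest window that contains every query
     * term present in the document; Integer.MAX_VALUE if none is present.
     */
    static int smallestWindow(List<String> tokens, Set<String> queryTerms) {
        int needed = matchedTerms(tokens, queryTerms);
        if (needed == 0) {
            return Integer.MAX_VALUE;
        }
        Map<String, Integer> counts = new HashMap<>();
        int best = Integer.MAX_VALUE;
        int left = 0;
        for (int right = 0; right < tokens.size(); right++) {
            String t = tokens.get(right);
            if (queryTerms.contains(t)) {
                counts.merge(t, 1, Integer::sum);
            }
            // Shrink from the left while the window still covers all terms.
            while (counts.size() == needed) {
                best = Math.min(best, right - left + 1);
                String l = tokens.get(left++);
                if (queryTerms.contains(l)) {
                    int c = counts.get(l) - 1;
                    if (c == 0) {
                        counts.remove(l);
                    } else {
                        counts.put(l, c);
                    }
                }
            }
        }
        return best;
    }

    /** Returns a copy of hits: most matched terms first, then tightest window. */
    static List<List<String>> resort(List<List<String>> hits, Set<String> queryTerms) {
        List<List<String>> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator
                .comparingInt((List<String> d) -> -matchedTerms(d, queryTerms))
                .thenComparingInt(d -> smallestWindow(d, queryTerms)));
        return sorted;
    }
}
```

With the query terms quick, brown, fox, a hit containing "quick brown fox" sorts ahead of one containing only "quick fox", which in turn sorts ahead of a hit where "quick" and "fox" are far apart. This needs the token positions of each hit, so it is only practical for a modest number of top results.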
Erik
On 7/22/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:
On Jul 22, 2005, at 9:59 AM, Ahmed El-dawy wrote:
Hello,
I am using Lucene to search plain text, but the order of the search
results does not satisfy my needs. First, I want to know how the
similarity works. Then, I need to extend it.
Use IndexSearcher.explain() to see how each individual hit is scored
against a Query - this will be the clearest way to see why things
score the way they do.
First, does the similarity class work on analyzed text or original
search text? To be precise, does it count the stop words as found
terms or not?
Only terms returned from the Analyzer are considered, so if a stop
word is removed it does not count for tf or idf.
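
A tiny plain-Java illustration of that point, as a sketch only: the tokenizer here (lowercase, split on non-word characters) and the class name are assumptions standing in for a real Analyzer. Stop words are dropped before counting, so they never get a term frequency at all.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: term frequencies are counted only on analyzed terms. */
final class StopWordTfSketch {

    /** Lowercases, splits on non-word characters, drops stop words, counts the rest. */
    static Map<String, Integer> termFrequencies(String text, Set<String> stopWords) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty() || stopWords.contains(term)) {
                continue; // removed by the "analyzer", so invisible to scoring
            }
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }
}
```

Searching such an index for a stop word finds nothing to score, which is why the Similarity never sees it.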
Second, I want to add a factor for how close together the terms of
the query are found in the text. For example, when I search for
"quick fox", "fox quick" and "quick brown fox" should rank lower
than "quick fox".
This will happen automatically with PhraseQuery with a slop factor.
The closer the words, the better the score. However, with a pure
boolean query, proximity is not considered at all (nor should it
be). You can use a large slop factor for phrases such as "quick
fox"~100 and see how the scores work then.
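
To show roughly why closer matches score better, here is a plain-Java sketch. It assumes the default Similarity weights a sloppy phrase match by 1 / (distance + 1), which is my understanding of the default; the class name is hypothetical and the rest of the scoring formula (idf, norms, boosts) is left out. The distances used are the number of position moves needed to line the text up with the query phrase, where swapping two adjacent words costs two moves.

```java
/** Hypothetical sketch of sloppy-phrase weighting; assumes 1 / (distance + 1). */
final class SloppyFreqSketch {

    /** Weight given to one phrase match that is `distance` moves from exact. */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // Query phrase: "quick fox" with a generous slop.
        System.out.println("\"quick fox\"       distance 0: " + sloppyFreq(0));
        System.out.println("\"quick brown fox\" distance 1: " + sloppyFreq(1));
        System.out.println("\"fox quick\"       distance 2: " + sloppyFreq(2));
    }
}
```

Under this assumption an exact match contributes the most, one intervening word halves the contribution, and a reversed pair contributes less again, which matches the ordering asked for above.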
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
Regards,
Ahmed Saad