Re: Extending the similarity class

Erik Hatcher Sat, 23 Jul 2005 05:22:33 -0700


On Jul 23, 2005, at 4:45 AM, Ahmed El-dawy wrote:

Only terms returned from the Analyzer are considered, so if a stop
word is removed it does not count for tf or idf.

But I need to compare according to non-indexed words also. By the way,
Google does this.


Please provide an example or reference to support this claim.

Perhaps Google is doing something like what Nutch does by default
with a bi-gram technique of joining terms that begin with a common
term with the successive term and overlapping it position-increment-
wise. This technique allows searches to be fast when stop words need
to be considered, but also optimized to avoid searching by stop words
when it is not a phrase query.
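
As a rough illustration of that bi-gram idea, here is a toy sketch in
plain Java. It is not Nutch's or Lucene's analyzer code: the Token
class, the "-" separator, and the CommonGramsSketch name are all
assumptions made up for this example. Each common term is emitted on
its own and also joined with the following word, with the joined token
given a position increment of 0 so it overlaps the common term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy illustration only; not Nutch's or Lucene's actual analysis code.
final class CommonGramsSketch {

    // Hypothetical token: term text plus its position increment.
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    // Emits every word, plus a joined "common-next" token at the same
    // position whenever a word is in the common-term set.
    static List<Token> analyze(List<String> words, Set<String> commonTerms) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            out.add(new Token(word, 1));
            if (commonTerms.contains(word) && i + 1 < words.size()) {
                // Increment 0: the bi-gram overlaps the common term.
                out.add(new Token(word + "-" + words.get(i + 1), 0));
            }
        }
        return out;
    }
}
```

With "the" as the only common term, the words "the quick fox" would
yield the tokens the, the-quick (same position as "the"), quick, fox.
A phrase query containing "the" can then match the rarer joined token
instead of walking the long posting list for "the".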

This will happen automatically with PhraseQuery with a slop factor.
The closer the words, the better the score.  However, with a pure
boolean query, proximity is not considered at all (nor should it
be).  You can use a large slop factor for phrases such as "quick
fox"~100 and see how the scores work then.
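
To make "the closer the words, the better the score" concrete, here
is a minimal sketch of the shape of the sloppy-phrase factor. To my
knowledge DefaultSimilarity.sloppyFreq uses 1 / (distance + 1), but
treat the class below as a standalone illustration (SloppyFreqSketch
is a made-up name), not as the library source. The distance is the
edit distance Lucene computes for a sloppy match; it is simply an
input here.

```java
// Standalone illustration of a 1 / (distance + 1) proximity factor.
final class SloppyFreqSketch {

    // distance 0 is an exact phrase match; larger distances mean the
    // matched terms are further from the order and spacing in the query.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }
}
```

An exact match contributes 1.0, a match one position off contributes
0.5, and the contribution keeps shrinking as the terms drift apart, so
a large slop such as ~100 still ranks tight matches first.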

This means that all words must be in the result. This is not always
the case in my application. If I am searching for "quick brown fox",
"quick fox" is an acceptable result.

In the case of single term queries boolean OR'd together, Similarity's
coord factor boosts results that have more clauses overlapped. This
does not take proximity of the words into consideration.
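
For reference, the coord factor amounts to the fraction of query
clauses a document matches. The sketch below follows the
overlap / maxOverlap form that, as far as I recall, DefaultSimilarity
uses; the CoordSketch class itself is only an illustration and not
Lucene code.

```java
// Illustration of a coordination factor: matched clauses / total clauses.
final class CoordSketch {

    // overlap: number of query clauses the document matched.
    // maxOverlap: total number of clauses in the query.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }
}
```

For a three-term OR query, a document containing only "quick" and
"fox" gets 2/3 while one containing all three terms gets 1.0,
regardless of how far apart the words are.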

I just need to know whether I need to resort the search results
according to my criteria, or there are some methods to override which
will bring results already sorted.

It seems like you're asking for a different type of Query than
currently exists that can do a boolean OR but score based on
proximity of the matching terms. Without looking it up, perhaps
SpanOrQuery already does this sort of thing - though I don't think so.
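
If re-sorting the hits yourself turns out to be the route, one
possible approach is sketched below in plain Java. Everything in it is
an assumption for illustration: it works on already-tokenized document
text rather than on Lucene Hits, the ProximityRerankSketch name is
made up, and the formula (fraction of query terms matched, multiplied
by matched terms over the smallest window containing them) is just one
plausible choice, not anything Lucene provides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical post-search re-ranking by term overlap and proximity.
final class ProximityRerankSketch {

    // Length of the smallest token window containing every query term
    // that occurs in the document; 0 if no query term occurs.
    static int smallestWindow(List<String> tokens, Set<String> terms) {
        Set<String> present = new HashSet<>(tokens);
        present.retainAll(terms);
        if (present.isEmpty()) {
            return 0;
        }
        Map<String, Integer> counts = new HashMap<>();
        int best = Integer.MAX_VALUE;
        int left = 0;
        int covered = 0;
        for (int right = 0; right < tokens.size(); right++) {
            String t = tokens.get(right);
            if (!present.contains(t)) {
                continue;
            }
            if (counts.merge(t, 1, Integer::sum) == 1) {
                covered++;
            }
            // Shrink from the left while the window still covers all terms.
            while (covered == present.size()) {
                best = Math.min(best, right - left + 1);
                String dropped = tokens.get(left++);
                if (present.contains(dropped)
                        && counts.merge(dropped, -1, Integer::sum) == 0) {
                    covered--;
                }
            }
        }
        return best;
    }

    // (matched / query terms) * (matched / smallest window).
    static float score(List<String> tokens, Set<String> terms) {
        int window = smallestWindow(tokens, terms);
        if (window == 0) {
            return 0.0f;
        }
        Set<String> present = new HashSet<>(tokens);
        present.retainAll(terms);
        int matched = present.size();
        return (matched / (float) terms.size()) * (matched / (float) window);
    }

    // Returns a copy of docs sorted by descending score.
    static List<List<String>> rerank(List<List<String>> docs, Set<String> terms) {
        List<List<String>> sorted = new ArrayList<>(docs);
        sorted.sort((a, b) -> Float.compare(score(b, terms), score(a, terms)));
        return sorted;
    }
}
```

For the query terms quick, brown, fox, a document "quick brown fox"
scores highest, "quick fox" is still returned but ranks lower, and
documents where the matched words are spread far apart sink further.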


    Erik



On 7/22/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:


On Jul 22, 2005, at 9:59 AM, Ahmed El-dawy wrote:

Hello,
I am using Lucene to search plain text, but the order of the search
results is not satisfying to my needs. First, I want to know how the
similarity works. Then, I need to extend it.


Use IndexSearcher.explain() to see how each individual hit is scored
against a Query - this will be the clearest way to see why things
score the way they do.

  First, does the similarity class work on analyzed text or original
search text? To be precise, does it count the stop words as found
terms or not?


Only terms returned from the Analyzer are considered, so if a stop
word is removed it does not count for tf or idf.

Second, I want to add a factor of how relative are the terms of the
query found in text. For example, when I search for "quick fox", "fox
quick" and "quick brown fox" will be less ranked than "quick fox".


This will happen automatically with PhraseQuery with a slop factor.
The closer the words, the better the score.  However, with a pure
boolean query, proximity is not considered at all (nor should it
be).  You can use a large slop factor for phrases such as "quick
fox"~100 and see how the scores work then.

    Erik



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Regards,
Ahmed Saad
