Have you looked at the explains to see what is coming out of the
FuzzyQuery? Also, are you using Hits to get that score? Scores get
normalized to 1 by that process.
-Grant
On Apr 11, 2007, at 2:06 AM, Michael Barbarelli wrote:
Hello.
I am using Lucene to submit fuzzy queries against an index. I have
noticed
that relevant matches are often retreived, but the scoring is not
at all
what I expected.
For example, if my query is "rightches~", a reference to a text
file with
the single word "righteous" is returned with a score of 100 percent.
However, I think the actual score should be somewhere in the
neighborhood of
.66, not 1. Anyone follow me? Degree of similarity is what I want
in this
case.
But Lucene score does not take into account how well a term matches a
FuzzyQuery. That just seems to be the way Lucene is built
currently. The
score is based on term frequency of the actual matching term.
FuzzyQuery
gets rewritten as a BooleanQuery with all matching terms OR'd.
Degree of similarity is what I want in this case. When
"rightches~" matches
"rightheous", I should get a similarity score of about .66.
What I want is to get at the raw difference that Lucene uses: the
Levenstein distance algorithm. I think I'll need to use the code in
FuzzyTermEnum.java (or .cs) as a starting point. I figure I can can
probably
use that code directly somehow, or at least borrow the similarity
computation.
Frankly, though, I'm not sure I'm treading down the right path on
this. Can
anyone help with specifics, past experience, or examples?
Cheers,
Mike
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/
LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]