On 1/25/07, Walter Lewis <[EMAIL PROTECTED]> wrote:
I ran the debug against the two following queries:

   q=(James Sutherland) returns 13
   q=(James~0.75 Sutherland~0.75) returns 1

OK, I have an idea of what's going on... here are your two parsed
queries side by side:

+(+text:jame +text:sutherland) +searchSet:testSet
+(+text:james~0.75 +text:sutherland~0.75) +searchSet:testSet

I can tell from the first that this is a stemmed field... "james" is
transformed to "jame".
The Lucene query parser doesn't do stemming and other analysis for
things like prefix or fuzzy queries (that would have its own big set
of problems), but instead just lowercases.

So your second fuzzy query of "james~0.75" doesn't exactly match
what is indexed.

Lucene expands something like james~0.75 to the closest terms by edit distance.
But the number of terms is limited to BooleanQuery.maxClauseCount
(1024 by default).  So my guess is that there are more than 1024 other
terms closer to "james" than "jame" is, so "jame" (the actual indexed
form of "james" when it is stemmed) isn't included.
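
To make that guess concrete, here is a rough sketch of the idea: rank the
indexed terms by edit distance to the query term and keep only the closest N.
This is not the actual FuzzyQuery/FuzzyTermEnum code; the class name, the
method names, and the tie-breaking (input order) are all made up for
illustration, and plain Levenshtein distance stands in for Lucene's own
similarity score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; NOT Lucene's FuzzyQuery/FuzzyTermEnum implementation.
final class FuzzyExpansionSketch {

    // Classic Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keep only the maxClauses indexed terms closest to the query term.
    // The sort is stable, so ties stay in input order (an arbitrary choice here).
    static List<String> expand(String queryTerm, List<String> indexedTerms, int maxClauses) {
        List<String> sorted = new ArrayList<>(indexedTerms);
        sorted.sort(Comparator.comparingInt((String t) -> editDistance(queryTerm, t)));
        return new ArrayList<>(sorted.subList(0, Math.min(maxClauses, sorted.size())));
    }
}
```

With indexed terms "games", "names", "jame", "sutherland" and a limit of 2,
expand("james", ...) keeps only "games" and "names", so the stemmed form
"jame" is dropped even though it is just one edit away. Raising the limit to 3
brings it back, which is the effect a larger maxClauseCount would have.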

I'm not an expert at edit distance, but the implementing classes are here:
http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/FuzzyTermEnum.java?view=markup
http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/FuzzyQuery.java?view=markup

So, you could:
- increase the size of maxClauseCount (it will slow down fuzzy and
wildcard type queries though)
- index the field twice using copyField, and then do fuzzy queries on
the non-stemmed version
- ask the Lucene list for other ideas

-Yonik
