Re: spellcheck: issues

Jason Rennie Tue, 07 Oct 2008 11:19:17 -0700

On Tue, Oct 7, 2008 at 11:56 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Is there any way you can write up a small test case?  This definitely sounds
> like a bug.

I tried adding single-word documents according to the top ten suggestions
and frequencies for "chanl".  I.e., I created a fresh index, then added 834
"chanel" docs; 10 "chant" docs; 8 "chang" docs; 4 "chani" docs; 1 doc each
of "chand", "chana", "charl" and "chane"; 106 docs of "chan"; and 1950 docs
of "chair".  Having "chan" (106 docs) come after the single-frequency terms
in the suggestions seems wrong to me.

I'm guessing the "FuzzyQuery score" (
http://wiki.apache.org/jakarta-lucene/SpellChecker) may be the reason for
some of the weird results I'm seeing.  Based on what I've seen and also
according to the SpellChecker wiki, it sounds like ordering is done first by
this FuzzyQuery score ((edit distance)/(length of word)), then by
popularity.  This seems to explain "chan" coming after "chand" (above),
"candyâ" coming before "candy" and "yell" coming before "yello".
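
To make that guess concrete, here is a rough sketch in Python (not the
actual Lucene code) of ranking by a normalized edit-distance score first and
by frequency second.  The function names are made up for illustration, and
normalizing by the length of the shorter word is my assumption about what
the wiki means by "length of word".

```python
# Sketch only: ranks suggestions by normalized edit-distance score,
# then by document frequency. Not the Lucene SpellChecker implementation.

def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def score(query: str, term: str) -> float:
    """ASSUMPTION: 1 - distance / length of the shorter word."""
    return 1.0 - levenshtein(query, term) / min(len(query), len(term))


def rank(query: str, freqs: dict, accuracy: float = 0.5) -> list:
    """Drop terms scoring below `accuracy`; sort by score, then frequency."""
    scored = [(term, score(query, term), freq) for term, freq in freqs.items()]
    kept = [item for item in scored if item[1] >= accuracy]
    return sorted(kept, key=lambda item: (-item[1], -item[2]))


# Term frequencies from the test index described above.
FREQS = {
    "chanel": 834, "chant": 10, "chang": 8, "chani": 4,
    "chand": 1, "chana": 1, "charl": 1, "chane": 1,
    "chan": 106, "chair": 1950,
}

ranked = rank("chanl", FREQS)
for term, term_score, freq in ranked:
    print(f"{term:8s} score={term_score:.2f} freq={freq}")
```

Under that assumption, every five- and six-letter term at edit distance 1
from "chanl" scores 0.8, while "chan" scores 1 - 1/4 = 0.75, so "chan" sorts
after "chand" and the other single-frequency terms despite its 106 docs.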

On Tue, Oct 7, 2008 at 11:59 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Again, probably b/c of the distance.  What distance measure are you using?

I'm not specifying a distance measure.

> No, it should run in both cases.  Can you reproduce in a small test case?

In this test case I created, I searched for "chane" (with spellcheck=true)
and got one result.  When I searched for "chanel", it returned
numFound="834".  I have "accuracy" set to 0.5.  Should the spellchecker not
suggest "chanel" for the "chane" query?

Jason
