Can you share your spellchecker setup and the code for the test case? I would like to reproduce it and see what's going on.


On Oct 7, 2008, at 2:18 PM, Jason Rennie wrote:

On Tue, Oct 7, 2008 at 11:56 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

Is there any way you can write up a small test case? This definitely sounds
like a bug.


I tried adding single-word documents according to the top ten suggestions and
frequencies for "chanl". That is, I created a fresh index, then added 834
"chanel" docs; 10 "chant" docs; 8 "chang" docs; 4 "chani" docs; 1 doc each of
"chand", "chana", "charl" and "chane"; 106 docs of "chan"; and 1950 docs of
"chair". The fact that "chan" would come after the single-frequency terms
seems wrong to me.

I'm guessing the "FuzzyQuery score"
(http://wiki.apache.org/jakarta-lucene/SpellChecker) may be the reason for
some of the weird results I'm seeing. Based on what I've seen and also
according to the SpellChecker wiki, it sounds like ordering is done first by
this FuzzyQuery score ((edit distance)/(length of word)), then by
popularity. This seems to explain "chan" coming after "chand" (above),
"candyĆ¢" coming before "candy" and "yell" coming before "yello".

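To make the suspected ordering concrete, here is a minimal sketch in plain
Java (it does not call Lucene or Solr). It assumes the score is
1 - (edit distance)/(length of the shorter word); that normalization is an
assumption, chosen because it reproduces the order reported above, and is not
confirmed as what the SpellChecker actually does. The class name, the method
names and the threadFrequencies() helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SuggestionOrderSketch {

    // Plain Levenshtein distance: insert, delete and substitute each cost 1.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // ASSUMPTION: normalize by the length of the shorter word.
    static double similarity(String query, String candidate) {
        int shorter = Math.min(query.length(), candidate.length());
        return 1.0 - (double) editDistance(query, candidate) / shorter;
    }

    // Order candidates by similarity (descending), then by document
    // frequency (descending). List.sort is stable, so full ties keep
    // their insertion order.
    static List<String> rank(String query, Map<String, Integer> docFreq) {
        List<String> words = new ArrayList<>(docFreq.keySet());
        words.sort((x, y) -> {
            int byScore = Double.compare(similarity(query, y), similarity(query, x));
            return byScore != 0
                    ? byScore
                    : Integer.compare(docFreq.get(y), docFreq.get(x));
        });
        return words;
    }

    // Hypothetical helper: the document counts from the test index above.
    static Map<String, Integer> threadFrequencies() {
        Map<String, Integer> f = new LinkedHashMap<>();
        f.put("chair", 1950);
        f.put("chan", 106);
        f.put("chanel", 834);
        f.put("chant", 10);
        f.put("chang", 8);
        f.put("chani", 4);
        f.put("chand", 1);
        f.put("chana", 1);
        f.put("charl", 1);
        f.put("chane", 1);
        return f;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = threadFrequencies();
        for (String w : rank("chanl", freq)) {
            System.out.printf("%-7s score=%.2f freq=%d%n",
                    w, similarity("chanl", w), freq.get(w));
        }
    }
}
```

Under that assumption "chan" scores 1 - 1/4 = 0.75 against "chanl", while
"chand" and the other one-edit candidates score 1 - 1/5 = 0.8, so "chan" sorts
after every one of them despite its 106 documents. Frequency only breaks ties
between equal scores.
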
On Tue, Oct 7, 2008 at 11:59 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

Again, probably b/c of the distance. What distance measure are you using?


I'm not specifying a distance measure.


No, it should run in both cases. Can you reproduce in a small test case?


In this test case I created, I searched for "chane" (with spellcheck=true)
and got one result.  When I searched for "chanel", it returned
numFound="834". I have "accuracy" set to 0.5. Should the spellchecker not
suggest "chanel" for the "chane" query?

Jason

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







