Re: spellcheck.onlyMorePopular

Marcus Stratmann Fri, 13 Feb 2009 07:17:14 -0800

Shalin Shekhar Mangar wrote:

If onlyMorePopular=true, then the algorithm finds tokens which have greater
frequency than the searched term. Among these terms, the one which is
closest (by edit distance) is returned.

Okay, this is a bit weird, but I think I got it now. Let me try to explain
it using my example. When I search for "gran" (frequency 10) I get the
suggestion "grand" (frequency 17) when using onlyMorePopular=true. When I
use onlyMorePopular=false there are no suggestions at all. This is because
there are some (rare) terms which are closer to "gran" than "grand", but
none of them is considered, because their frequency is below 10. Is that
correct? But then, why isn't "grand" promoted to first place and returned
as a valid suggestion?

I think I now understand the source of the confusion. onlyMorePopular=true
is a special behavior which uses *only* those tokens which have higher
frequency than the searched term. onlyMorePopular=false just switches off
this special behavior. It does *not* limit suggestions to tokens which have
lesser frequency than the searched term. In fact, onlyMorePopular=false does
not use frequency of tokens at all. We should document this clearly to avoid
such confusions in the future.

I'm still missing the two parameters accuracy and spellcheck.count. Let me
try to explain how I (now) think the algorithm works:


1) Take all terms from the index as a basic set.

2) If onlyMorePopular=true, remove all terms from the basic set which have
a frequency below the frequency of the search term.

3) Sort the basic set by distance to the search term and keep the
<spellcheck.count> terms with the smallest distance which are "within
accuracy".

4) Remove terms which have a lower frequency than the search term in the
case onlyMorePopular=false.

5) Return the remaining terms as suggestions.

Point 3 would explain why I do not get any suggestions for "gran" with
onlyMorePopular=false. Nevertheless I think this is a bug, since point 3
should take the frequency into account as well and promote suggestions
with high enough frequency if suggestions with low frequency are deleted.

But this is just my assumption on how the algorithm works, which explains
why there are no suggestions using onlyMorePopular=false. Maybe I am wrong,
but somewhere in the process "grand" is deleted from the result set.
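
To make the five steps above concrete, here is a minimal Python sketch of
the hypothesized pipeline. It is only an illustration of my assumption, not
the actual spell checker code. The toy index, the function names, the
similarity formula (1 - edit distance / longer length) used for "within
accuracy", and the alphabetical tie-break are all made up for this example.

```python
# Toy term -> frequency index (hypothetical data, chosen for illustration).
INDEX = {"gran": 10, "grand": 17, "grana": 1, "grant": 3}


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed accuracy measure: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def hypothesized_suggest(index, word, count=1, accuracy=0.5,
                         only_more_popular=False):
    word_freq = index.get(word, 0)
    # Step 1: all indexed terms (except the searched word itself).
    candidates = [t for t in index if t != word]
    # Step 2: with onlyMorePopular=true, drop less frequent terms first.
    if only_more_popular:
        candidates = [t for t in candidates if index[t] >= word_freq]
    # Step 3: keep terms "within accuracy", sort by closeness, cut to count.
    # Ties are broken alphabetically (an assumption of this sketch).
    candidates = [t for t in candidates if similarity(word, t) >= accuracy]
    candidates.sort(key=lambda t: (-similarity(word, t), t))
    candidates = candidates[:count]
    # Step 4: with onlyMorePopular=false, drop less frequent terms afterwards.
    if not only_more_popular:
        candidates = [t for t in candidates if index[t] >= word_freq]
    # Step 5: whatever is left is returned.
    return candidates


print(hypothesized_suggest(INDEX, "gran", only_more_popular=True))
print(hypothesized_suggest(INDEX, "gran", only_more_popular=False))
```

With this toy index the first call returns ["grand"] and the second returns
an empty list: the rare term "grana" takes the single slot in step 3 and is
then removed in step 4, so "grand" never gets a chance to move up.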

2) It would be nice if one could get suggestion with lower frequency than
the checked word (which is, to me, what onlyMorePopular=false implies).


We could enhance spell checker to do that. But can you please explain your
use-case for limiting suggestions to tokens which have lesser frequency? The
goal of spell checker is to give suggestions of wrongly spelled words. It
was neither designed nor intended to give any other sort of query
suggestions.

An example would be the mentioned "grand turismo" (note that in the example
above I was searching for "gran" whereas now I am searching for "grand").
"gran" would not be returned as a suggestion because "grand" is more
frequent in the index. And yes, I know, returning a suggestion in this case
is only useful if there is more than one word in the search term. You
proposed to use KeywordTokenizer for this case, but a) I (again) was not
able to find any documentation for this and b) we are working on a
different solution for this case using stored search queries.

If you are interested, it works like this: For every word in the query, get
some spell checking suggestions. Combine these and find out if any of these
combinations has been searched for (successfully) before. Propose the one
with the highest (search) frequency. Looks promising so far, but the "gran
turismo" example won't work, since there are too many "grand"s in the index.
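
A rough Python sketch of the stored-queries idea described above. All names
and data here are hypothetical: the per-word suggestions stand in for
whatever the spell checker returns, the query log stands in for our stored
successful searches, and keeping the original word as one of the options is
an assumption of this sketch.

```python
from itertools import product

# Hypothetical per-word spell checker output (word -> suggestions).
SUGGESTIONS = {"gran": ["grand"], "tursimo": ["turismo"]}

# Hypothetical log of successful searches (query -> number of searches).
QUERY_LOG = {"gran turismo": 42, "grand tourism": 5}


def toy_suggest(word):
    """Stand-in for the real spell checker; returns a list of suggestions."""
    return SUGGESTIONS.get(word, [])


def suggest_query(query, suggest_word, query_log):
    words = query.split()
    # For each word: the word itself plus its spell checking suggestions.
    options = [[w] + list(suggest_word(w)) for w in words]
    best, best_count = None, 0
    # Try every combination and keep the one searched for most often.
    for combo in product(*options):
        candidate = " ".join(combo)
        if candidate == query:
            continue  # proposing the query itself is not a correction
        searches = query_log.get(candidate, 0)
        if searches > best_count:
            best, best_count = candidate, searches
    return best  # None if no combination was ever searched for


print(suggest_query("gran tursimo", toy_suggest, QUERY_LOG))
print(suggest_query("grand turismo", toy_suggest, QUERY_LOG))
```

The first call proposes "gran turismo". The second call returns None: as
long as the spell checker does not offer "gran" for "grand", the combination
"gran turismo" is never generated, which is exactly the limitation mentioned
above.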


Thanks,
Marcus
