Shalin Shekhar Mangar wrote:
If onlyMorePopular=true, then the algorithm finds tokens which have greater
frequency than the searched term. Among these terms, the one which is
closest (by edit distance) is returned.

Okay, this is a bit weird, but I think I got it now. Let me try to explain it using my example. When I search for "gran" (frequency 10) with onlyMorePopular=true, I get the suggestion "grand" (frequency 17). With onlyMorePopular=false there are no suggestions at all. This is because there are some (rare) terms which are closer to "gran" than "grand", but none of them are considered because their frequency is below 10. Is that correct? But then, why isn't "grand" promoted to first place and returned as a valid suggestion?


I think I now understand the source of the confusion. onlyMorePopular=true
is a special behavior which uses *only* those tokens which have higher
frequency than the searched term. onlyMorePopular=false just switches off
this special behavior. It does *not* limit suggestions to tokens which have
lesser frequency than the searched term. In fact, onlyMorePopular=false does
not use frequency of tokens at all. We should document this clearly to avoid
such confusions in the future.

I'm still missing the two parameters accuracy and spellcheck.count. Let me try to explain how I (now) think the algorithm works:

1) Take all terms from the index as a basic set.
2) If onlyMorePopular=true, remove all terms from the basic set which have a frequency below the frequency of the search term.
3) Sort the basic set by distance to the search term and keep the <spellcheck.count> terms with the smallest distance which are "within accuracy".
4) If onlyMorePopular=false, remove the terms which have a lower frequency than the search term.
5) Return the remaining terms as suggestions.
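
To make the five steps above concrete, here is a minimal Python sketch of the algorithm as I assume it works. It is not the actual Solr/Lucene code: the function names, the toy index, the plain Levenshtein distance, the reading of "within accuracy" as a normalized similarity threshold, and the alphabetical tie-breaking are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (assumed distance measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def suggest(term, index_freq, count=1, accuracy=0.5, only_more_popular=False):
    """Hypothetical model of the five steps; not the real implementation."""
    term_freq = index_freq.get(term, 0)

    # Step 1: take all terms from the index as the basic set.
    candidates = [t for t in index_freq if t != term]

    # Step 2: with onlyMorePopular=true, drop terms rarer than the search term.
    if only_more_popular:
        candidates = [t for t in candidates if index_freq[t] >= term_freq]

    # Step 3: keep the `count` closest terms that are "within accuracy"
    # (assumed here to mean a normalized similarity of at least `accuracy`).
    def similarity(t):
        return 1.0 - edit_distance(term, t) / max(len(term), len(t))

    candidates = [t for t in candidates if similarity(t) >= accuracy]
    candidates.sort(key=lambda t: (edit_distance(term, t), t))  # ties: alphabetical
    candidates = candidates[:count]

    # Step 4: with onlyMorePopular=false, drop terms rarer than the search term.
    if not only_more_popular:
        candidates = [t for t in candidates if index_freq[t] >= term_freq]

    # Step 5: return what is left.
    return candidates


# Toy index: "gram" is rare but sorts ahead of "grand" at the same distance.
index = {"gran": 10, "grand": 17, "gram": 2}
print(suggest("gran", index, count=1, only_more_popular=True))
print(suggest("gran", index, count=1, only_more_popular=False))
```

In this toy model the rare term "gram" takes the single slot in step 3 and is then thrown away in step 4, so "grand" never gets a chance. That is the behavior I would call a bug, if the real algorithm indeed works this way.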

Point 3 would explain why I do not get any suggestions for "gran" with onlyMorePopular=false. Nevertheless, I think this is a bug, since point 3 should take the frequency into account as well and promote suggestions with a high enough frequency if suggestions with a low frequency are removed.

But this is just my assumption about how the algorithm works, which would explain why there are no suggestions with onlyMorePopular=false. Maybe I am wrong, but somewhere in the process "grand" is removed from the result set.


2) It would be nice if one could get suggestions with lower frequency than
the checked word (which is, to me, what onlyMorePopular=false implies).

We could enhance spell checker to do that. But can you please explain your
use-case for limiting suggestions to tokens which have lesser frequency? The
goal of spell checker is to give suggestions of wrongly spelled words. It
was neither designed nor intended to give any other sort of query
suggestions.

An example would be the mentioned "grand turismo" (note that in the example above I was searching for "gran", whereas now I am searching for "grand"). "gran" would not be returned as a suggestion because "grand" is more frequent in the index. And yes, I know, returning a suggestion in this case is only useful if there is more than one word in the search term. You proposed using KeywordTokenizer for this case, but a) I (again) was not able to find any documentation for this and b) we are working on a different solution for this case using stored search queries.

If you are interested, it works like this: for every word in the query, get some spell checking suggestions. Combine these and find out whether any of the combinations has been searched for (successfully) before. Propose the one with the highest (search) frequency. Looks promising so far, but the "gran turismo" example won't work, since there are too many "grand"s in the index.
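
A rough Python sketch of the stored-query idea, to show where it breaks. Everything here is hypothetical: `suggest_query`, the `suggestions_for` callback (standing in for the per-word spell checker) and the `query_log` dictionary of successful past searches are illustration names only, and letting each word also stay as typed is an assumption on top of the description above.

```python
from itertools import product


def suggest_query(query, suggestions_for, query_log):
    """Return the most frequently searched combination of per-word
    suggestions, or None if no combination was searched before."""
    words = query.split()
    # Assumption: each word may stay as typed or be replaced by a suggestion.
    options = [[w] + list(suggestions_for(w)) for w in words]

    best, best_count = None, 0
    for combo in product(*options):
        candidate = " ".join(combo)
        if candidate == query:
            continue  # the query itself is not a correction
        searches = query_log.get(candidate, 0)
        if searches > best_count:
            best, best_count = candidate, searches
    return best


# Hypothetical log of successful searches and their counts.
query_log = {"gran turismo": 42}

# Spell checker that never proposes the rarer "gran" for "grand".
def no_rarer_terms(word):
    return []

# Spell checker that would also propose rarer terms.
def with_rarer_terms(word):
    return ["gran"] if word == "grand" else []

print(suggest_query("grand turismo", no_rarer_terms, query_log))
print(suggest_query("grand turismo", with_rarer_terms, query_log))
```

The first call finds nothing, because "gran" is never offered as a suggestion for "grand". The second call only works because the hypothetical spell checker is allowed to return a term with a lower frequency than the checked word.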

Thanks,
Marcus
