On Fri, Feb 13, 2009 at 8:46 PM, Marcus Stratmann <stratm...@gmx.de> wrote:

>
> Okay, this is a bit weird, but I think I got it now. Let me try to explain
> it using my example. When I search for "gran" (frequency 10) I get the
> suggestion "grand" (frequency 17) when using onlyMorePopular=true. When I
> use onlyMorePopular=false there are no suggestions at all. This is because
> there are some (rare) terms which are closer to "gran" than "grand", but
> all of them are not considered, because their frequency is below 10. Is that
> correct?


No. Think of onlyMorePopular as a toggle between whether to consider
frequency or not. When you say onlyMorePopular=true, higher frequency terms
are considered. When you say onlyMorePopular=false, frequency plays no role
at all and "gran" is returned because according to the spell checker, it
exists in the index and is therefore a correctly spelled term.
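To make the toggle concrete, here is a small sketch (not Solr's actual code; the names are illustrative) of how onlyMorePopular switches the frequency filter on or off:

```python
def candidate_terms(index_freqs, query, only_more_popular):
    """index_freqs: dict mapping indexed term -> document frequency."""
    query_freq = index_freqs.get(query, 0)
    if only_more_popular:
        # Only terms strictly more frequent than the query survive.
        return {t: f for t, f in index_freqs.items() if f > query_freq}
    # Frequency plays no role; every indexed term is a candidate,
    # including the query itself if it exists in the index.
    return dict(index_freqs)

freqs = {"gran": 10, "grand": 17, "grana": 2, "grans": 1}
print(candidate_terms(freqs, "gran", True))   # only "grand" survives
print(candidate_terms(freqs, "gran", False))  # all terms, "gran" included
```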


> I'm still missing the two parameters accuracy and spellcheck.count. Let me
> try to explain how I (now) think the algorithm works:
>
> 1) Take all terms from the index as a basic set.
> 2) If onlyMorePopular=true remove all terms from the basic set which have a
> frequency below the frequency of the search term.
> 3) Sort the basic set with respect to distance to the search term and keep
> the <spellcheck.count> terms with the smallest distance and which are
> "within accuracy".
> 4) Remove terms which have a lower frequency than the search term in the
> case onlyMorePopular=false.
> 5) Return the remaining terms as suggestions.
>
> Point 3 would explain why I do not get any suggestions for "gran" having
> onlyMorePopular=false. Nevertheless I think this is a bug since point 3
> should take into account the frequency as well and promote suggestions with
> high enough frequency if suggestions with low frequency are deleted.
>
> But this is just my assumption on how the algorithm works which explains
> why there are no suggestions using onlyMorePopular=false. Maybe I am wrong,
> but somewhere in the process "grand" is deleted from the result set.


Point #4 is incorrect. As I said earlier, when onlyMorePopular=false,
frequency information is not used and there is no filtering of tokens with
respect to frequency.

The implementation is a bit more complicated.

1. Read all tokens from the specified field in the Solr index.
2. Create n-grams of the terms read in #1 and index them into a separate
Lucene index (spellcheck index).
3. When asked for suggestions, create n-grams of the query terms, search the
spellcheck index and collect the top (by Lucene score) 10*spellcheck.count
results.
4. If onlyMorePopular=true, determine the frequency of each result in the
Solr index and remove terms which have a lower frequency than the query term.
5. Compute the edit distance between the result and the query token.
6. Return the top spellcheck.count results (sorted by edit distance,
descending) whose score is greater than the specified accuracy.
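The six steps above can be sketched roughly like this (illustrative only; the real code lives in Lucene's SpellChecker, and difflib's ratio stands in for both the Lucene n-gram score and the edit-distance ranking):

```python
import difflib

def ngrams(term, n=2):
    """Character n-grams of a term, e.g. 'gran' -> {'gr', 'ra', 'an'}."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def suggest(index_freqs, query, count=10, accuracy=0.5,
            only_more_popular=False):
    # Step 3: score every indexed term by n-gram overlap with the query
    # (a stand-in for the Lucene score) and keep the top 10*count.
    qgrams = ngrams(query)
    scored = sorted(index_freqs,
                    key=lambda t: len(qgrams & ngrams(t)),
                    reverse=True)[:10 * count]
    # Step 4: with onlyMorePopular=true, drop candidates that are not
    # strictly more frequent than the query term.
    if only_more_popular:
        qf = index_freqs.get(query, 0)
        scored = [t for t in scored if index_freqs[t] > qf]
    # Steps 5-6: rank by string similarity (difflib here instead of a
    # true edit distance) and keep only results above `accuracy`.
    sim = lambda t: difflib.SequenceMatcher(None, query, t).ratio()
    ranked = sorted(scored, key=sim, reverse=True)
    return [t for t in ranked if sim(t) > accuracy][:count]

freqs = {"gran": 10, "grand": 17, "grana": 2, "grans": 1}
print(suggest(freqs, "gran", only_more_popular=True))   # ['grand']
print(suggest(freqs, "gran", only_more_popular=False))  # 'gran' ranks first
```

Note how with onlyMorePopular=false, "gran" itself comes back as the best suggestion, which matches the behavior described above.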


> An example would be the mentioned "grand turismo" (regard that in the
> example above I was searching for "gran" whereas now I am searching for
> "grand"). "gran" would not be returned as a suggestion because "grand" is
> more frequent in the index. And yes, I know, returning a suggestion in this
> case will be only useful if there is more than one word in the search term.
> You proposed to use KeywordTokenizer for this case but a) I (again) was not
> able to find any documentation for this and b) we are working on a different
> solution for this case using stored search queries. If you are interested,
> it works like this: For every word in the query get some spell checking
> suggestions. Combine these and find out if any of these combinations has
> been searched for (successfully) before. Propose the one with the highest
> (search) frequency. Looks promising so far, but the "gran turismo" example
> won't work, since there are too many "grand"s in the index.


Your primary use-case is not spellcheck at all, but this might work with some
hacking. Fuzzy queries may be a better solution, as Walter said. Storing all
successful search queries may be hard to scale.
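For what it's worth, a toy sketch of the stored-queries idea from your mail (the query log and the per-word suggestion function here are made up for illustration) could look like:

```python
from itertools import product

def rewrite_query(words, suggestions_for, query_log):
    """words: query tokens; suggestions_for: token -> candidate spellings;
    query_log: past successful query string -> how often it was searched."""
    best, best_freq = None, 0
    # Combine per-word suggestions and keep the combination that was
    # searched (successfully) most often before.
    for combo in product(*(suggestions_for(w) for w in words)):
        candidate = " ".join(combo)
        freq = query_log.get(candidate, 0)
        if freq > best_freq:
            best, best_freq = candidate, freq
    return best

# Hypothetical data: "gran turismo" was a frequent successful query.
log = {"gran turismo": 42, "grand tourism": 3}
sugg = lambda w: {"grand": ["grand", "gran"],
                  "turismo": ["turismo"]}.get(w, [w])
print(rewrite_query(["grand", "turismo"], sugg, log))  # "gran turismo"
```

The scaling concern is real, though: the product of per-word suggestion lists grows quickly with query length, and the log itself can get large.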

-- 
Regards,
Shalin Shekhar Mangar.
