Hi, Yesterday, as a result of my afternoon hacking, I commited a first-pass of spelling suggestions into trunk. This is still far from finished and needs some work. I have received a lot of emails about this, so I will try to reply and sum up what I have done so far.
Spelling suggestions was an idea first worked on by Fredrik[1] in the past. His implementation had to build and use a separate index for storing all the stemmed words. This wasn't very efficient. My first implementation was very similar to his, but it used a FuzzyTermEnum to generate suggestions on the fly. FuzzyTermEnum uses Levenshtein distance to find similar words. This eliminated the need for maintaining a separate spelling index for each queryable. In this implementation suggestions were always generated for each query. Everything worked fine but I wanted to make sure there is no overhead and that queries won't take much longer than usual. Profiling showed that generating suggestions for relatively small indexes doesn't make much of a difference (2%-8% of the whole query time). However, to my dismay larger indexes proved to be a problem (45% of the whole query time on FSQ! Whoa!). That is why I chose another implementation, similar to how snippets work - spelling suggestions only on-demand. Now you can request suggestions only when you need them, by sending a SuggestionsReques which is followed by SuggestionsResponse reply. In my personal point of view this is much more efficient and doesn't affect the query responsiveness. There are still some things that need to be done, before we can all celebrate the "Did you mean to search for..." stuff. :-) * Lucene only stores stemmed forms of the words (beagle becomes beagl) We have to figure out a way to unstem the word: 1.) Hack the analyzer to get the unstemmed word 2.) Traverse through our TextCache and find a word which which contains the stem part. This is what I'll be looking into today/tomorrow. * Rank the suggestions We need to only return the highest relevant suggestions, based on: 1.) Term frequency in index 2.) Levenshtein distance score * Suggestions limiter I've currently limited the suggestions to those that have the same first character. This is based on the assumption that most likely the first letter the user types will be correct. I'm not sure if this is correct, but it does generate better results. * Multi-term queries We also need to look into handling multi-term queries, where suggestions should contain the whole phrase and not just the individual words. Sorry, for the exhausting email and lets make Beagle rock! :-) Best, Lukas _______________________________________________ Dashboard-hackers mailing list Dashboard-hackers@gnome.org http://mail.gnome.org/mailman/listinfo/dashboard-hackers