David, Perhaps I misunderstand somehting so please correct me if I do. I used http://www.searchmorph.com/kat/spell.jsp to look for conts without changing any of the default values. What I got as results did not include 'const' which has quite a high frequency in your index and should have a pretty low levenshtein distance. Any idea what causes this behavior?
cheers, Aad -----Original Message----- From: David Spencer [mailto:[EMAIL PROTECTED] Sent: Tuesday, 14 September, 2004 21:23 To: Lucene Users List Subject: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene Andrzej Bialecki wrote: > David Spencer wrote: > >> >> I can/should send the code out. The logic is that for any terms in a >> query that have zero matches, go thru all the terms(!) and calculate >> the Levenshtein string distance, and return the best matches. A more >> intelligent way of doing this is to instead look for terms that also >> match on the 1st "n" (prob 3) chars. > > > ...or prepare in advance a fast lookup index - split all existing > terms > to bi- or trigrams, create a separate lookup index, and then simply for > each term ask a phrase query (phrase = all n-grams from an input term), > with a slop > 0, to get similar existing terms. This should be fast, and > you could provide a "did you mean" function too... > Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. Let's see. [1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful. [2] Here's a demo page. I built an ngram index for ngrams of length 3 and 4 based on the existing index I have of approx 100k javadoc-generated pages. You type in a misspelled word like "recursixe" or whatnot to see what suggestions it returns. Note this is not a normal search index query -- rather this is a test page for spelling corrections. http://www.searchmorph.com/kat/spell.jsp [3] Here's the javadoc: http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGra mSpeller.html [4] Here's source in HTML: http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/s pell/NGramSpeller.html#line.152 [5] A few more details: Based on a subsequent mail in this thread I set boosts for the words in the ngram index. The background is each word (er..term for a given field) in the orig index is a separate Document in the ngram index. This Doc contains all ngrams (in my test case, like #2 above, of length 3 and 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so that more frequently words will tend to be suggested more often. I think in "plain" English then the way a word is suggested as a spelling correction is: - frequently occuring words score higher - words that share more ngrams with the orig word score higher - words that share rare ngrams with the orig word score higher [6] If people want to vote me in as a committer to the sandbox then I can check this code in - though again, I'd appreciate feedback. thx, Dave --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]