[ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574214#action_12574214 ]
Mathieu Lecarme commented on LUCENE-1190:
-----------------------------------------

With a FuzzyQuery, for example, you iterate over every Term in the index looking for the nearest one; PrefixQuery and regular-expression queries work in a similar way. If you decide that fuzzy querying will never return a word whose length differs by more than 1 (size+1 or size-1), you can restrict the list of candidates, and an ngram index can narrow it further.

Some token filters destroy the word, a stemmer for example. If you want to search widely, a stemmer can help you, but you can't use a PrefixQuery on a stemmed word. Instead, you can stem the word in a lexicon and use the stem as a synonym: you index "dog" and it matches searches for "doggy", "dogs" and "dog". The Lexicon can be built from a static word list (a hunspell dictionary, a Wikipedia parse) or from the words extracted from your own index.

For the word "Lucene":

    word:lucene
    pop:42
    anagram.anagram:celnu
    aphone.start:LS
    aphone.gram:LS
    aphone.gram:SN
    aphone.end:SN
    aphone.size:3
    aphone.phonem:LSN
    ngram.start:lu
    ngram.gram:lu
    ngram.gram:uc
    ngram.gram:ce
    ngram.gram:en
    ngram.gram:ne
    ngram.end:ne
    ngram.size:6
    stemmer.stem:lucen

Yes.

M.

> a lexicon object for merging spellchecker and synonyms from stemming
> --------------------------------------------------------------------
>
>                 Key: LUCENE-1190
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1190
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*, Search
>    Affects Versions: 2.3
>            Reporter: Mathieu Lecarme
>         Attachments: aphone+lexicon.patch, aphone+lexicon.patch
>
>
> Some Lucene features need a list of reference words. Spellchecking is the
> basic example, but synonyms are another use. Other tools can be used more
> smoothly with a list of words, without disturbing the main index: stemming
> and other word simplifications (anagram, phonetic ...).
> For that, I suggest a Lexicon object, which contains words (Term + frequency)
> and which can be built from a Lucene Directory or from plain text files.
> Classical TokenFilters can be used with the Lexicon (LowerCaseFilter and
> ISOLatin1AccentFilter should be the most useful).
> Lexicon uses a Lucene Directory; each word is a Document, and each piece of
> metadata is a Field (word, ngram, phonetic, fields, anagram, size ...).
> Above a minimum size, the number of different words used in an index can be
> considered stable, so a standard Lexicon (built from Wikipedia, for
> example) can be used.
> A SimilarTokenFilter is provided.
> A spellchecker will come soon.
> A fuzzySearch implementation and a neutral synonym TokenFilter can be done.
> Unused words can be removed on demand (lazy delete?)
> Any criticism or suggestions?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
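The ngram decomposition shown for "Lucene" in the comment above can be sketched in plain Java. This is a hypothetical helper, not part of the attached patch: it reproduces the ngram.start / ngram.gram / ngram.end / ngram.size values from the example, plus the length test that restricts fuzzy candidates to size+1 or size-1.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {

    // Decompose a word into overlapping bigrams, e.g. "lucene" ->
    // [lu, uc, ce, en, ne]. These are the ngram.gram values; the first
    // and last double as ngram.start and ngram.end.
    static List<String> bigrams(String word) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    // Length restriction from the comment: a fuzzy candidate is only
    // worth scoring if its length differs from the query's by at most 1.
    static boolean lengthCandidate(String query, String term) {
        return Math.abs(query.length() - term.length()) <= 1;
    }

    public static void main(String[] args) {
        String word = "lucene";
        List<String> grams = bigrams(word);
        System.out.println("ngram.start:" + grams.get(0));
        for (String g : grams) {
            System.out.println("ngram.gram:" + g);
        }
        System.out.println("ngram.end:" + grams.get(grams.size() - 1));
        System.out.println("ngram.size:" + word.length());
    }
}
```

Candidates sharing few bigrams with the query (and failing the length test) can then be skipped before any expensive edit-distance computation.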