a lexicon object for merging spellchecker and synonyms from stemming
--------------------------------------------------------------------

                 Key: LUCENE-1190
                 URL: https://issues.apache.org/jira/browse/LUCENE-1190
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/*, Search
    Affects Versions: 2.3
            Reporter: Mathieu Lecarme
         Attachments: aphone+lexicon.patch

Some Lucene features need a reference list of words. Spellchecking is the basic 
example, but synonyms are another use. Other tools also work more smoothly with 
a list of words kept apart from the main index: stemming and other 
simplifications of words (anagram, phonetic ...).
For that, I suggest a Lexicon object, which contains words (Term + frequency) 
and which can be built from a Lucene Directory or from plain text files.
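
To make the "Term + frequency" idea concrete, here is a minimal in-memory 
sketch of building such a word list from plain text. It is not the code from 
the attached patch: the class name LexiconSketch and its methods are 
hypothetical, and a regex split stands in for a real Analyzer.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the Lexicon class from the patch.
final class LexiconSketch {
    // word -> number of occurrences seen while building the lexicon
    private final Map<String, Integer> frequencies = new TreeMap<>();

    /** Adds every word of a plain-text line, lowercased, to the lexicon. */
    void addText(String text) {
        // Split on anything that is not a letter (stand-in for an Analyzer).
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            frequencies.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
    }

    /** Returns how often the word was seen, or 0 if it is unknown. */
    int frequency(String word) {
        return frequencies.getOrDefault(word.toLowerCase(Locale.ROOT), 0);
    }

    /** Returns the number of distinct words. */
    int size() {
        return frequencies.size();
    }

    public static void main(String[] args) {
        LexiconSketch lexicon = new LexiconSketch();
        lexicon.addText("The quick brown fox");
        lexicon.addText("the lazy dog");
        System.out.println("frequency(the) = " + lexicon.frequency("the"));
        System.out.println("distinct words = " + lexicon.size());
    }
}
```

A Directory-backed version would presumably store the same pairs as indexed 
Documents instead of map entries.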
Classical TokenFilters can be used with the Lexicon (LowerCaseFilter and 
ISOLatin1AccentFilter should be the most useful).
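
As a rough picture of what those two filters do to a word before it enters 
the lexicon, the sketch below lowercases and strips accents using only the 
JDK. It is an approximation under stated assumptions: FoldingSketch and fold 
are hypothetical names, and Unicode decomposition does not cover every 
character that ISOLatin1AccentFilter maps (ligatures, for instance).

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical stand-in for LowerCaseFilter + ISOLatin1AccentFilter.
final class FoldingSketch {
    /** Lowercases the word and removes combining accent marks. */
    static String fold(String word) {
        // NFD splits an accented letter into base letter + combining mark.
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        // Drop the combining marks, then lowercase what is left.
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "\u00c9t\u00e9" is "Été" written with Unicode escapes.
        System.out.println(fold("\u00c9t\u00e9"));
    }
}
```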
The Lexicon uses a Lucene Directory: each Word is a Document, and each piece of 
metadata is a Field (word, ngram, phonetic, fields, anagram, size ...).
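
To illustrate the kind of per-word metadata listed above, here is a small 
sketch that derives the ngram, anagram and size values for one word. The 
class and method names are hypothetical and nothing here is taken from the 
patch; the phonetic field is left out because its algorithm is not described 
in this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the values that could be stored as Fields.
final class WordMetaSketch {
    /** All contiguous character n-grams of the word, in order. */
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /** Letters sorted alphabetically, so that anagrams share one key. */
    static String anagramKey(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    public static void main(String[] args) {
        String word = "lucene";
        System.out.println("word    = " + word);
        System.out.println("ngram   = " + ngrams(word, 3));
        System.out.println("anagram = " + anagramKey(word));
        System.out.println("size    = " + word.length());
    }
}
```

With such keys indexed, looking up words that share n-grams or an anagram key 
becomes an ordinary term query against the lexicon.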
Above a minimum size, the number of different words used in an index can be 
considered stable. So a standard Lexicon (built from Wikipedia, for example) 
can be used.
A similarTokenFilter is provided.
A spellchecker will come soon.
A fuzzySearch implementation and a neutral synonym TokenFilter could also be 
built.
Unused words can be removed on demand (lazy delete?).

Any criticism or suggestions?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

