Re: [jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

Mathieu Lecarme Sun, 02 Mar 2008 05:16:59 -0800

Hmm, the quotes and questions disappeared.

On 2 March 2008, at 13:32, Mathieu Lecarme (JIRA) wrote:

[ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574214#action_12574214 ]
Mathieu Lecarme commented on LUCENE-1190:
-----------------------------------------

>> For example, I don't know what you mean by "Some Lucene features need a list of referring word". Do you mean "a list of associated words"?

With a FuzzyQuery, for example, you iterate over the Terms in the index,
looking for the nearest ones. PrefixQuery and regular-expression queries
work in a similar way.
If you decide that a fuzzy query will never return a word whose length
differs by more than 1 (size + 1 or size - 1), you can restrict the list
of candidates, and an ngram index can narrow it further.
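
As a rough illustration of that restriction (plain Python, not Lucene
code; the function names and the choice of bigrams are my assumptions for
the sketch), candidates are first filtered by length and shared bigrams,
and only the survivors are ranked by edit distance:

```python
def bigrams(word):
    """Return the set of character bigrams of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def fuzzy_candidates(query, lexicon, max_len_diff=1):
    """Keep words whose length is within max_len_diff of the query and
    that share at least one bigram with it, then rank by edit distance."""
    grams = bigrams(query)
    kept = [w for w in lexicon
            if abs(len(w) - len(query)) <= max_len_diff and grams & bigrams(w)]
    return sorted(kept, key=lambda w: (edit_distance(query, w), w))
```

The expensive edit-distance step only runs on the words that pass the
cheap length and bigram checks, which is the point of keeping size and
ngram fields in the lexicon.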

Some token filters destroy the word, a stemmer for example. If you want
to search widely, a stemmer can help, but you can't use PrefixQuery with
stemmed words. So you can stem the words in a lexicon and use the result
as a source of synonyms: you index "dog" and look for "doggy", "dogs" and
"dog".
A Lexicon can use a static list of words, from a hunspell index or from
Wikipedia parsing, or words extracted from your own index.
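
A minimal sketch of that idea, in plain Python rather than against the
Lucene API: the lexicon groups surface forms by stem, so a query word can
be expanded to its siblings while the main index keeps unstemmed terms.
The `toy_stem` function is a hypothetical stand-in for a real stemmer and
only handles the "dog" example.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips two suffixes."""
    for suffix in ("gy", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def build_stem_index(words, stem=toy_stem):
    """Map each stem to the set of lexicon words that share it."""
    index = defaultdict(set)
    for w in words:
        index[stem(w)].add(w)
    return index


def expand(word, index, stem=toy_stem):
    """Return the lexicon words sharing the stem of `word`, sorted."""
    return sorted(index.get(stem(word), {word}))
```

With a lexicon containing "dog", "dogs" and "doggy", `expand("dog", ...)`
returns all three, which can then be searched as synonyms.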

>> Each meta is a Field.... what do you mean by that? Could you please give an example?

For the word "Lucene":

word:lucene
pop:42
anagram.anagram:celnu
aphone.start:LS
aphone.gram:LS
aphone.gram:SN
aphone.end:SN
aphone.size:3
aphone.phonem:LSN
ngram.start:lu
ngram.gram:lu
ngram.gram:uc
ngram.gram:ce
ngram.gram:en
ngram.gram:ne
ngram.end:ne
ngram.size:6
stemmer.stem:lucen
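
To show how the word, ngram.* and size entries above can be derived, here
is a small plain-Python sketch. It only covers the bigram fields; the
anagram, aphone and stemmer fields depend on algorithms in the patch and
are not reproduced. The function name `ngram_fields` is an assumption for
the example, not something from the patch.

```python
def ngram_fields(word, n=2):
    """Return (field, value) pairs for a word and its character n-grams,
    using the field names from the "Lucene" listing."""
    w = word.lower()
    grams = [w[i:i + n] for i in range(len(w) - n + 1)]
    fields = [("word", w)]
    if grams:
        fields.append(("ngram.start", grams[0]))
        fields.extend(("ngram.gram", g) for g in grams)
        fields.append(("ngram.end", grams[-1]))
    fields.append(("ngram.size", str(len(w))))
    return fields


for name, value in ngram_fields("Lucene"):
    print(f"{name}:{value}")
```

Each pair would become one Field of the Document that represents the
word in the lexicon Directory.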

>> Hm, not sure I know what you mean. Are you saying that once you create a sufficiently large lexicon/dictionary/index, the number of new terms starts decreasing? (Heap's Law? http://en.wikipedia.org/wiki/Heaps'_law)

Yes.
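
Heaps' law models the number of distinct terms V after n tokens as
V(n) = K * n^beta with 0 < beta < 1, so new terms keep appearing but more
and more slowly. One way to check this on your own data is to sample the
vocabulary size while reading tokens. This is a plain-Python sketch, and
the name `vocabulary_growth` and the `step` parameter are assumptions for
the example:

```python
def vocabulary_growth(tokens, step=1000):
    """Return (tokens_seen, distinct_terms) samples, one every `step` tokens."""
    seen = set()
    samples = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            samples.append((i, len(seen)))
    return samples
```

If the second value of successive samples grows much more slowly than the
first, a lexicon built from a large enough corpus should cover most of
the words a new index will contain.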
a lexicon object for merging spellchecker and synonyms from stemming
--------------------------------------------------------------------

               Key: LUCENE-1190
               URL: https://issues.apache.org/jira/browse/LUCENE-1190
           Project: Lucene - Java
        Issue Type: New Feature
        Components: contrib/*, Search
  Affects Versions: 2.3
          Reporter: Mathieu Lecarme
       Attachments: aphone+lexicon.patch, aphone+lexicon.patch
Some Lucene features need a list of referring words. Spellchecking is the
basic example, but synonyms are another use. Other tools can be used more
smoothly with a list of words, without disturbing the main index: stemming
and other simplifications of words (anagram, phonetic ...).

For that, I suggest a Lexicon object, which contains words (Term +
frequency) and which can be built from a Lucene Directory or from plain
text files.
Classical TokenFilters can be used with Lexicon (LowerCaseFilter and
ISOLatin1AccentFilter should be the most useful).
Lexicon uses a Lucene Directory: each Word is a Document, each meta is a
Field (word, ngram, phonetic, fields, anagram, size ...).
Above a minimum size, the number of different words used in an index can
be considered stable. So a standard Lexicon (built from Wikipedia, for
example) can be used.
A similar TokenFilter is provided.
A spellchecker will come soon.
A fuzzySearch implementation and a neutral synonym TokenFilter can be done.
Unused words can be removed on demand (lazy delete?).
Any criticism or suggestions?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


