I'll slice my contribution into small parts:

1) Synonym (a Token plus a weight)
2) Synonym provider built from the OO.o thesaurus
3) SynonymTokenFilter
4) Query expander which applies a filter (and a boost) to each of its TermQuery instances
5) A Synonym filter for the query expander
6) For efficiency, a Synonym can be excluded if it doesn't exist in the index.
7) Stemming can be used as a dynamic Synonym
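
Items 1, 2, 4 and 6 could be sketched roughly as below. This is plain Java without any Lucene types, and every name in it (Synonym, SynonymProvider, MapSynonymProvider, SynonymExpansionSketch) is hypothetical, not the code from my contribution. The index vocabulary is assumed to be available as a simple Set of terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Item 1: a synonym is a token plus a weight. */
final class Synonym {
    final String token;
    final float weight;
    Synonym(String token, float weight) { this.token = token; this.weight = weight; }
    @Override public String toString() { return token + "^" + weight; }
}

/** Item 2 (hypothetical interface): maps a token to its weighted synonyms. */
interface SynonymProvider {
    List<Synonym> synonymsOf(String token);
}

/** In-memory provider; a real one might be loaded from a thesaurus file. */
final class MapSynonymProvider implements SynonymProvider {
    private final Map<String, List<Synonym>> entries;
    MapSynonymProvider(Map<String, List<Synonym>> entries) { this.entries = entries; }
    @Override public List<Synonym> synonymsOf(String token) {
        return entries.getOrDefault(token, List.of());
    }
}

final class SynonymExpansionSketch {
    /**
     * Items 4 and 6: expand one query term into boosted alternatives.
     * The original term keeps weight 1.0; synonyms that are absent from
     * the index vocabulary are dropped.
     */
    static List<Synonym> expand(String term, SynonymProvider provider, Set<String> indexedTerms) {
        List<Synonym> out = new ArrayList<>();
        out.add(new Synonym(term, 1.0f));
        for (Synonym s : provider.synonymsOf(term)) {
            if (indexedTerms.contains(s.token)) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SynonymProvider provider = new MapSynonymProvider(Map.of(
            "car", List.of(new Synonym("automobile", 0.8f), new Synonym("motorcar", 0.5f))));
        System.out.println(expand("car", provider, Set.of("car", "automobile")));
    }
}
```

In a real expander each returned entry would become one boosted clause of the rewritten query; the weights above are made up.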

Spell checking or the "do you mean?" pattern
1) The main concept is in the SpellCheck contrib, but its implementation is not extensible.
2) In some languages, like French, homophony plays a big part in misspelling: "there is more than one way to write it".
3) Homophony rules are provided by Aspell in a neutral rule language (just like Snowball for stemming). I implemented a translator that builds Java classes from an Aspell file (the same format is used by Aspell's successors, MySpell and Hunspell, which are used in OO.o and the Firefox family).
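
To illustrate the homophony idea only: the sketch below reduces a word to a sound key with a few ordered rewrite rules, then proposes lexicon words sharing that key. The three rules are hand-written toy rules for French, not taken from any Aspell file and not in Aspell's format, and the class name HomophoneKeySketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class HomophoneKeySketch {
    // Toy rules, applied in insertion order (longest pattern first).
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("eau", "o");
        RULES.put("au", "o");
        RULES.put("ph", "f");
    }

    /** Reduces a word to a rough sound key. */
    static String key(String word) {
        String k = word.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            k = k.replace(rule.getKey(), rule.getValue());
        }
        // Collapse runs of the same letter ("homme" and "home" share a key).
        return k.replaceAll("(.)\\1+", "$1");
    }

    /** Returns lexicon words that sound like the given word, excluding the word itself. */
    static List<String> soundAlikes(String word, List<String> lexicon) {
        String target = key(word);
        List<String> out = new ArrayList<>();
        for (String candidate : lexicon) {
            if (!candidate.equals(word) && key(candidate).equals(target)) {
                out.add(candidate);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(soundAlikes("bato", List.of("bateau", "chateau")));
    }
}
```

With these rules a misspelled "bato" and the correct "bateau" reduce to the same key, which is the kind of match a "do you mean?" suggestion needs.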

Storing information about words found in an index
1) It's the Dictionary used in the SpellCheck contrib, in a more open form: a lexicon. It's a plain old Lucene index: each word becomes a Document, and Fields store computed information such as size, n-gram tokens and homophony. Everything uses filters derived from TokenFilter, so code duplication is avoided.
2) This information may be out of sync with the index, so as not to slow down indexing. Some information therefore needs to be checked late (does this synonym actually exist in the index?), and the lexicon can be corrected on the fly (if the synonym doesn't exist, record that in the lexicon for next time). There is some work to do here to find the best and fastest way to keep the index and the lexicon synchronized (hard link, log for nightly replay, complete iteration over the index to find deleted and new entries, ...).
3) Similar words (more than just Synonyms) and Near (misspelled) words use the Lexicon.
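
A rough sketch of points 1 and 2, again in plain Java rather than as a Lucene index: each word gets an entry with computed fields (size and n-grams here), and the "late check" against the index is remembered so the same missing word is not looked up twice. LexiconEntry, LexiconSketch and the Predicate standing in for an index lookup are all hypothetical names, and the gram size of 3 is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** One lexicon entry: a word plus information computed from it. */
final class LexiconEntry {
    final String word;
    final int size;
    final List<String> grams;

    LexiconEntry(String word, int gramSize) {
        this.word = word;
        this.size = word.length();
        this.grams = ngrams(word, gramSize);
    }

    /** All contiguous substrings of length n, in order. */
    static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }
}

final class LexiconSketch {
    private final Map<String, LexiconEntry> entries = new HashMap<>();
    private final Set<String> missingFromIndex = new HashSet<>();

    void add(String word) { entries.put(word, new LexiconEntry(word, 3)); }

    LexiconEntry get(String word) { return entries.get(word); }

    /**
     * Late check: ask the index only when the lexicon has no answer yet,
     * and record a negative answer for next time (on-the-fly correction).
     */
    boolean existsInIndex(String word, Predicate<String> indexHasTerm) {
        if (missingFromIndex.contains(word)) {
            return false;
        }
        if (indexHasTerm.test(word)) {
            return true;
        }
        missingFromIndex.add(word);
        return false;
    }
}
```

The sketch never forgets a negative answer, so it says nothing about the harder part, which is re-synchronizing once the index changes.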

Extending it
1) The Lexicon can be used to store nouns, i.e. words that work better together, like "New York", "Apple II" or "Alexander the Great". Extracting nouns from a thesaurus is very hard, but Wikipedia contributors have done part of the work: article titles can be a good start for building a noun list. And it works in many languages. A noun can be used as an intuitive PhraseQuery, or as a suggestion for refining results.
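
One possible way to use such a noun list at query time is a greedy longest match over the query tokens, sketched below. This is an assumption about how it could be done, not a description of existing code; NounGrouperSketch is a hypothetical name, and the noun list is assumed to be a Set of lowercase, space-joined entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class NounGrouperSketch {
    /**
     * Groups consecutive tokens that form a known multi-word noun,
     * trying the longest candidate first at each position.
     */
    static List<String> group(List<String> tokens, Set<String> nouns, int maxWords) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 1; // default: the token stands alone
            for (int len = Math.min(maxWords, tokens.size() - i); len > 1; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (nouns.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            out.add(String.join(" ", tokens.subList(i, i + matched)));
            i += matched;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(group(List.of("hotels", "in", "new", "york"), Set.of("new york"), 3));
    }
}
```

Each multi-word group in the result is a candidate for a phrase clause, and single tokens stay ordinary terms.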

Implementing it well in Lucene
The SpellCheck and WordNet contribs do part of this, but in a specific and non-extensible way. I think it would be better if the foundation were reviewed by the Lucene maintainers, with contribs then built on top of that foundation.


Otis Gospodnetic wrote:
Grant, I think Mathieu is hinting at his JIRA contribution (I looked at it 
briefly the other day, but haven't had the chance to really understand it).

Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Mathieu Lecarme <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Wednesday, March 12, 2008 5:47:40 AM
Subject: an API for synonym in Lucene-core

Why doesn't Lucene have a clean synonym API?
The WordNet contrib is not an answer: it provides an interface for its own needs, and most of the world doesn't speak English. Compass provides a tool, just like Solr. Lucene is the framework for applications like Solr, Nutch or Compass, so why not backport the low-level features of these projects? A synonym API should provide a TokenFilter, an abstract storage that maps a token to similar tokens with weights, and a tool for expanding queries. The OpenOffice dictionary project can provide data in different languages, with compatible licences, I presume.
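
The TokenFilter part of such an API could behave roughly as follows. This is a plain-Java sketch with no Lucene classes: Tok and SynonymFilterSketch are hypothetical names, and the storage is assumed to be a simple nested Map. Synonyms are emitted with a position increment of 0 so that they sit at the same position as the original token, which mirrors Lucene's position increment convention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SynonymFilterSketch {
    /** Output token: term text, position increment and weight. */
    static final class Tok {
        final String term;
        final int positionIncrement;
        final float weight;
        Tok(String term, int positionIncrement, float weight) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.weight = weight;
        }
        @Override public String toString() {
            return term + "(+" + positionIncrement + ", " + weight + ")";
        }
    }

    /**
     * Emits each input token with increment 1, followed by its similar
     * tokens with increment 0 (same position) and their stored weight.
     */
    static List<Tok> filter(List<String> tokens, Map<String, Map<String, Float>> storage) {
        List<Tok> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(new Tok(t, 1, 1.0f));
            for (Map.Entry<String, Float> e : storage.getOrDefault(t, Map.of()).entrySet()) {
                out.add(new Tok(e.getKey(), 0, e.getValue()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(filter(List.of("fast", "car"), Map.of("car", Map.of("automobile", 0.8f))));
    }
}
```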


To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
