Re: an API for synonym in Lucene-core

Mathieu Lecarme Thu, 13 Mar 2008 02:53:14 -0700

I'll slice my contrib into small parts:

Synonyms
1) Synonym (Token + a weight)
2) Synonym provider from OO.o thesaurus
3) SynonymTokenFilter
4) Query expander which applies a filter (and a boost) to each of its TermQuery
5) a Synonym filter for the query expander
6) to be efficient, a Synonym can be excluded if it doesn't exist in the index.
7) Stemming can be used as a dynamic Synonym
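
As a rough illustration of items 1, 2 and 4 to 6 above, here is a minimal plain-Java sketch. It deliberately avoids the Lucene classes: `SynonymSketch`, `Synonym`, `SynonymProvider`, `fromMap` and `expand` are hypothetical names chosen for this example, not the API of the contrib, and the set of indexed terms stands in for a lookup against a real index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SynonymSketch {

    /** Item 1: a synonym is a term plus a weight (usable as a query boost). */
    static final class Synonym {
        final String term;
        final float weight;

        Synonym(String term, float weight) {
            this.term = term;
            this.weight = weight;
        }
    }

    /** Item 2: abstract storage mapping a term to its weighted synonyms. */
    interface SynonymProvider {
        List<Synonym> synonymsOf(String term);
    }

    /** In-memory provider; a real one might be backed by a thesaurus file. */
    static SynonymProvider fromMap(final Map<String, List<Synonym>> data) {
        return new SynonymProvider() {
            public List<Synonym> synonymsOf(String term) {
                List<Synonym> found = data.get(term);
                return found == null ? new ArrayList<Synonym>() : found;
            }
        };
    }

    /**
     * Items 4 to 6: expand one query term into itself (weight 1.0) plus its
     * synonyms, skipping synonyms that never occur in the index.
     */
    static List<Synonym> expand(String term, SynonymProvider provider,
                                Set<String> indexedTerms) {
        List<Synonym> expanded = new ArrayList<Synonym>();
        expanded.add(new Synonym(term, 1.0f));
        for (Synonym candidate : provider.synonymsOf(term)) {
            if (indexedTerms.contains(candidate.term)) {
                expanded.add(candidate);
            }
        }
        return expanded;
    }
}
```

In a real query expander each returned entry would presumably become one boosted TermQuery clause; that wiring is left out here.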


Spell checking or the "did you mean?" pattern

1) The main concept is in the SpellCheck contrib, but in a non-extensible implementation.
2) In some languages, like French, homophony plays a big part in misspelling: "there is more than one way to write it".
3) Homophony rules are provided by Aspell in a neutral language (just like Snowball for stemming). I implemented a translator that builds a Java class from an aspell file (the same format is used by aspell's successors, myspell and hunspell, which are used in OO.o and the Firefox family).

https://issues.apache.org/jira/browse/LUCENE-956

Storing information about words found in an index

1) It's the Dictionary used in the SpellCheck contrib, in a more open way: a lexicon. It's a plain old Lucene index: each word becomes a Document, and Fields store computed information like size, n-gram tokens and homophony. Everything uses filters taken from TokenFilter, so code duplication is avoided.
2) This information may be out of sync with the index, in order not to slow down the indexing process, so some information needs to be checked late (does this synonym already exist in the index?), and the lexicon can be corrected on the fly (if the synonym doesn't exist, write it in the lexicon for the next time). There is some work to do here to find the best and fastest way to keep information synchronized between index and lexicon (hard link, log for nightly replay, complete iteration over the index to find deleted and new stuff ...).
3) Similar (more than only Synonym) and Near (misspelled) words use the Lexicon.
https://issues.apache.org/jira/browse/LUCENE-1190
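
To make the "word becomes a Document" idea more concrete, here is a small plain-Java sketch of the computed fields for one lexicon entry. It is only an assumption about what such an entry could look like: the class name `LexiconEntrySketch`, the field names `word`, `size` and `gram3`, and the choice of 3-grams are made up for the example, and the homophony field is omitted because it depends on the aspell rules.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LexiconEntrySketch {

    /** All contiguous character n-grams of the word, in order. */
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /**
     * Computed fields for one word. In the lexicon each map entry would be
     * stored as a Field of the Document representing the word.
     */
    static Map<String, Object> entryFor(String word) {
        Map<String, Object> fields = new LinkedHashMap<String, Object>();
        fields.put("word", word);
        fields.put("size", word.length());
        fields.put("gram3", ngrams(word, 3));
        return fields;
    }
}
```

Near (misspelled) words can then be looked up by querying on shared n-grams, which is the general approach of n-gram based spell checking.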

Extending it

1) The Lexicon can be used to store Nouns, i.e. words that work better together, like "New York", "Apple II" or "Alexander the Great". Extracting nouns from a thesaurus is very hard, but the Wikipedia people have done part of the work: article titles can be a good start for building a noun list. And it works in many languages. A Noun can be used as an intuitive PhraseQuery, or as a suggestion for refining results.
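
One possible way to use such a noun list at query time is a greedy longest-match pass over the query tokens, sketched below in plain Java. The class `NounPhraseSketch`, the method `group` and the `maxWords` parameter are hypothetical names for this example; the noun set stands in for a lookup in the lexicon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class NounPhraseSketch {

    /**
     * Groups consecutive tokens that form a known noun (multi-word name)
     * into one phrase, trying the longest candidate first. Tokens that are
     * not part of any known noun are returned unchanged.
     */
    static List<String> group(List<String> tokens, Set<String> nouns, int maxWords) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 1; // default: the token stands alone
            int longest = Math.min(maxWords, tokens.size() - i);
            for (int len = longest; len > 1; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (nouns.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            out.add(String.join(" ", tokens.subList(i, i + matched)));
            i += matched;
        }
        return out;
    }
}
```

Each multi-word element of the result could then be turned into a PhraseQuery and each single word into a TermQuery.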


Implementing it well in Lucene

The SpellCheck and WordNet contribs do part of it, but in a specific and non-extensible way. I think it's better if the foundation is reviewed by the Lucene maintainers first, and the contribs are then built on top of that foundation.


M.


Otis Gospodnetic wrote:

Grant, I think Mathieu is hinting at his JIRA contribution (I looked at it 
briefly the other day, but haven't had the chance to really understand it).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Mathieu Lecarme <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Wednesday, March 12, 2008 5:47:40 AM
Subject: an API for synonym in Lucene-core

Why doesn't Lucene have a clean synonym API?
The WordNet contrib is not an answer: it provides an interface for its own needs, and most of the world doesn't speak English.
Compass provides a tool, just like Solr. Lucene is the framework for applications like Solr, Nutch or Compass, so why not backport the low-level features of these projects?
A synonym API should provide a TokenFilter, an abstract storage that maps token -> similar tokens with weight, and a tool for expanding queries.
The OpenOffice dictionary project can provide data in different languages, with compatible licences, I presume.
M.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




