what would be the difference/benefit compared to standard lucene SpellChecker? 

If I I am not wrong:
- Lucene SpellChecker uses standard lucene index  as a storage for tokens 
instead of QDBM... meaning full inverted index with arbitrary N-grams length, 
with tf/idf/norms... not only HashMap<trigram, wordList> 

- SC uses paradigm "give me  N Best candidates (similarity)", not only "all 
above cutoff"... this Similarity depends (standard lucene Similarity) on N-Gram 
frequency, (one could even use some sexy norms to fine tune words...)...  

If I've read your proposal correctly and did not miss something important, my 
suggestion would be to have a look at lucene SC 
(http://lucene.apache.org/java/2_3_2/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellChecker.html)
 before you start 
 

have fun, 
eks



----- Original Message ----
> From: Michael McCandless <luc...@mikemccandless.com>
> To: java-dev@lucene.apache.org
> Sent: Thursday, 18 June, 2009 16:29:59
> Subject: Re: Fuzzy search change
> 
> This would make an awesome addition to Lucene!
> 
> This is similar to how Lucene's spellchecker identifies candidates, if
> I understand it right.
> 
> Would you be able to port it to java?
> 
> Mike
> 
> On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussawrote:
> > Hi,
> >
> > I wrote on this a long time ago, but haven't followed it up. I just finished
> > a C++ implementation of a spell check module in my software. I borrowed the
> > idea from Xapian. It is to use a trigram index to filter results, and then
> > use Edit Distance on the filtered set. Would such a solution be acceptable
> > to the Lucene Community? The details of my implementation are as follows:
> >
> > 1) QDBM data store hash map
> > 2) Trigram tokenizer on the input string
> > 3) Data store hash(key,value) = (trigram, keyword_id_list
> > 4) Use trigram tokenizer and match with the trigram index
> > 5) Get the IDs within the input cutoff
> > 6) Run Edit Distance on the list and return
> >
> > In my tests on a Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, it
> > runs in <0.5 sec with a keyword record count of about 1,000,000 records.
> > This is at least 3-4 times less than the current search times on Lucene.
> >
> > Since the results can be put in a thread safe hash table structure, the
> > trigram search can be distributed over a thread pool also.
> >
> > Does this seem like a workable suggestion to the community?
> >
> > Regards
> >
> > --
> > Varun Dhussa
> > Product Architect
> > CE InfoSystems (P) Ltd
> > http://www.mapmyindia.com
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to