Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

David Spencer Thu, 16 Sep 2004 11:58:41 -0700

Morus Walter wrote:

Hi David,
Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word.
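
To make the second phase a bit more concrete, here is a minimal sketch of how a word breaks down into the n-grams that would be looked up in the index. It is not taken from NGramSpeller.java: the class and method names are made up for illustration, and n = 3 is only an assumed gram size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of NGramSpeller.java.
class NGramSketch {
    // Returns every contiguous substring of length n, left to right.
    // A word shorter than n yields an empty list.
    static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        // A correct word and a misspelling of it (assumed example words).
        System.out.println(ngrams("lucene", 3));
        System.out.println(ngrams("lucnee", 3));
    }
}
```

The two words share the gram "luc", so a query built from the grams of the misspelled word can still match the correctly spelled entry. How the real speller weights and ranks such matches is not shown here.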
Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful.
great :-)
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
could you put the current version of your code on that website as a java
source also? At least until it's in the lucene sandbox.

Weblog entry updated:

http://searchmorph.com/weblog/index.php?id=23

To link to source code:

http://www.searchmorph.com/pub/ngramspeller/NGramSpeller.java

I created an ngram index on one of my indexes and think I found an issue
in the indexing code:
There is an option -f to specify the field on which the ngram index will be created. However there is no code to restrict the term enumeration on this field.

So instead of
    final TermEnum te = r.terms();
I'd suggest
    final TermEnum te = r.terms(new Term(field, ""));
and a check within the loop over the terms that the enumerated term still has field name field, e.g.
    Term t = te.term();
    if ( !t.field().equals(field) ) { break; }

otherwise you loop over all terms in all fields.
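
The idiom works because the term dictionary is sorted by field first and by term text second, so seeking to (field, "") lands on the first term of that field, and the first term with a different field name marks the end. Below is a small self-contained sketch of that logic. It does not use Lucene itself: the Term class and the TreeSet are hypothetical stand-ins for Lucene's Term and TermEnum, assumed only to share that sort order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustration only: models a sorted term dictionary without Lucene.
class FieldTermScan {
    // Hypothetical stand-in for a Lucene term, ordered by field, then text.
    static final class Term implements Comparable<Term> {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(Term o) {
            int c = field.compareTo(o.field);
            return c != 0 ? c : text.compareTo(o.text);
        }
    }

    // Collects the term texts of one field: seek to (field, ""), then stop
    // at the first term that belongs to a different field.
    static List<String> termsOfField(TreeSet<Term> dict, String field) {
        List<String> out = new ArrayList<>();
        for (Term t : dict.tailSet(new Term(field, ""), true)) {
            if (!t.field.equals(field)) {
                break; // left the requested field, nothing more to see
            }
            out.add(t.text);
        }
        return out;
    }

    public static void main(String[] args) {
        TreeSet<Term> dict = new TreeSet<>();
        dict.add(new Term("title", "apache"));
        dict.add(new Term("body", "lucene"));
        dict.add(new Term("body", "index"));
        dict.add(new Term("title", "zoo"));
        System.out.println(termsOfField(dict, "body"));
    }
}
```

Without the break, the loop would run on into every field that sorts after the requested one, which is the behaviour described above.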

Great suggestion and thanks for that idiom - I should know such things by now. To clarify the "issue", it's just a performance one, not other functionality... anyway I put in the code - and to be scientific I benchmarked it two times before the change and two times after - and the results were surprisingly the same both times (1:45 to 1:50 with an index that takes up > 200MB). Probably there are cases where this will run faster, and the code seems more "correct" now so it's in.

An interesting application of this might be an ngram-Index enhanced version
of the FuzzyQuery. While this introduces more complexity on the indexing
side, it might be a large speedup for fuzzy searches.

I'm also thinking of reviewing the list to see if anyone has done a "Jaro Winkler" fuzzy query yet and doing that....

Thanks,
 Dave

Morus

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
