Otis, I think it was proposed to have spell checker that works on multiple tokens / Document:
where field to be searched with SpellChecker" looks like "lucene search library" does not get tokenized and then fed to the SpellChecker, rather having this as a "single token" that gets chopped into n-Grams. "lucene search library"->"lu uc ce en ne se ea ar".... so far so good, that would work, but re-scoring part of the SpellChecker would be far for optimal for these cases, plain edit distance does not support gaps (some sort of Needleman Wunsch would work nice). Think of this as having N-Gram index of Documents, rather than tokens. This approach absolutely makes sense for short "documents" like Address. And to come back to the Question, yes, you can make separate Field from your address and index it as a keyword, than you can just feed this field to the SpellChecker. It will not be perfect.... but will do solid job. Your case will be covered ok. Commonwealth: Communwealth Comonwealth Common wealth ----- Original Message ---- From: Otis Gospodnetic <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, 31 January, 2008 6:02:28 AM Subject: Re: Spell checking street names Hmmm, "untokenized n-gram spell checker"... does that really make sense? lucene as 2-gram: lu uc ce en ne..... but all as a single token? No, I don't think that will work with the Lucene spellchecker. As for non-tokenizing Analyzer - KeywordAnalyzer. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch ----- Original Message ---- From: Max Metral <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, January 30, 2008 11:34:20 AM Subject: Spell checking street names I'm using Lucene to spell check street names. Right now, I'm using Double Metaphone on the street name (we have a sophisticated regex to parse out the NAME as opposed to the unit, number, street type, or suffix). I think that Double Metaphone is probably overkill/wrong, and a spell checking approach (n-gram based) would be better. Part of the reason is if we look at some common mistakes: For Commonwealth: Communwealth Comonwealth Common wealth Double metaphone will get the first two, but not the last. Spell check (I think) would get all 3. The last is much more common than in typical generic text search (Fairoaks vs. Fair Oaks, New Market vs. Newmarket, etc). However, spell check will only get the third if the n-gram input is untokenized (right?). Conceptually, I feel like people will most often misspell or mistype rather than completely omitting words from the street name. So running the n-gram on the untokenized street name seems like a good thing. Problem is I can't see how I do this, SpellChecker seems to always want to tokenize things, and I'm a bit confused on how to give it an analyzer that doesn't tokenize. I feel like this might be a newbie question, so apologies if so. But, 1) does an untokenized n-gram spell checker seem like a good thing for this app? 2) Which analyzer can I use for no tokenization at all? --Max --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] __________________________________________________________ Sent from Yahoo! Mail - a smarter inbox http://uk.mail.yahoo.com --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]