Otis,

I think what was proposed is a spell checker that works on multiple tokens /
whole documents:

the field to be searched with the SpellChecker, e.g. "lucene search
library", does not get tokenized before being fed to the SpellChecker; rather it
is treated as a single token that gets chopped into n-grams.

"lucene search library"->"lu 
uc 
ce 
en 
ne se ea ar".... 
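
To make the chopping step concrete, here is a minimal sketch (a hypothetical helper, not the actual SpellChecker n-gram code) of turning an untokenized string into character bigrams, spaces included:

import java.util.ArrayList;
import java.util.List;

public class NGramDemo {

    // Chop the whole (untokenized) input into overlapping character bigrams.
    static List<String> bigrams(String text) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            grams.add(text.substring(i, i + 2));
        }
        return grams;
    }

    public static void main(String[] args) {
        // bigrams of the whole string, space-containing ones included:
        // lu, uc, ce, en, ne, "e ", " s", se, ea, ar, rc, ch, "h ", ...
        System.out.println(bigrams("lucene search library"));
    }
}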

So far so good, that would work, but the re-scoring part of the SpellChecker would
be far from optimal for these cases: plain edit distance does not support gaps
(some sort of Needleman-Wunsch alignment would work nicely).
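
For illustration, a minimal Needleman-Wunsch scoring sketch (made-up match/mismatch/gap weights, nothing from the Lucene contrib) showing the kind of gap-aware re-scoring I mean:

public class NeedlemanWunsch {

    static final int MATCH = 1, MISMATCH = -1, GAP = -1;

    // Classic global alignment score; higher means a closer match.
    static int align(String a, String b) {
        int[][] score = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) score[i][0] = i * GAP;
        for (int j = 1; j <= b.length(); j++) score[0][j] = j * GAP;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = score[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                int gapInB = score[i - 1][j] + GAP;
                int gapInA = score[i][j - 1] + GAP;
                score[i][j] = Math.max(diag, Math.max(gapInB, gapInA));
            }
        }
        return score[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // "common wealth" aligns well against "commonwealth" despite the gap:
        // prints 11 (12 matches minus one gap).
        System.out.println(align("commonwealth", "common wealth"));
    }
}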

Think of this as having an n-gram index of documents, rather than of tokens. This
approach absolutely makes sense for short "documents" like an address.

And to come back to the question: yes, you can make a separate Field from your
address and index it as a keyword (untokenized); then you can just feed this field
to the SpellChecker (a rough sketch follows the examples below). It will not be
perfect... but it will do a solid job. Your case will be covered ok, e.g.:

Commonwealth:
Communwealth
Comonwealth
Common wealth 
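
Here is the rough sketch mentioned above, assuming the Lucene 2.x-era contrib SpellChecker API; the field name "address" and the sample data are just placeholders:

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.RAMDirectory;

public class AddressSpellDemo {
    public static void main(String[] args) throws Exception {
        // Index each address name as a single, untokenized keyword.
        RAMDirectory index = new RAMDirectory();
        IndexWriter writer = new IndexWriter(index, new KeywordAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("address", "commonwealth",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // Build the n-gram spell index from that untokenized field.
        RAMDirectory spellIndexDir = new RAMDirectory();
        SpellChecker spellChecker = new SpellChecker(spellIndexDir);
        IndexReader reader = IndexReader.open(index);
        spellChecker.indexDictionary(new LuceneDictionary(reader, "address"));
        reader.close();

        // Suggestions are made against the whole string, so "common wealth"
        // can still come back as "commonwealth".
        String[] suggestions = spellChecker.suggestSimilar("common wealth", 5);
        for (String s : suggestions) {
            System.out.println(s);
        }
    }
}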


----- Original Message ----
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 31 January, 2008 6:02:28 AM
Subject: Re: Spell checking street names

Hmmm, "untokenized n-gram spell checker"... does that really make sense?
lucene as 2-gram: lu uc ce en ne..... but all as a single token?  No, I
don't think that will work with the Lucene spellchecker.

As for non-tokenizing Analyzer - KeywordAnalyzer.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Max Metral <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, January 30, 2008 11:34:20 AM
Subject: Spell checking street names

I'm using Lucene to spell check street names.  Right now, I'm using
Double Metaphone on the street name (we have a sophisticated regex to
parse out the NAME as opposed to the unit, number, street type, or
suffix).  I think that Double Metaphone is probably overkill/wrong, and
a spell checking approach (n-gram based) would be better.  Part of the
reason is if we look at some common mistakes:

For Commonwealth:
Communwealth
Comonwealth
Common wealth

Double metaphone will get the first two, but not the last.  Spell check
(I think) would get all 3.  The last is much more common than in typical
generic text search (Fairoaks vs. Fair Oaks, New Market vs. Newmarket,
etc).  However, spell check will only get the third if the n-gram input
is untokenized (right?).

Conceptually, I feel like people will most often misspell or mistype
rather than completely omitting words from the street name.  So running
the n-gram on the untokenized street name seems like a good thing.
Problem is I can't see how I do this, SpellChecker seems to always want
to tokenize things, and I'm a bit confused on how to give it an analyzer
that doesn't tokenize.

I feel like this might be a newbie question, so apologies if so.  But,
1) does an untokenized n-gram spell checker seem like a good thing for
this app?
2) Which analyzer can I use for no tokenization at all?

--Max




