David Spencer writes:
> > could you put the current version of your code on that website as a java
>
> Weblog entry updated:
>
> http://searchmorph.com/weblog/index.php?id=23

thanks

> Great suggestion and thanks for that idiom - I should know such things
> by now. To clarify the "issu
Morus Walter wrote:
Hi David,
Based on this mail I wrote an "ngram speller" for Lucene. It runs in two
phases. First you build a "fast lookup index" as mentioned above. Then,
to correct a word, you do a query in this index based on the ngrams in
the misspelled word.
Let's see.
[1] Source is attached
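For anyone skimming the attachment, the two phases boil down to something like the following sketch (plain Java, with a HashMap standing in for the Lucene lookup index; the method names are mine, not the ones in the attached source):

```java
import java.util.*;

public class NGramLookup {
    // Phase 1 helper: split a term into overlapping character n-grams,
    // e.g. "lucene" with n=3 -> [luc, uce, cen, ene].
    static List<String> ngrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++)
            grams.add(term.substring(i, i + n));
        return grams;
    }

    // Phase 1: build the "fast lookup index" as a map from each n-gram
    // to the dictionary terms containing it (Lucene stores this as an
    // inverted index; a plain map stands in for it here).
    static Map<String, Set<String>> buildIndex(Collection<String> dict, int n) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String term : dict)
            for (String g : ngrams(term, n))
                index.computeIfAbsent(g, k -> new TreeSet<>()).add(term);
        return index;
    }

    // Phase 2: query the index with the n-grams of the misspelled word
    // and count how many grams each candidate shares with it.
    static Map<String, Integer> candidates(Map<String, Set<String>> index,
                                           String misspelled, int n) {
        Map<String, Integer> hits = new HashMap<>();
        for (String g : ngrams(misspelled, n))
            for (String term : index.getOrDefault(g, Set.of()))
                hits.merge(term, 1, Integer::sum);
        return hits;
    }
}
```

Candidates with the most shared grams are then re-ranked by string distance, as discussed below.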
Also,
You can also use an alternative spellchecker for the 'checking' part and
use the ngram algorithm for the 'suggestion' part. Only if the spell
'check' declares a word illegal would the 'suggestion' part perform its
magic.
cheers,
Aad
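That division of labor might look like the following sketch (the dictionary-membership test stands in for whatever spellchecker does the 'check' part, and the trigram-overlap suggester is a toy version of the 'suggestion' part):

```java
import java.util.*;

public class CheckThenSuggest {
    // 'Check' part: any spellchecker can go here; dictionary
    // membership is the simplest possible stand-in.
    static boolean isLegal(Set<String> dict, String word) {
        return dict.contains(word);
    }

    // Only if the checker declares the word illegal does the
    // n-gram 'suggestion' part run.
    static List<String> correct(Set<String> dict, String word) {
        if (isLegal(dict, word))
            return List.of(word);       // checker says OK, nothing to do
        return suggest(dict, word);     // checker says illegal: suggest
    }

    // Toy suggester: rank dictionary terms by shared trigrams.
    static List<String> suggest(Set<String> dict, String word) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= word.length(); i++)
            grams.add(word.substring(i, i + 3));
        List<String> best = new ArrayList<>();
        int bestShared = 0;
        for (String t : dict) {
            int shared = 0;
            for (int i = 0; i + 3 <= t.length(); i++)
                if (grams.contains(t.substring(i, i + 3))) shared++;
            if (shared > bestShared) { bestShared = shared; best.clear(); }
            if (shared == bestShared && shared > 0) best.add(t);
        }
        return best;
    }
}
```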
Doug Cutting wrote:
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a
term in the index, but the next 2 are. So it ignores the last 2 terms
("descent" and "parser") and suggests alternatives to "recursize"...
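The detection step in [2] is just a per-word frequency test; a sketch, with a term -> frequency map standing in for the index (a real version would ask IndexReader.docFreq):

```java
import java.util.*;

public class QueryTermCheck {
    // Partition query words into known terms (docFreq > 0 in the
    // index) and unknown ones that need "did you mean" suggestions.
    // A term -> document-frequency map stands in for the index here.
    static List<String> unknownTerms(Map<String, Integer> termFreqs, String query) {
        List<String> unknown = new ArrayList<>();
        for (String word : query.toLowerCase().split("\\s+"))
            if (termFreqs.getOrDefault(word, 0) == 0)
                unknown.add(word);
        return unknown;
    }
}
```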
Andrzej Bialecki wrote:
David Spencer wrote:
To restate the question for a second.
The misspelled word is: "conts".
The suggestion expected is "const", which seems reasonable enough as
it's just a transposition away, thus the string distance is low.
But - I guess the problem w/ the algorithm is that for short words like
om/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1
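For what it's worth, "conts" -> "const" is 2 edits under plain Levenshtein but only 1 if an adjacent transposition counts as a single edit - presumably why the URL above carries a btranspose boost. A sketch of the transposition-aware (optimal string alignment) variant:

```java
public class Damerau {
    // Optimal-string-alignment distance: Levenshtein (insert, delete,
    // substitute) plus a unit cost for swapping two adjacent chars.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // adjacent transposition, e.g. "ts" <-> "st"
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1))
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost);
            }
        }
        return d[a.length()][b.length()];
    }
}
```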
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
...itions to the code and will
report back if anything of interest changes here.
...1. my expectations (most likely ;-)
2. something in the code..
Aad Nales wrote:
David,
Perhaps I misunderstand something so please correct me if I do. I used
http://www.searchmorph.com/kat/spell.jsp to look for conts without
changing any of the default values. What I got as results did not
include 'const', which has quite a high frequency in your index and
would have a pretty low levenshtein distance. Any idea what causes this
behavior?
cheers,
Aad
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 14 September, 2004 21:23
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene
Andrzej Bialecki wrote:
I was wondering about the way you build the n-gram queries. You
basically don't care about their position in the input term. Originally
I thought about using PhraseQuery with a slop - however, after checking
the source of PhraseQuery I realized that this probably wouldn't
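One cheap way to keep a little positional information without PhraseQuery is to give the word-initial (and word-final) n-grams extra weight - which appears to be what the bstart/bend parameters in the demo URL do. A toy scoring sketch (the weights and the method are mine, not the posted code):

```java
public class EdgeBoostedNGrams {
    // Score a candidate against a misspelled word by shared trigrams,
    // weighting the word-initial gram by bstart and the final gram by
    // bend, interior grams at 1.0 - a stand-in for indexing tagged
    // fields like a boosted "start3:con" in Lucene.
    static double score(String misspelled, String candidate,
                        double bstart, double bend) {
        double s = 0;
        for (int i = 0; i + 3 <= misspelled.length(); i++) {
            String g = misspelled.substring(i, i + 3);
            if (!candidate.contains(g)) continue;
            if (i == 0) s += bstart;                        // first gram
            else if (i + 3 == misspelled.length()) s += bend; // last gram
            else s += 1.0;                                   // interior
        }
        return s;
    }
}
```

With bstart > bend, candidates that agree on the start of the word are favored, matching the "people spell beginnings right" observation further down the thread.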
Andrzej Bialecki wrote:
David Spencer wrote:
...or prepare in advance a fast lookup index - split all existing
terms to bi- or trigrams, create a separate lookup index, and then
simply for each term ask a phrase query (phrase = all n-grams from
an input term), with a slop > 0, to get similar existing terms.
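Ranking the terms such a query returns can also be done with a plain set-overlap measure instead of phrase slop - e.g. the Dice coefficient over trigram sets, normalized so longer terms don't win merely by having more grams. A sketch (my formulation, not the posted code):

```java
import java.util.*;

public class NGramSimilarity {
    static Set<String> grams(String s, int n) {
        Set<String> g = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) g.add(s.substring(i, i + n));
        return g;
    }

    // Dice coefficient over trigram sets: 2*|shared| / (|a| + |b|).
    // Plays the role of the phrase-with-slop idea: terms that share
    // many grams score high, without caring about exact gram positions.
    static double dice(String a, String b) {
        Set<String> ga = grams(a, 3), gb = grams(b, 3);
        if (ga.isEmpty() || gb.isEmpty()) return 0;
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);
        return 2.0 * shared.size() / (ga.size() + gb.size());
    }
}
```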
Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a
query that have zero matches, go thru all the terms(!) and calculate
the Levenshtein string distance, and return the best matches. A more
intelligent way of doing this is to instead look for terms that also
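The brute-force version described here - walk every term, compute the distance, keep the best - is only a few lines; a sketch with a plain collection standing in for the term enumeration (the real code would walk the index's terms via IndexReader):

```java
import java.util.*;

public class BruteForceSuggest {
    // Plain Levenshtein distance (insert/delete/substitute, unit
    // costs), rolling two rows to keep memory small.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(prev[j] + 1, cur[j - 1] + 1),
                                  prev[j - 1] + cost);
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }

    // Go through all the terms(!) and return the closest ones.
    static List<String> best(Collection<String> allTerms, String word, int max) {
        List<String> terms = new ArrayList<>(allTerms);
        terms.sort(Comparator.comparingInt((String t) -> levenshtein(t, word)));
        return terms.subList(0, Math.min(max, terms.size()));
    }
}
```

The n-gram lookup index earlier in the thread exists precisely to avoid this full scan: it narrows the candidate set before the distance is computed.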
Doug Cutting wrote:
> David Spencer wrote:
>
>> Doug Cutting wrote:
>>
>>> And one should not try correction at all for terms which occur in a
>>> large proportion of the collection.
>>
>> I keep thinking over this one and I don't understand it. If a user
>> misspells a word and the "did you mean" spelling correction algorithm
>> determines th
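Doug's rule amounts to a one-line gate in code: skip correction once a term's document frequency exceeds some fraction of the collection (the fraction itself is an arbitrary tuning choice, not a recommendation):

```java
public class CorrectionGate {
    // Skip correction for terms occurring in more than maxFraction of
    // the collection's documents: such terms are almost certainly
    // spelled as the user intended, and second-guessing them only
    // produces noise.
    static boolean worthCorrecting(int docFreq, int numDocs, double maxFraction) {
        return docFreq < maxFraction * numDocs;
    }
}
```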
Doug Cutting wrote:
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
eks dev wrote:
Hi Doug,
> Perhaps. Are folks really better at spelling the
> beginning of words?
Yes they are. There were some comprehensive empirical
studies on this topic. The Winkler modification of the Jaro
string distance is based on this assumption (boosting
similarity if the first n, I think 4, chars match).
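For reference, a sketch of Jaro similarity plus the Winkler prefix boost (common prefix capped at 4 chars, scaling factor 0.1, per the usual formulation):

```java
public class JaroWinkler {
    static double jaro(String a, String b) {
        if (a.equals(b)) return 1.0;
        int window = Math.max(a.length(), b.length()) / 2 - 1;
        boolean[] ma = new boolean[a.length()], mb = new boolean[b.length()];
        int matches = 0;
        // count characters matching within the sliding window
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!mb[j] && a.charAt(i) == b.charAt(j)) {
                    ma[i] = mb[j] = true; matches++; break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // half the number of out-of-order matches = transpositions
        int t = 0, k = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!ma[i]) continue;
            while (!mb[k]) k++;
            if (a.charAt(i) != b.charAt(k)) t++;
            k++;
        }
        double m = matches;
        return (m / a.length() + m / b.length() + (m - t / 2.0) / m) / 3.0;
    }

    // Winkler boost: reward a common prefix of up to 4 chars, scale
    // 0.1 - the "people spell word beginnings right" assumption.
    static double jaroWinkler(String a, String b) {
        double j = jaro(a, b);
        int prefix = 0;
        while (prefix < Math.min(4, Math.min(a.length(), b.length()))
                && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```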
David Spencer wrote:
Good heuristics, but are there any more precise, standard guidelines as
to how to balance or combine what I think are the following possible
criteria in suggesting a better choice:
Not that I know of.
- ignore (penalize?) terms that are rare
I think this one is easy to threshold
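Absent a standard guideline, one simple combination is: filter by a rarity threshold and a maximum distance, then sort by distance with collection frequency as the tie-breaker. A sketch - the cutoffs and ordering are arbitrary assumptions, not established practice:

```java
import java.util.*;

public class SuggestionRanker {
    // Combine edit distance with collection frequency: closer strings
    // win, popularity breaks ties, and terms rarer than minFreq are
    // filtered out (the "penalize rare terms" heuristic).
    static List<String> rank(Map<String, Integer> termFreqs, String word,
                             int minFreq, int maxDistance) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet())
            if (e.getValue() >= minFreq
                    && distance(e.getKey(), word) <= maxDistance)
                out.add(e.getKey());
        // closest first; among equals, most frequent first
        out.sort(Comparator.comparingInt((String t) -> distance(t, word))
                .thenComparing(t -> -termFreqs.get(t)));
        return out;
    }

    // Plain Levenshtein distance, unit costs.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1]
                                + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }
}
```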
Aad Nales wrote:
Hi All,
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer