Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-20 Thread Morus Walter
David Spencer writes: > > > > could you put the current version of your code on that website as a java > > Weblog entry updated: > > http://searchmorph.com/weblog/index.php?id=23 > thanks > > Great suggestion and thanks for that idiom - I should know such things > by now. To clarify the "issu

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-16 Thread David Spencer
Morus Walter wrote: Hi David, Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. Let's see. [1] Source is attached

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-16 Thread Morus Walter
Hi David, > > Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 > phases. First you build a "fast lookup index" as mentioned above. Then > to correct a word you do a query in this index based on the ngrams in > the misspelled word. > > Let's see. > > [1] Source is attached
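
The two phases described in the message above can be sketched roughly as follows. This is a minimal, library-neutral illustration in Python, not the NGramSpeller source and not Lucene code: the function names (`ngrams`, `build_lookup_index`, `suggest`), the n-gram sizes, and the shared-n-gram scoring are all assumptions made for the example.

```python
from collections import defaultdict


def ngrams(word, n_min=2, n_max=3):
    """Return the set of character n-grams of `word` for n in [n_min, n_max]."""
    return {word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)}


def build_lookup_index(terms):
    """Phase 1: map each n-gram to the set of known terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in ngrams(term):
            index[gram].add(term)
    return index


def suggest(index, misspelled, max_results=5):
    """Phase 2: rank known terms by how many n-grams they share with the input."""
    scores = defaultdict(int)
    for gram in ngrams(misspelled):
        for term in index.get(gram, ()):
            scores[term] += 1
    # Highest overlap first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_results]


# Hypothetical term list standing in for the terms of a real index.
index = build_lookup_index(["recursive", "descent", "parser", "const", "count"])
print(suggest(index, "recursize"))
```

Only terms that share at least one n-gram with the input are ever scored, which is what makes the lookup cheaper than comparing the input against every term.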

RE: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-15 Thread Aad Nales
Also, you can use an alternative spellchecker for the 'checking part' and use the Ngram algorithm for the 'suggestion' part. Only if the spell 'check' declares a word illegal would the 'suggestion' part perform its magic. cheers, Aad Doug Cutting wrote: > David Spencer wrote: > >> [1] Th

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("recursive" and "descent") and suggests alternatives t

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Andrzej Bialecki wrote: David Spencer wrote: To restate the question for a second. The misspelled word is: "conts". The suggestion expected is "const", which seems reasonable enough as it's just a transposition away, thus the string distance is low. But - I guess the problem w/ the algorithm is

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
om/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1 -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Wednesday, 15 September, 2004 12:23 To: Lucene Users List Subject: Re: NGramSpeller cont

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread Andrzej Bialecki
David Spencer wrote: To restate the question for a second. The misspelled word is: "conts". The suggestion expected is "const", which seems reasonable enough as it's just a transposition away, thus the string distance is low. But - I guess the problem w/ the algorithm is that for short words lik
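
To make the "just a transposition away" remark concrete, here is a small sketch of an edit distance that counts a swap of two adjacent characters as a single edit (the optimal string alignment variant of Damerau-Levenshtein). It is an illustration only: the thread does not say which distance the speller actually uses, and the function name `osa_distance` is made up for this example.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b.

    Insertion, deletion, substitution, and a swap of two adjacent
    characters each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent transposition, e.g. "ts" <-> "st".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("conts", "const"))
```

Under plain Levenshtein distance, which has no transposition step, the same pair needs two edits.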

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Andrzej Bialecki wrote: Aad Nales wrote: David, Perhaps I misunderstand something so please correct me if I do. I used http://www.searchmorph.com/kat/spell.jsp to look for conts without changing any of the default values. What I got as results did not include 'const' which has quite a high frequenc

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
itions to the code and will report back if anything of interest changes here. -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Wednesday, 15 September, 2004 12:23 To: Lucene Users List Subject: Re: NGramSpeller contribution -- Re: combining open office spellchecker w

RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread Aad Nales
y expectations (most likely ;-) 2. something in the code.. -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Wednesday, 15 September, 2004 12:23 To: Lucene Users List Subject: Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene Aad N

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread Andrzej Bialecki
Aad Nales wrote: David, Perhaps I misunderstand something so please correct me if I do. I used http://www.searchmorph.com/kat/spell.jsp to look for conts without changing any of the default values. What I got as results did not include 'const' which has quite a high frequency in your index and ???

RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread Aad Nales
uld have a pretty low levenshtein distance. Any idea what causes this behavior? cheers, Aad -Original Message- From: David Spencer [mailto:[EMAIL PROTECTED] Sent: Tuesday, 14 September, 2004 21:23 To: Lucene Users List Subject: NGramSpeller contribution -- Re: combining open office spellch

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("recursive" and "descent") and suggests alternatives t

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("recursive" and "descent") and suggests alternatives to "recursize"...thu

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
Andrzej Bialecki wrote: I was wondering about the way you build the n-gram queries. You basically don't care about their position in the input term. Originally I thought about using PhraseQuery with a slop - however, after checking the source of PhraseQuery I realized that this probably wouldn't

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote: David Spencer wrote: ...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existi

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Andrzej Bialecki
David Spencer wrote: ...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
List Subject: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene Andrzej Bialecki wrote: David Spencer wrote: I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string

RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Tate Avery
: combining open office spellchecker with Lucene Andrzej Bialecki wrote: > David Spencer wrote: > >> >> I can/should send the code out. The logic is that for any terms in a >> query that have zero matches, go thru all the terms(!) and calculate >> the Levenshtein s

NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote: David Spencer wrote: I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead
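
The brute-force approach quoted in the message above (compare the unmatched query term against every term and keep the closest ones) can be sketched like this. It is a plain Python illustration under stated assumptions, not the code David Spencer refers to: `levenshtein` and `best_matches` are hypothetical names, and the term list stands in for the terms of a real index.

```python
def levenshtein(a, b):
    """Classic Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def best_matches(word, terms, max_results=3):
    """Scan every term and return the ones closest to `word`."""
    return sorted(terms, key=lambda t: (levenshtein(word, t), t))[:max_results]


# Hypothetical term list; a real index could hold very many terms.
terms = ["recursive", "descent", "parser", "const", "count"]
print(best_matches("recursize", terms))
```

Every lookup touches every term, so the cost grows with the size of the term list; that is the motivation for the n-gram lookup index discussed elsewhere in the thread.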

RE: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-11 Thread Aad Nales
Doug Cutting wrote: > David Spencer wrote: > >> Doug Cutting wrote: >> >>> And one should not try correction at all for terms which occur in a >>> large proportion of the collection. >> >> >> >> I keep thinking over this one and I don't understand it. If a user >> misspells a word and the "did yo

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: Doug Cutting wrote: And one should not try correction at all for terms which occur in a large proportion of the collection. I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algo

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread Doug Cutting
David Spencer wrote: Doug Cutting wrote: And one should not try correction at all for terms which occur in a large proportion of the collection. I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines th

frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote: Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to b

Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
eks dev wrote: Hi Doug, Perhaps. Are folks really better at spelling the beginning of words? Yes they are. There were some comprehensive empirical studies on this topic. Winkler modification on Jaro string distance is based on this assumption (boosting similarity if first n, I think 4, chars mat

Re: combining open office spellchecker with Lucene

2004-09-10 Thread eks dev
Hi Doug, > Perhaps. Are folks really better at spelling the > beginning of words? Yes they are. There were some comprehensive empirical studies on this topic. Winkler modification on Jaro string distance is based on this assumption (boosting similarity if first n, I think 4, chars match). Jaro-W

Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
David Spencer wrote: Good heuristics but are there any more precise, standard guidelines as to how to balance or combine what I think are the following possible criteria in suggesting a better choice: Not that I know of. - ignore(penalize?) terms that are rare I think this one is easy to threshol

Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Doug Cutting wrote: Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to b

Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analy

Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Andrzej Bialecki wrote: David Spencer wrote: I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead

Re: combining open office spellchecker with Lucene

2004-09-09 Thread Andrzej Bialecki
David Spencer wrote: I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead look for terms that also

Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Aad Nales wrote: Hi All, Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create

combining open office spellchecker with Lucene

2004-09-09 Thread Aad Nales
Hi All, Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analyzer base