Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
David Spencer writes:
> > could you put the current version of your code on that website as a Java source also?
>
> Weblog entry updated: http://searchmorph.com/weblog/index.php?id=23

thanks

> Great suggestion and thanks for that idiom - I should know such things by now. To clarify the "issue", it's just a performance one, not other functionality... anyway I put in the code - and to be scientific I benchmarked it twice before the change and twice after - and the results were surprisingly the same both times (1:45 to 1:50 with an index that takes up > 200MB). Probably there are cases where this will run faster, and the code seems more "correct" now, so it's in.

Ahh, I see, you check the field later. The logging made me think you index all the fields you loop over, in which case one might get unwanted words into the ngram index.

> > An interesting application of this might be an ngram-index enhanced version of the FuzzyQuery. While this introduces more complexity on the indexing side, it might be a large speedup for fuzzy searches.
>
> I'm also thinking of reviewing the list to see if anyone has done a "Jaro Winkler" fuzzy query yet and doing that.

I went in another direction and changed the ngram index and search to use a similarity that computes m * m / (n1 * n2), where m is the number of matches, n1 is the number of ngrams in the query and n2 is the number of ngrams in the word. (At least if I got that right; I'm not sure I understand all parts of the Similarity class correctly.) After removing the document boost in the ngram index based on the word frequency in the original index, I find the results pretty good. My data is a number of encyclopedias and dictionaries and I only use the headwords for the ngram index, so term frequency doesn't seem relevant in this case. I still use the Levenshtein distance to modify the score and sort according to score / distance, but in most cases this makes no difference, so I'll probably drop the distance calculation completely. I also see little difference between using 2- and 3-grams together and using only 2-grams, so I'll presumably drop the 3-grams. I'm not sure the similarity I use is useful in general, but I attached it to this message in case someone is interested. Note that you need to set the similarity for both the index writer and the searcher, and thus have to reindex if you want to give it a try.

Morus
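Morus's similarity class is attached to his mail and not reproduced here, but the formula maps fairly directly onto Lucene 1.4's pluggable Similarity. Below is a sketch of one way to get m * m / (n1 * n2) out of the standard scoring; it is a reconstruction from the description above, not the attached class, the class name is made up, and since Lucene stores field norms with limited precision the result is only approximate.

import org.apache.lucene.search.DefaultSimilarity;

/**
 * Scores an ngram lookup roughly as m*m / (n1*n2): m = number of matching
 * ngrams, n1 = ngrams in the query, n2 = ngrams in the stored word.
 * Each matched term contributes tf*idf*norm = 1/n2, so the sum is m/n2;
 * the coord factor m/n1 then gives m*m/(n1*n2).
 */
public class NGramSimilarity extends DefaultSimilarity {

  // norm contributes 1/n2 per matched term (n2 = ngrams stored for the word)
  public float lengthNorm(String fieldName, int numTerms) {
    return numTerms == 0 ? 0.0f : 1.0f / numTerms;
  }

  // each matching ngram counts once, however often it repeats in the word
  public float tf(float freq) {
    return freq > 0 ? 1.0f : 0.0f;
  }

  // ignore how rare an ngram is across the lookup index
  public float idf(int docFreq, int numDocs) {
    return 1.0f;
  }

  // don't normalize by the query weight
  public float queryNorm(float sumOfSquaredWeights) {
    return 1.0f;
  }

  // m/n1: the fraction of the query's ngrams that matched
  public float coord(int overlap, int maxOverlap) {
    return maxOverlap == 0 ? 0.0f : (float) overlap / maxOverlap;
  }
}

As Morus notes, the same similarity has to be set on both the writer that builds the ngram index and the searcher, and the index rebuilt, for this to take effect.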
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Morus Walter wrote:

> Hi David,
>
> > Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. Let's see.
> >
> > [1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful.
>
> great :-)
>
> > [4] Here's source in HTML: http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
>
> could you put the current version of your code on that website as a Java source also? At least until it's in the lucene sandbox.

Weblog entry updated: http://searchmorph.com/weblog/index.php?id=23

To link to the source code: http://www.searchmorph.com/pub/ngramspeller/NGramSpeller.java

> I created an ngram index on one of my indexes and think I found an issue in the indexing code: There is an option -f to specify the field on which the ngram index will be created. However there is no code to restrict the term enumeration on this field. So instead of
>
> final TermEnum te = r.terms();
>
> I'd suggest
>
> final TermEnum te = r.terms(new Term(field, ""));
>
> and a check within the loop over the terms whether the enumerated term still has fieldname field, e.g.
>
> Term t = te.term();
> if ( !t.field().equals(field) ) { break; }
>
> otherwise you loop over all terms in all fields.

Great suggestion and thanks for that idiom - I should know such things by now. To clarify the "issue", it's just a performance one, not other functionality... anyway I put in the code - and to be scientific I benchmarked it twice before the change and twice after - and the results were surprisingly the same both times (1:45 to 1:50 with an index that takes up > 200MB). Probably there are cases where this will run faster, and the code seems more "correct" now, so it's in.

> An interesting application of this might be an ngram-index enhanced version of the FuzzyQuery. While this introduces more complexity on the indexing side, it might be a large speedup for fuzzy searches.

I'm also thinking of reviewing the list to see if anyone has done a "Jaro Winkler" fuzzy query yet and doing that.

Thanks,
Dave
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Hi David, > > Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 > phases. First you build a "fast lookup index" as mentioned above. Then > to correct a word you do a query in this index based on the ngrams in > the misspelled word. > > Let's see. > > [1] Source is attached and I'd like to contribute it to the sandbox, esp > if someone can validate that what it's doing is reasonable and useful. > great :-) > > [4] Here's source in HTML: > > http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152 > could you put the current version of your code on that website as a java source also? At least until it's in the lucene sandbox. I created an ngram index on one of my indexes and think I found an issue in the indexing code: There is an option -f to specify the field on which the ngram index will be created. However there is no code to restrict the term enumeration on this field. So instead of final TermEnum te = r.terms(); i'd suggest final TermEnum te = r.terms(new Term(field, "")); and a check within the loop over the terms if the enumerated term still has fieldname field, e.g. Term t = te.term(); if ( !t.field().equals(field) ) { break; } otherwise you loop over all terms in all fields. An interesting application of this might be an ngram-Index enhanced version of the FuzzyQuery. While this introduces more complexity on the indexing side, it might be a large speedup for fuzzy searches. Morus - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
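Put together, Morus's suggestion amounts to a loop like the one below. This is only a sketch: the index path, the field name and the loop body are placeholders for whatever NGramSpeller actually does with each term.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

// Enumerate only the terms of one field: start the enumeration at
// (field, "") and stop as soon as the field name changes, since TermEnum
// returns terms ordered by field name first, then by term text.
IndexReader r = IndexReader.open("/path/to/index");   // path is illustrative
String field = "contents";                              // the -f option's value
TermEnum te = r.terms(new Term(field, ""));
try {
    do {
        Term t = te.term();
        if (t == null || !t.field().equals(field)) {
            break;                  // ran past the end of this field's terms
        }
        String word = t.text();
        // ... build the ngram document for 'word' here ...
    } while (te.next());
} finally {
    te.close();
    r.close();
}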
RE: frequent terms - Re: combining open office spellchecker with Lucene
You can also use an alternative spellchecker for the 'checking' part and use the ngram algorithm for the 'suggestion' part: only if the spell check declares a word illegal would the 'suggestion' part perform its magic.

cheers,
Aad
Re: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote: David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("recursive" and "descent") and suggests alternatives to "recursize"...thus if any term is in the index, regardless of frequency, it is left as-is. I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly ( as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternativies to these words too (in addition to the words in the query that are not in the index at all). Almost. If the user enters "a recursize purser", then: "a", which is in, say, >50% of the documents, is probably spelled correctly and "recursize", which is in zero documents, is probably mispelled. But what about "purser"? If we run the spell check algorithm on "purser" and generate "parser", should we show it to the user? If "purser" occurs in 1% of documents and "parser" occurs in 5%, then we probably should, since "parser" is a more common word than "purser". But if "parser" only occurs in 1% of the documents and purser occurs in 5%, then we probably shouldn't bother suggesting "parser". If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does "purser" or "parser" occur more frequently near "descent". But that gets expensive. I updated the code to have an optional popularity filter - if true then it only returns matches more popular (frequent) than the word that is passed in for spelling correction. If true (default) then for common words like "remove", no results are returned now, as expected: http://www.searchmorph.com/kat/spell.jsp?s=remove But if you set it to false (bottom slot in the form at the bottom of the page) then the algorithm happily looks for alternatives: http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0 TBD I need to update the javadoc & repost the code I guess. Also as per earlier post I also store simple transpositions for words in the ngram-index. -- Dave Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
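The popularity filter described above comes down to comparing document frequencies in the original index. A rough sketch of that check follows; the method name and arguments are illustrative, not NGramSpeller's actual API.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

/**
 * Returns true if 'suggestion' should be kept when the popularity filter
 * is on: it must occur in more documents of the original index than the
 * (possibly misspelled) input word does.
 */
static boolean morePopular(IndexReader origReader, String field,
                           String input, String suggestion)
        throws java.io.IOException {
    int inputFreq = origReader.docFreq(new Term(field, input));
    int suggFreq  = origReader.docFreq(new Term(field, suggestion));
    return suggFreq > inputFreq;
}

With this gate on, a word that is already common, like "remove", filters out every candidate, which is the behaviour of the first demo URL above.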
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Andrzej Bialecki wrote:

> David Spencer wrote:
> > To restate the question for a second. The misspelled word is: "conts". The suggestion expected is "const", which seems reasonable enough as it's just a transposition away, thus the string distance is low. But - I guess the problem w/ the algorithm is that for short words like this, with transpositions, the two words won't share many ngrams. Just looking at 3grams... conts -> con ont nts, const -> con ons nst. Thus they share just one 3gram, which is why it scores so low. This is an interesting issue, how to tune the algorithm so that it might return words this close higher.
>
> If you added 2-grams, then it would look like this (constructing also special start/end grams):
>
> conts -> _c co on nt ts s_
> const -> _c co on ns st t_
>
> which gives 50% of overlap. In another system that I designed we were using a combination of 2-4 grams, albeit for a slightly different purpose, so in this case it would be:
>
> conts: _c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
> const: _c co on ns st t_, _co con ons nst st_, _con cons onst nst_
>
> and the overlap is 40%.

Oh, cute trick to indicate prefixes and suffixes. Anyway, as per my previous post I rebuilt the index w/ ngrams of length 2 to 5, and also store transpositions, with appropriate boosts :) - then "const" is returned 2nd:

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1

> > I guess one way is to add all simple transpositions to the lookup table (the "ngram index") so that these could easily be found, with the heuristic that "a frequent way of misspelling words is to transpose two adjacent letters".
>
> Yes, sounds like a good idea. Even though it increases the size of the lookup index, it still beats using the linear search...
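The boundary-marked decomposition shown above is cheap to generate; a small sketch, padding the word with '_' on both ends and sliding an n-character window (helper name is made up):

import java.util.ArrayList;
import java.util.List;

// Split a word into n-grams, using '_' to mark the start and end of the
// word as in the examples above, e.g. ngrams("conts", 2) ->
// [_c, co, on, nt, ts, s_].
static List ngrams(String word, int n) {
    String padded = "_" + word + "_";
    List grams = new ArrayList();
    for (int i = 0; i + n <= padded.length(); i++) {
        grams.add(padded.substring(i, i + n));
    }
    return grams;
}

For "conts" and "const" with n = 2 this yields six grams each, three of them shared - the 50% overlap quoted above.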
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Aad Nales wrote: By trying: if you type const you will find that it returns 216 hits. The third sports 'const' as a term (space seperated and all). I would expect 'conts' to return with const as well. But again I might be mistaken. I am now trying to figure what the problem might be: 1. my expectations (most likely ;-) 2. something in the code.. I enhanced the code to store simple transpositions also and I regenerated my site w/ ngrams from 2 to 5 chars. If you set the transposition boost up to 10 then "const" is returned 2nd... http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1 -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Wednesday, 15 September, 2004 12:23 To: Lucene Users List Subject: Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene Aad Nales wrote: David, Perhaps I misunderstand somehting so please correct me if I do. I used http://www.searchmorph.com/kat/spell.jsp to look for conts without changing any of the default values. What I got as results did not include 'const' which has quite a high frequency in your index and ??? how do you know that? Remember, this is an index of _Java_docs, and "const" is not a Java keyword. should have a pretty low levenshtein distance. Any idea what causes this behavior? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
David Spencer wrote: To restate the question for a second. The misspelled word is: "conts". The sugggestion expected is "const", which seems reasonable enough as it's just a transposition away, thus the string distance is low. But - I guess the problem w/ the algorithm is that for short words like this, with transpositions, the two words won't share many ngrams. Just looking at 3grams... conts -> con ont nts const -> con ons nst Thus they just share 1 3gram, thus this is why it scores so low. This is an interesting issue, how to tune the algorithm so that it might return words this close higher. If you added 2-grams, then it would look like this (constructing also special start/end grams): conts -> _c co on nt ts s_ const -> _c co on ns st t_ which gives 50% of overlap. In another system that I designed we were using a combination of 2-4 grams, albeit for a slightly different purpose, so in this case it would be: conts: _c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_ const: _c co on ns st t_, _co con ons nst st_, _con cons onst nst_ and the overlap is 40%. I guess one way is to add all simple transpositions to the lookup table (the "ngram index") so that these could easily be found, with the heuristic that "a frequent way of misspelling words is to transpose two adjacent letters". Yes, sounds like a good idea. Even though it increases the size of the lookup index, it still beats using the linear search... -- Best regards, Andrzej Bialecki - Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator - FreeBSD developer (http://www.freebsd.org) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Andrzej Bialecki wrote: Aad Nales wrote: David, Perhaps I misunderstand somehting so please correct me if I do. I used http://www.searchmorph.com/kat/spell.jsp to look for conts without changing any of the default values. What I got as results did not include 'const' which has quite a high frequency in your index and ??? how do you know that? Remember, this is an index of _Java_docs, and "const" is not a Java keyword. I added a line of output to the right column under the 'details' box. "const" appears 216 times in the index (out of 96k docs), thus it is indeed kinda rare. http://www.searchmorph.com/kat/spell.jsp?s=const should have a pretty low levenshtein distance. Any idea what causes this behavior? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Aad Nales wrote:
> By trying: if you type const you will find that it returns 216 hits. The third sports 'const' as a term (space separated and all). I would expect 'conts' to return const as well. But again I might be mistaken. I am now trying to figure out what the problem might be: 1. my expectations (most likely ;-) 2. something in the code..

Good question. If I use the form at the bottom of the page and ask for more results, the suggestion of "const" does eventually show up - 99th however(!).

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=2.0&bend=1.0

Even boosting the prefix match from 2.0 to 10.0 only changes the ranking a few slots.

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=10.0&bend=1.0

To restate the question for a second: the misspelled word is "conts". The suggestion expected is "const", which seems reasonable enough as it's just a transposition away, thus the string distance is low. But - I guess the problem w/ the algorithm is that for short words like this, with transpositions, the two words won't share many ngrams. Just looking at 3grams...

conts -> con ont nts
const -> con ons nst

Thus they share just one 3gram, which is why it scores so low. This is an interesting issue: how to tune the algorithm so that it returns words this close higher up. I guess one way is to add all simple transpositions to the lookup table (the "ngram index") so that these could easily be found, with the heuristic that "a frequent way of misspelling words is to transpose two adjacent letters". Based on other mails I'll make some additions to the code and will report back if anything of interest changes here.
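Storing "simple transpositions" just means indexing, for each word, every variant with two adjacent letters swapped; a sketch of generating them (how they are stored - for example as extra terms in the word's ngram document - is up to the indexer):

import java.util.ArrayList;
import java.util.List;

// All variants of 'word' with one pair of adjacent characters swapped,
// e.g. "const" -> [ocnst, cnost, cosnt, conts].
static List transpositions(String word) {
    List out = new ArrayList();
    char[] chars = word.toCharArray();
    for (int i = 0; i + 1 < chars.length; i++) {
        char tmp = chars[i];
        chars[i] = chars[i + 1];
        chars[i + 1] = tmp;
        out.add(new String(chars));
        // swap back before moving to the next position
        chars[i + 1] = chars[i];
        chars[i] = tmp;
    }
    return out;
}

For "const" this produces ocnst, cnost, cosnt and conts, so the exact-transposition case no longer depends on ngram overlap at all.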
RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
By trying: if you type const you will find that it returns 216 hits. The third sports 'const' as a term (space separated and all). I would expect 'conts' to return const as well. But again I might be mistaken. I am now trying to figure out what the problem might be: 1. my expectations (most likely ;-) 2. something in the code..

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Aad Nales wrote: David, Perhaps I misunderstand somehting so please correct me if I do. I used http://www.searchmorph.com/kat/spell.jsp to look for conts without changing any of the default values. What I got as results did not include 'const' which has quite a high frequency in your index and ??? how do you know that? Remember, this is an index of _Java_docs, and "const" is not a Java keyword. should have a pretty low levenshtein distance. Any idea what causes this behavior? -- Best regards, Andrzej Bialecki - Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator - FreeBSD developer (http://www.freebsd.org) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
David,

Perhaps I misunderstand something, so please correct me if I do. I used http://www.searchmorph.com/kat/spell.jsp to look for conts without changing any of the default values. What I got as results did not include 'const', which has quite a high frequency in your index and should have a pretty low Levenshtein distance. Any idea what causes this behavior?

cheers,
Aad

-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 14 September, 2004 21:23
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Re: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote: David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("recursive" and "descent") and suggests alternatives to "recursize"...thus if any term is in the index, regardless of frequency, it is left as-is. I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly ( as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternativies to these words too (in addition to the words in the query that are not in the index at all). Almost. If the user enters "a recursize purser", then: "a", which is in, say, >50% of the documents, is probably spelled correctly and "recursize", which is in zero documents, is probably mispelled. But what about "purser"? If we run the spell check algorithm on "purser" and generate "parser", should we show it to the user? If "purser" occurs in 1% of documents and "parser" occurs in 5%, then we probably should, since "parser" is a more common word than "purser". But if "parser" only occurs in 1% of the documents and purser occurs in 5%, then we probably shouldn't bother suggesting "parser". OK, sure, got it. I'll give it a think and try to add this option to my just submitted spelling code. If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does "purser" or "parser" occur more frequently near "descent". But that gets expensive. Yeah, expensive for a large scale search engine, but probably appropriate for a desktop engine. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: frequent terms - Re: combining open office spellchecker with Lucene
David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("recursive" and "descent") and suggests alternatives to "recursize"...thus if any term is in the index, regardless of frequency, it is left as-is. I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly ( as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternativies to these words too (in addition to the words in the query that are not in the index at all). Almost. If the user enters "a recursize purser", then: "a", which is in, say, >50% of the documents, is probably spelled correctly and "recursize", which is in zero documents, is probably mispelled. But what about "purser"? If we run the spell check algorithm on "purser" and generate "parser", should we show it to the user? If "purser" occurs in 1% of documents and "parser" occurs in 5%, then we probably should, since "parser" is a more common word than "purser". But if "parser" only occurs in 1% of the documents and purser occurs in 5%, then we probably shouldn't bother suggesting "parser". If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does "purser" or "parser" occur more frequently near "descent". But that gets expensive. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Andrzej Bialecki wrote: I was wondering about the way you build the n-gram queries. You basically don't care about their position in the input term. Originally I thought about using PhraseQuery with a slop - however, after checking the source of PhraseQuery I realized that this probably wouldn't be that fast... You use BooleanQuery and start/end boosts instead, which may give similar results in the end but much cheaper. Sloppy PhraseQuery's are slower than BooleanQueries, but not horribly slower. The problem is that they don't handle the case where phrase elements are missing altogether, while a BooleanQuery does. So what you really need is maybe a variation of a sloppy PhraseQuery that scores matches that do not contain all of the terms... Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
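For reference, the BooleanQuery variant being discussed is roughly the following: one optional TermQuery per ngram of the misspelled word, with the first and last grams boosted (the bstart/bend knobs in the demo URLs). This is a sketch; the gram field name is an assumption, not necessarily what NGramSpeller uses.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Build the lookup query for one misspelled word: every ngram is an
// optional clause, so words sharing only some ngrams still match.
static BooleanQuery ngramQuery(String gramField, String[] grams,
                               float bstart, float bend) {
    BooleanQuery q = new BooleanQuery();
    for (int i = 0; i < grams.length; i++) {
        TermQuery tq = new TermQuery(new Term(gramField, grams[i]));
        if (i == 0) {
            tq.setBoost(bstart);          // people misspell word starts less often
        } else if (i == grams.length - 1) {
            tq.setBoost(bend);
        }
        q.add(tq, false, false);          // optional, not prohibited (Lucene 1.4 API)
    }
    return q;
}

Unlike a sloppy PhraseQuery, this still scores candidates that are missing some of the input's ngrams, which is exactly the case a spelling corrector cares about.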
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Andrzej Bialecki wrote:

> David Spencer wrote:
> > > ...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too...
> >
> > Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word.
>
> The background for this suggestion was that I was playing some time ago with a Luke plugin that builds various sorts of ancillary indexes, but then I never finished it... Kudos for actually making it work ;-)

Sure, it was a fun little edge project. For the most part the code was done last week right after this thread appeared, but it always takes a while to get it from 95 to 100%.

> > [1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful.
>
> There have been many requests for this or similar functionality in the past, I believe it should go into the sandbox. I was wondering about the way you build the n-gram queries. You basically don't care about their position in the input term. Originally I thought about using PhraseQuery with a slop - however, after checking the source of PhraseQuery I realized that this probably wouldn't be that fast... You use BooleanQuery and start/end boosts instead, which may give similar results in the end but much cheaper. I also wonder how this algorithm would behave for smaller values of start/end lengths (e.g. 2, 3, 4). In a sense, the smaller the n-gram length, the more "fuzziness" you introduce, which may or may not be desirable (increased recall at the cost of precision - for small indexes this may be useful from the user's perspective because you will always get a plausible hit, for huge indexes it's a loss).

Sure, I'll try to rebuild the demo w/ lengths 2-5 (and then the query page can test any contiguous combo).

> > [2] Here's a demo page. I built an ngram index for ngrams of length 3 and 4 based on the existing index I have of approx 100k javadoc-generated pages. You type in a misspelled word like "recursixe" or whatnot to see what suggestions it returns. Note this is not a normal search index query -- rather this is a test page for spelling corrections. http://www.searchmorph.com/kat/spell.jsp
>
> Very nice demo!

Thanks, kinda designed for ngram-nerds if you know what I mean :)

> I bet it's running way faster than the linear search over terms :-), even though you have to build the index in advance. But if you work with static or mostly static indexes this doesn't matter.

Indeed, this is almost zero time, whereas the simple and dumb linear search was taking me 10 sec. I will have to redo the site's main search page so it uses this new code, TBD, prob tomorrow.

> > Based on a subsequent mail in this thread I set boosts for the words in the ngram index. The background is each word (er.. term for a given field) in the orig index is a separate Document in the ngram index. This Doc contains all ngrams (in my test case, like #2 above, of length 3 and 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so that more frequent words will tend to be suggested more often.
>
> You may want to experiment with 2 <= n <= 5. Some n-gram based techniques use all lengths together, some others use just a single length; results also vary depending on the language...

Yep, will do prob tomorrow.

> > I think in "plain" English then the way a word is suggested as a spelling correction is:
> > - frequently occurring words score higher
> > - words that share more ngrams with the orig word score higher
> > - words that share rare ngrams with the orig word score higher
>
> I think this is a reasonable heuristic. Reading the code I would present it this way:
> - words that share more ngrams with the orig word score higher, and words that share rare ngrams with the orig word score higher (as a natural consequence of using BooleanQuery),
> - and, frequently occurring words score higher (as a consequence of using per-Document boosts).

ok, thx, will update

> - from reading the source code I see that you use Levenshtein distance to prune the result set of too long/too short results; I think also that because you don't use the positional information about the input n-grams you may be getting some really weird hits.

Good point, though I haven't seen this yet. Might be due to the prefix boost and maybe some Markov chain magic tending to only show reasonable words.

> You could prune them by simply checking if you find a (threshold) of input ngrams in the right sequence in the found terms. This shouldn't be too costly because you operate on a small result set.

Good point, I'll try to add that in as an optional parameter.
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
David Spencer wrote: ...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too... Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. The background for this suggestion was that I was playing some time ago with a Luke plugin that builds various sorts of ancillary indexes, but then I never finished it... Kudos for actually making it work ;-) [1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful. There have been many requests for this or similar functionality in the past, I believe it should go into sandbox. I was wondering about the way you build the n-gram queries. You basically don't care about their position in the input term. Originally I thought about using PhraseQuery with a slop - however, after checking the source of PhraseQuery I realized that this probably wouldn't be that fast... You use BooleanQuery and start/end boosts instead, which may give similar results in the end but much cheaper. I also wonder how this algorithm would behave for smaller values of start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram length, the more "fuzziness" you introduce, which may or may not be desirable (increased recall at the cost of precision - for small indexes this may be useful from the user's perspective because you will always get a plausible hit, for huge indexes it's a loss). [2] Here's a demo page. I built an ngram index for ngrams of length 3 and 4 based on the existing index I have of approx 100k javadoc-generated pages. You type in a misspelled word like "recursixe" or whatnot to see what suggestions it returns. Note this is not a normal search index query -- rather this is a test page for spelling corrections. http://www.searchmorph.com/kat/spell.jsp Very nice demo! I bet it's running way faster than the linear search over terms :-), even though you have to build the index in advance. But if you work with static or mostly static indexes this doesn't matter. Based on a subsequent mail in this thread I set boosts for the words in the ngram index. The background is each word (er..term for a given field) in the orig index is a separate Document in the ngram index. This Doc contains all ngrams (in my test case, like #2 above, of length 3 and 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so that more frequently words will tend to be suggested more often. You may want to experiment with 2 <= n <= 5. Some n-gram based techniques use all lengths together, some others use just single length, results also vary depending on the language... I think in "plain" English then the way a word is suggested as a spelling correction is: - frequently occuring words score higher - words that share more ngrams with the orig word score higher - words that share rare ngrams with the orig word score higher I think this is a reasonable heuristics. 
Reading the code I would present it this way: - words that share more ngrams with the orig word score higher, and words that share rare ngrams with the orig word score higher (as a natural consequence of using BooleanQuery), - and, frequently occuring words score higher (as a consequence of using per-Document boosts), - from reading the source code I see that you use Levenshtein distance to prune the resultset of too long/too short results, I think also that because you don't use the positional information about the input n-grams you may be getting some really weird hits. You could prune them by simply checking if you find a (threshold) of input ngrams in the right sequence in the found terms. This shouldn't be too costly because you operate on a small result set. [6] If people want to vote me in as a committer to the sandbox then I can Well, someone needs to maintain the code after all... ;-) -- Best regards, Andrzej Bialecki - Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator - FreeBSD developer (http://www.freebsd.org) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
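Andrzej's pruning idea - require that some fraction of the input's ngrams occur in the candidate word in the same left-to-right order - could look like the sketch below; the threshold is an assumed tuning knob, not an existing NGramSpeller option.

// Fraction of the input word's ngrams that appear in the candidate in the
// same left-to-right order; reject candidates below a threshold (0..1).
// Cheap, because it only runs over the small result set of the ngram query.
static boolean inSequence(String[] inputGrams, String candidateWord, double threshold) {
    String padded = "_" + candidateWord + "_";   // same boundary padding used for the grams
    int found = 0;
    int from = 0;
    for (int i = 0; i < inputGrams.length; i++) {
        int pos = padded.indexOf(inputGrams[i], from);
        if (pos >= 0) {
            found++;
            from = pos + 1;                      // later grams must start further right
        }
    }
    return inputGrams.length == 0
        || (double) found / inputGrams.length >= threshold;
}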
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Tate Avery wrote:
> I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp
>
> T

How embarrassing! Sorry! Fixed!
RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp

T

-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 3:23 PM
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Andrzej Bialecki wrote:

> David Spencer wrote:
> > I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead look for terms that also match on the 1st "n" (prob 3) chars.
>
> ...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too...

Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. Let's see.

[1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 and 4 based on the existing index I have of approx 100k javadoc-generated pages. You type in a misspelled word like "recursixe" or whatnot to see what suggestions it returns. Note this is not a normal search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

[3] Here's the javadoc:

http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html

[4] Here's the source in HTML:

http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152

[5] A few more details: Based on a subsequent mail in this thread I set boosts for the words in the ngram index. The background is each word (er.. term for a given field) in the orig index is a separate Document in the ngram index. This Doc contains all ngrams (in my test case, like #2 above, of length 3 and 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so that more frequent words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

[6] If people want to vote me in as a committer to the sandbox then I can check this code in - though again, I'd appreciate feedback.

thx,
Dave

[Attachment: NGramSpeller.java (package org.apache.lucene.spell)]
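Point [5] above compresses to roughly one indexing step per word of the original index. The following is a sketch of the idea rather than the attached source; the field names and the pre-computed, space-separated gram strings are placeholders.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// One Document per word of the original index.  The word itself is stored
// for display; the ngrams are the searchable part; the boost makes frequent
// words float toward the top of the suggestions.
static void addWord(IndexWriter ngramWriter, String word,
                    String grams3, String grams4,
                    int wordDocFreq, int numDocs) throws java.io.IOException {
    Document doc = new Document();
    doc.add(Field.Keyword("word", word));          // stored as-is, not tokenized
    doc.add(Field.UnStored("gram3", grams3));      // e.g. "con ont nts"
    doc.add(Field.UnStored("gram4", grams4));      // e.g. "cont onts"
    if (wordDocFreq > 1 && numDocs > 1) {
        // boost = log(word_freq) / log(num_docs), as described in [5]
        doc.setBoost((float) (Math.log(wordDocFreq) / Math.log(numDocs)));
    }
    ngramWriter.addDocument(doc);
}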
RE: frequent terms - Re: combining open office spellchecker with Lucene
David Spencer wrote:

> Doug Cutting wrote:
> > I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words. Is that any better?
>
> Yes, sure, thx, I understand now - but maybe not - the context I was thinking of was something like this:
>
> [1] The user enters a query like: recursize descent parser
>
> [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("recursive" and "descent") and suggests alternatives to "recursize"... thus if any term is in the index, regardless of frequency, it is left as-is.

My idea is to first execute the query and only run the 'spell check' if the number of results is lower than a certain threshold. Secondly, I would like to use the 'stemming' functionality that MySpell offers for all stuff that is written to the index, together with the POS appearance. Thirdly I want to regularly scan the index for often-used words to be added to the list of 'approved' terms. This would serve another purpose of the customer, which is building a synonym index for Dutch words used in an educational context. But having read all the input I think that using the index itself for a first spellcheck is probably not a bad start.
Re: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:

> David Spencer wrote:
> > Doug Cutting wrote:
> > > And one should not try correction at all for terms which occur in a large proportion of the collection.
> >
> > I keep thinking this one over and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..).
>
> I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words. Is that any better?

Yes, sure, thx, I understand now - but maybe not - the context I was thinking of was something like this:

[1] The user enters a query like: recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("recursive" and "descent") and suggests alternatives to "recursize"... thus if any term is in the index, regardless of frequency, it is left as-is.

I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).
Re: frequent terms - Re: combining open office spellchecker with Lucene
David Spencer wrote: Doug Cutting wrote: And one should not try correction at all for terms which occur in a large proportion of the collection. I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..). I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words. Is that any better? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote: Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analyzer based on the spellchecker of OpenOffice. My question is: "has anybody tried this before?" Note that a spell checker used with a search engine should use collection frequency information. That's to say, only "corrections" which are more frequent in the collection than what the user entered should be displayed. Frequency information can also be used when constructing the checker. For example, one need never consider proposing terms that occur in very few documents. And one should not try correction at all for terms which occur in a large proportion of the collection. I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..). I know in other contexts of IR frequent terms are penalized but in this context it seems that frequent terms should be fine... -- Dave Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: combining open office spellchecker with Lucene
eks dev wrote:

Hi Doug,

"Perhaps. Are folks really better at spelling the beginning of words?"

Yes they are. There were some comprehensive empirical studies on this topic. The Winkler modification of the Jaro string distance is based on this assumption (boosting similarity if the first n, I think 4, chars match). Jaro-Winkler is well documented and some folks think that it is much more efficient and precise than plain edit distance (of course for normal language, not numbers or so). I will try to dig out some references from my disk on this topic, if you are interested.

Good ole Citeseer finds 2 docs that seem relevant:

http://citeseer.ist.psu.edu/cs?cs=1&q=Winkler+Jaro&submit=Documents&co=Citations&cm=50&cf=Any&ao=Citations&am=20&af=Any

I have some of the ngram spelling suggestion stuff, based on earlier msgs in this thread, working in my dev tree. I'll try to get a test site up later today for people to fool around with.

On another note, I would even suggest using Jaro-Winkler distance as the default for fuzzy query (one could configure max prefix required => prefix query to reduce the number of distance calculations). This could speed up fuzzy search dramatically.

Hope this was helpful, Eks

___ALL-NEW Yahoo! Messenger - all new features - even more fun! http://uk.messenger.yahoo.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: combining open office spellchecker with Lucene
Hi Doug,

> Perhaps. Are folks really better at spelling the
> beginning of words?

Yes they are. There were some comprehensive empirical studies on this topic. The Winkler modification of the Jaro string distance is based on this assumption (boosting similarity if the first n, I think 4, chars match). Jaro-Winkler is well documented and some folks think that it is much more efficient and precise than plain edit distance (of course for normal language, not numbers or so). I will try to dig out some references from my disk on this topic, if you are interested.

On another note, I would even suggest using Jaro-Winkler distance as the default for fuzzy query (one could configure max prefix required => prefix query to reduce the number of distance calculations). This could speed up fuzzy search dramatically.

Hope this was helpful, Eks

___ALL-NEW Yahoo! Messenger - all new features - even more fun! http://uk.messenger.yahoo.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
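For reference, a plain-Java sketch of the Jaro-Winkler measure described above, written from the published definition rather than from any Lucene code. The prefix cap of 4 and the 0.1 scaling factor are the parameters Winkler usually quotes; treat this as an illustration, not a vetted implementation:

public class JaroWinkler {

    // Jaro similarity plus Winkler's boost for a shared prefix of up to 4 chars.
    public static double jaroWinkler(String s1, String s2) {
        double jaro = jaro(s1, s2);
        int prefix = 0;
        int max = Math.min(4, Math.min(s1.length(), s2.length()));
        while (prefix < max && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    static double jaro(String s1, String s2) {
        int l1 = s1.length(), l2 = s2.length();
        if (l1 == 0 || l2 == 0) return 0.0;
        int window = Math.max(0, Math.max(l1, l2) / 2 - 1);
        boolean[] matched1 = new boolean[l1];
        boolean[] matched2 = new boolean[l2];
        int matches = 0;
        // Count characters that match within the sliding window.
        for (int i = 0; i < l1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(l2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count transpositions among the matched characters.
        int transpositions = 0;
        int k = 0;
        for (int i = 0; i < l1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) transpositions++;
            k++;
        }
        double t = transpositions / 2.0;
        double m = matches;
        return (m / l1 + m / l2 + (m - t) / m) / 3.0;
    }
}

With this, "recursize" vs "recursive" scores close to 1.0 while sharing the full 4-char prefix, which is exactly the behaviour Eks is pointing at.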
Re: combining open office spellchecker with Lucene
David Spencer wrote:

Good heuristics but are there any more precise, standard guidelines as to how to balance or combine what I think are the following possible criteria in suggesting a better choice:

Not that I know of.

- ignore (penalize?) terms that are rare

I think this one is easy to threshold: ignore matching terms that are rarer than the term entered.

- ignore (penalize?) terms that are common

This, in effect, falls out of the previous criterion. A term that is very common will not have any matching terms that are more common. As an optimization, you could avoid even looking for matching terms when a term is very common.

- terms that are closer (string distance) to the term entered are better

This is the meaty one.

- terms that start w/ the same 'n' chars as the user's term are better

Perhaps. Are folks really better at spelling the beginning of words?

Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: combining open office spellchecker with Lucene
Doug Cutting wrote:

Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analyzer based on the spellchecker of OpenOffice. My question is: "has anybody tried this before?"

Note that a spell checker used with a search engine should use collection frequency information. That's to say, only "corrections" which are more frequent in the collection than what the user entered should be displayed. Frequency information can also be used when constructing the checker. For example, one need never consider proposing terms that occur in very few documents. And one should not try correction at all for terms which occur in a large proportion of the collection.

Good heuristics but are there any more precise, standard guidelines as to how to balance or combine what I think are the following possible criteria in suggesting a better choice:

- ignore (penalize?) terms that are rare
- ignore (penalize?) terms that are common
- terms that are closer (string distance) to the term entered are better
- terms that start w/ the same 'n' chars as the user's term are better

Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: combining open office spellchecker with Lucene
Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analyzer based on the spellchecker of OpenOffice. My question is: "has anybody tried this before?" Note that a spell checker used with a search engine should use collection frequency information. That's to say, only "corrections" which are more frequent in the collection than what the user entered should be displayed. Frequency information can also be used when constructing the checker. For example, one need never consider proposing terms that occur in very few documents. And one should not try correction at all for terms which occur in a large proportion of the collection. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
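A small sketch of what those frequency heuristics might look like against the Lucene 1.4-era API. The field name, the rare-term floor, and the "very common" ratio below are invented for illustration and would need tuning on a real collection:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class FrequencyFilter {

    private static final int MIN_DOC_FREQ = 3;        // never propose terms this rare
    private static final double COMMON_RATIO = 0.01;  // skip correction for terms in >1% of docs

    // A very common term is almost certainly spelled as intended, so don't
    // even attempt to find similarly-spelled alternatives for it.
    public static boolean worthCorrecting(IndexReader reader, String field, String word)
            throws IOException {
        int df = reader.docFreq(new Term(field, word));
        return df < COMMON_RATIO * reader.numDocs();
    }

    // Only show corrections that are more frequent in the collection than what
    // the user typed, and ignore suggestions that barely occur at all.
    public static boolean acceptableSuggestion(IndexReader reader, String field,
                                               String entered, String suggestion)
            throws IOException {
        int dfSuggestion = reader.docFreq(new Term(field, suggestion));
        int dfEntered = reader.docFreq(new Term(field, entered));
        return dfSuggestion >= MIN_DOC_FREQ && dfSuggestion > dfEntered;
    }
}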
Re: combining open office spellchecker with Lucene
Andrzej Bialecki wrote:

David Spencer wrote: I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead look for terms that also match on the 1st "n" (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too...

Sounds interesting/fun but I'm not sure if I'm following exactly. Let's talk thru the trigram index case. Are you saying that for every trigram in every word there will be a mapping of trigram -> term? Thus if "recursive" is in the (orig) index then we'd create entries like:

rec -> recursive
ecu -> ...
cur -> ...
urs -> ...
rsi -> ...
siv -> ...
ive -> ...

And so on for all terms in the orig index. OK fine. But now the user types in a query like "recursivz". What's the algorithm - obviously I guess take all trigrams in the bad term and go thru the trigram index, but there will be lots of suggestions. Now what - use string distance to score them? I guess that makes sense - plz confirm if I understand.

And so I guess the point here is we precalculate the trigram->term mappings to avoid an expensive traversal of all terms in an index, but we still use string distance as a 2nd pass (and prob should force the matches to always match on the 1st n (3) chars, using the heuristic that people can usually start spelling a word correctly).

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
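To make the two-pass idea concrete, here is a rough sketch of the query-time half against the Lucene 1.4-era API. Everything here is assumed for illustration: the lookup-index schema (a "word" field holding the original term, a "gram3" field holding its whitespace-separated trigrams) is invented and is not NGramSpeller's actual layout, an OR'ed BooleanQuery stands in for the sloppy PhraseQuery suggested above, and the 100-candidate cap is arbitrary:

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class TrigramLookup {

    // Returns the single best-looking suggestion for a misspelled word, or null.
    public static String suggest(IndexSearcher lookupSearcher, String misspelled)
            throws IOException {
        if (misspelled.length() < 3) {
            return null; // too short to form a trigram
        }
        // First pass: one optional (OR) clause per trigram of the misspelled word.
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i + 3 <= misspelled.length(); i++) {
            Term gram = new Term("gram3", misspelled.substring(i, i + 3));
            query.add(new TermQuery(gram), false, false);
        }
        Hits hits = lookupSearcher.search(query);

        // Second pass: require the first 3 chars to match, then pick the candidate
        // with the smallest Levenshtein distance to the input.
        String prefix = misspelled.substring(0, 3);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        int n = Math.min(hits.length(), 100); // arbitrary cap on candidates to re-rank
        for (int i = 0; i < n; i++) {
            String word = hits.doc(i).get("word");
            if (word == null || !word.startsWith(prefix)) {
                continue;
            }
            int d = levenshtein(misspelled, word);
            if (d < bestDistance) {
                bestDistance = d;
                best = word;
            }
        }
        return best;
    }

    // Plain dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}

The indexing half would then simply write one lookup document per term in the original index: the term itself in the "word" field and its trigrams, separated by spaces, in the "gram3" field.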
Re: combining open office spellchecker with Lucene
David Spencer wrote: I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead look for terms that also match on the 1st "n" (prob 3) chars. ...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too... -- Best regards, Andrzej Bialecki - Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator - FreeBSD developer (http://www.freebsd.org) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: combining open office spellchecker with Lucene
Aad Nales wrote:

Hi All, Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analyzer based on the spellchecker of OpenOffice. My question is: "has anybody tried this before?"

I did a WordNet/synonym query expander. Search for "WordNet" on this page. Of interest: it stores the WordNet info in a separate Lucene index, since at its essence an index is just a database.

http://jakarta.apache.org/lucene/docs/lucene-sandbox/

Also, another variation is to instead spell-check based on what terms are in the index, not what an external dictionary says. I've done this on my experimental site searchmorph.com in a dumb/inefficient way. Here's an example:

http://www.searchmorph.com/kat/search.jsp?s=recursivz

After you click above it takes ~10sec as it produces terms close to "recursivz". Oops - looking at the output, it looks like the same word is suggested multiple times - ouch - I must be considering all fields, not just the contents field. Fixing this is TBD (or no wonder it's so slow :)).

I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead look for terms that also match on the 1st "n" (prob 3) chars.

Cheers, Aad -- Aad Nales [EMAIL PROTECTED], +31-(0)6 54 207 340

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
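A minimal sketch of that brute-force pass (walk every term, score each by Levenshtein distance, keep the closest), written against the Lucene 1.4-era API. The class name, the field restriction, and the cutoff are illustrative only -- this is not the actual searchmorph code -- and levenshtein() repeats the helper from the trigram sketch above so the class stands alone:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class BruteForceSuggest {

    // Walks every term of one field and returns the terms closest to the misspelled
    // word by edit distance. Deliberately naive -- this is the ~10 second version.
    public static String[] suggest(IndexReader reader, String field, String misspelled,
                                   int maxSuggestions) throws IOException {
        TreeMap byDistance = new TreeMap(); // distance -> terms at that distance
        TermEnum te = reader.terms(new Term(field, ""));
        try {
            do {
                Term t = te.term();
                if (t == null || !t.field().equals(field)) {
                    break; // stay on one field instead of scanning every field
                }
                Integer d = new Integer(levenshtein(misspelled, t.text()));
                List same = (List) byDistance.get(d);
                if (same == null) {
                    same = new ArrayList();
                    byDistance.put(d, same);
                }
                same.add(t.text());
            } while (te.next());
        } finally {
            te.close();
        }
        // Collect the closest terms first; may return a few more than maxSuggestions
        // when several terms tie at the same distance.
        List best = new ArrayList();
        for (Iterator it = byDistance.values().iterator();
             it.hasNext() && best.size() < maxSuggestions; ) {
            best.addAll((List) it.next());
        }
        return (String[]) best.toArray(new String[best.size()]);
    }

    // Plain dynamic-programming edit distance, as in the trigram sketch above.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}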
combining open office spellchecker with Lucene
Hi All, Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analyzer based on the spellchecker of OpenOffice. My question is: "has anybody tried this before?" Cheers, Aad -- Aad Nales [EMAIL PROTECTED], +31-(0)6 54 207 340 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]