RE: frequent terms - Re: combining open office spellchecker with Lucene
Also, you can use an alternative spellchecker for the 'checking' part and use the n-gram algorithm for the 'suggestion' part. Only if the spell 'check' declares a word illegal would the 'suggestion' part perform its magic.

cheers,
Aad

Doug Cutting wrote:
> David Spencer wrote:
>
>> [1] The user enters a query like:
>>     recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("descent" and "parser") and suggests alternatives to "recursize"...thus if any term is in the index, regardless of frequency, it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).
>
> Almost.
>
> If the user enters "a recursize purser", then: "a", which is in, say, >50% of the documents, is probably spelled correctly and "recursize", which is in zero documents, is probably misspelled. But what about "purser"? If we run the spell check algorithm on "purser" and generate "parser", should we show it to the user? If "purser" occurs in 1% of documents and "parser" occurs in 5%, then we probably should, since "parser" is a more common word than "purser". But if "parser" only occurs in 1% of the documents and "purser" occurs in 5%, then we probably shouldn't bother suggesting "parser".
>
> If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does "purser" or "parser" occur more frequently near "descent". But that gets expensive.
I updated the code to have an optional popularity filter - if true then it only returns matches more popular (frequent) than the word that is passed in for spelling correction.

If true (default) then for common words like "remove", no results are returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove

But if you set it to false (bottom slot in the form at the bottom of the page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0

TBD: I need to update the javadoc & repost the code I guess. Also, as per an earlier post, I store simple transpositions for words in the ngram-index.

-- Dave

> Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:
> David Spencer wrote:
>
>> [1] The user enters a query like:
>>     recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("descent" and "parser") and suggests alternatives to "recursize"...thus if any term is in the index, regardless of frequency, it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).
>
> Almost.
>
> If the user enters "a recursize purser", then: "a", which is in, say, >50% of the documents, is probably spelled correctly and "recursize", which is in zero documents, is probably misspelled. But what about "purser"? If we run the spell check algorithm on "purser" and generate "parser", should we show it to the user? If "purser" occurs in 1% of documents and "parser" occurs in 5%, then we probably should, since "parser" is a more common word than "purser". But if "parser" only occurs in 1% of the documents and "purser" occurs in 5%, then we probably shouldn't bother suggesting "parser".
>
> If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does "purser" or "parser" occur more frequently near "descent". But that gets expensive.

I updated the code to have an optional popularity filter - if true then it only returns matches more popular (frequent) than the word that is passed in for spelling correction.

If true (default) then for common words like "remove", no results are returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove

But if you set it to false (bottom slot in the form at the bottom of the page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0

TBD: I need to update the javadoc & repost the code I guess. Also, as per an earlier post, I store simple transpositions for words in the ngram-index.

-- Dave

> Doug
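For readers who have not seen the earlier post, here is a rough sketch of what "n-grams plus simple transpositions" could look like. It is only an illustration in Python, not the posted code: the function names are made up, and reading the `min=2&max=5` URL parameters as n-gram lengths is an assumption.

```python
def ngrams(word, min_n=2, max_n=5):
    """Return every character n-gram of `word` with min_n <= n <= max_n.

    Hypothetical helper: the 2..5 defaults mirror the min/max URL
    parameters above, assuming those are n-gram lengths.
    """
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams


def transpositions(word):
    """Return the variants of `word` with one adjacent pair of letters swapped."""
    variants = set()
    for i in range(len(word) - 1):
        if word[i] != word[i + 1]:  # swapping equal letters changes nothing
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return variants


print(ngrams("parser", 2, 3))
print(sorted(transpositions("the")))
```

Storing the transposed forms next to the n-grams presumably lets a typo such as "teh" reach "the" with a direct lookup instead of relying on n-gram overlap alone.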
Re: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:
> David Spencer wrote:
>
>> [1] The user enters a query like:
>>     recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("descent" and "parser") and suggests alternatives to "recursize"...thus if any term is in the index, regardless of frequency, it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).
>
> Almost.
>
> If the user enters "a recursize purser", then: "a", which is in, say, >50% of the documents, is probably spelled correctly and "recursize", which is in zero documents, is probably misspelled. But what about "purser"? If we run the spell check algorithm on "purser" and generate "parser", should we show it to the user? If "purser" occurs in 1% of documents and "parser" occurs in 5%, then we probably should, since "parser" is a more common word than "purser". But if "parser" only occurs in 1% of the documents and "purser" occurs in 5%, then we probably shouldn't bother suggesting "parser".

OK, sure, got it. I'll give it a think and try to add this option to my just-submitted spelling code.

> If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does "purser" or "parser" occur more frequently near "descent". But that gets expensive.

Yeah, expensive for a large-scale search engine, but probably appropriate for a desktop engine.

> Doug
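To make the "near" idea concrete, here is a small brute-force sketch in Python. Everything in it is an assumption for illustration: the documents are plain token lists, `near_count` is a made-up name, and the window size is arbitrary. A real engine would presumably use positional index data or a proximity query rather than scanning documents, and the scan is exactly where the cost comes from.

```python
def near_count(docs, term, context, window=3):
    """Count occurrences of `term` that have `context` within `window` tokens.

    `docs` is a list of token lists. This scans every document, which is
    why the approach gets expensive on a large collection.
    """
    count = 0
    for tokens in docs:
        context_positions = [i for i, t in enumerate(tokens) if t == context]
        for i, t in enumerate(tokens):
            if t == term and any(abs(i - j) <= window for j in context_positions):
                count += 1
    return count


# Toy collection, invented for the example.
docs = [
    "a recursive descent parser reads tokens".split(),
    "the parser uses recursive descent".split(),
    "the purser greeted passengers before the descent".split(),
]

print(near_count(docs, "parser", "descent"))
print(near_count(docs, "purser", "descent"))
```

On this toy collection "parser" turns up near "descent" more often than "purser" does, so "parser" would be the better suggestion for a query containing "descent".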
Re: frequent terms - Re: combining open office spellchecker with Lucene
David Spencer wrote:
> [1] The user enters a query like:
>     recursize descent parser
>
> [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("descent" and "parser") and suggests alternatives to "recursize"...thus if any term is in the index, regardless of frequency, it is left as-is.
>
> I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).

Almost.

If the user enters "a recursize purser", then: "a", which is in, say, >50% of the documents, is probably spelled correctly and "recursize", which is in zero documents, is probably misspelled. But what about "purser"? If we run the spell check algorithm on "purser" and generate "parser", should we show it to the user? If "purser" occurs in 1% of documents and "parser" occurs in 5%, then we probably should, since "parser" is a more common word than "purser". But if "parser" only occurs in 1% of the documents and "purser" occurs in 5%, then we probably shouldn't bother suggesting "parser".

If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does "purser" or "parser" occur more frequently near "descent". But that gets expensive.

Doug
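The purser/parser rule above comes down to one comparison. Here is a minimal Python sketch of it, assuming document frequencies are available as a plain dict of term -> fraction of documents; the function name and the numbers are hypothetical and only mirror the example.

```python
def filter_suggestions(term, candidates, doc_freq):
    """Keep only candidates that occur in more documents than `term` does.

    `doc_freq` maps a term to the fraction of documents containing it;
    terms missing from the map are treated as occurring in no documents.
    """
    term_df = doc_freq.get(term, 0.0)
    return [c for c in candidates if doc_freq.get(c, 0.0) > term_df]


# Case 1: "purser" in 1% of documents, "parser" in 5% -> suggest "parser".
print(filter_suggestions("purser", ["parser"], {"purser": 0.01, "parser": 0.05}))

# Case 2: frequencies reversed -> nothing worth suggesting.
print(filter_suggestions("purser", ["parser"], {"purser": 0.05, "parser": 0.01}))

# Case 3: "recursize" is in zero documents, so any indexed candidate passes.
print(filter_suggestions("recursize", ["recursive"], {"recursive": 0.02}))
```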
RE: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:
> David Spencer wrote:
>
>> Doug Cutting wrote:
>>
>>> And one should not try correction at all for terms which occur in a large proportion of the collection.
>>
>> I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..).
>
> I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words. Is that any better?

Yes, sure, thx, I understand now - but maybe not - the context I had in mind was something like this:

[1] The user enters a query like:
    recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("descent" and "parser") and suggests alternatives to "recursize"...thus if any term is in the index, regardless of frequency, it is left as-is.

My idea is to first execute the query and only execute the 'spell check' if the number of results is lower than a certain threshold. Secondly, I would like to use the 'stemming' functionality that MySpell offers for everything that is written to the index, together with the POS appearance. Thirdly, I want to regularly scan the index for often-used words to be added to the list of 'approved' terms. This would also serve another of the customer's goals, which is building a synonym index for Dutch words used in an educational context.

But having read all the input I think that using the index itself for a first spellcheck is probably not a bad start.

I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).

> Doug
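The "search first, spell check only on few results" idea could be wired up roughly as below. This is a sketch under assumptions: `search` and `suggest` are placeholder callables standing in for the real index search and spellchecker, and the threshold of 10 hits is an arbitrary example value.

```python
def search_with_fallback(query, search, suggest, min_hits=10):
    """Run the query; ask for spelling suggestions only if it found few hits.

    `search(query)` returns a list of hits and `suggest(query)` returns a
    list of alternative spellings. Both are caller-supplied placeholders.
    """
    hits = search(query)
    if len(hits) < min_hits:
        return hits, suggest(query)
    return hits, []


# Stub functions, invented for the example.
def fake_search(query):
    return ["doc%d" % i for i in range(20)] if query == "parser" else []


def fake_suggest(query):
    return ["recursive"] if query == "recursize" else []


print(search_with_fallback("recursize", fake_search, fake_suggest))
print(len(search_with_fallback("parser", fake_search, fake_suggest)[0]))
```

The spellchecker is never called for queries that already return enough results, which keeps the common case cheap.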
Re: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:
> David Spencer wrote:
>
>> Doug Cutting wrote:
>>
>>> And one should not try correction at all for terms which occur in a large proportion of the collection.
>>
>> I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..).
>
> I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words. Is that any better?

Yes, sure, thx, I understand now - but maybe not - the context I had in mind was something like this:

[1] The user enters a query like:
    recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("descent" and "parser") and suggests alternatives to "recursize"...thus if any term is in the index, regardless of frequency, it is left as-is.

I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).

> Doug
Re: frequent terms - Re: combining open office spellchecker with Lucene
David Spencer wrote:
> Doug Cutting wrote:
>
>> And one should not try correction at all for terms which occur in a large proportion of the collection.
>
> I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..).

I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words. Is that any better?

Doug
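The clarified rule is a gate on the user's own term, not on the suggestions. A minimal Python sketch, assuming per-term document counts in a dict; the 50% cutoff is a made-up example value and the function name is hypothetical.

```python
def needs_spell_check(term, doc_counts, num_docs, common_ratio=0.5):
    """Return False when `term` is common enough to skip correction.

    `doc_counts` maps a term to the number of documents containing it.
    `common_ratio` is an assumed cutoff; tune it for the collection.
    """
    return doc_counts.get(term, 0) / num_docs < common_ratio


doc_counts = {"a": 60, "purser": 1}  # invented counts out of 100 documents
print(needs_spell_check("a", doc_counts, 100))          # very common term
print(needs_spell_check("purser", doc_counts, 100))     # rare term
print(needs_spell_check("recursize", doc_counts, 100))  # not in the index
```

Only terms that pass this gate go on to the (more costly) search for similarly-spelled words; very common words can still show up as suggestions for other terms.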
frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:
> Aad Nales wrote:
>
>> Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analyzer based on the spellchecker of OpenOffice. My question is: "has anybody tried this before?"
>
> Note that a spell checker used with a search engine should use collection frequency information. That's to say, only "corrections" which are more frequent in the collection than what the user entered should be displayed. Frequency information can also be used when constructing the checker. For example, one need never consider proposing terms that occur in very few documents. And one should not try correction at all for terms which occur in a large proportion of the collection.

I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..).

I know in other contexts of IR frequent terms are penalized but in this context it seems that frequent terms should be fine...

-- Dave

> Doug
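The point about using frequency "when constructing the checker" can be sketched as a filter applied while the suggestion dictionary is built. This is an illustration in Python under assumptions: the function name, the dict of document counts, and the `min_docs=2` cutoff are all made up for the example.

```python
def build_dictionary(doc_counts, min_docs=2):
    """Return the terms eligible to be proposed as corrections.

    Terms that occur in fewer than `min_docs` documents are left out, on
    the theory that very rare terms are often misspellings themselves.
    """
    return {term for term, count in doc_counts.items() if count >= min_docs}


doc_counts = {"parser": 50, "purser": 10, "recursize": 1}  # invented counts
print(sorted(build_dictionary(doc_counts)))
```

With these counts "recursize" never enters the dictionary, so it can never be offered as a correction for something else.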