Re: spellcheck.onlyMorePopular
I know your issue has already been addressed but you may want to consider gran being a synonym for grand and then analyzing it as such.

~ David Smiley

Marcus Stratmann wrote:
> Hello,
>
> I have another question concerning the spell checking mechanism.
> Setting onlyMorePopular=true and using the parameters
>
>   spellcheck=true&spellcheck.q=gran&q=gran&spellcheck.onlyMorePopular=true
>
> I get the result
>
>   <lst name="spellcheck">
>     <lst name="suggestions">
>       <lst name="gran">
>         <int name="numFound">1</int>
>         <int name="startOffset">0</int>
>         <int name="endOffset">4</int>
>         <int name="origFreq">13</int>
>         <lst name="suggestion">
>           <int name="frequency">32</int>
>           <str name="word">grand</str>
>         </lst>
>       </lst>
>       <bool name="correctlySpelled">true</bool>
>     </lst>
>   </lst>
>
> which is okay. But when I turn off onlyMorePopular
>
>   spellcheck=true&spellcheck.q=gran&q=gran&spellcheck.onlyMorePopular=false
>
> the output is
>
>   <lst name="spellcheck">
>     <lst name="suggestions"/>
>   </lst>
>
> I was expecting to get *more* results when I turn off onlyMorePopular and to get all of the results contained in the result without onlyMorePopular (grand) plus some more. Instead I get no spell check results at all. Why is that?
>
> Thanks,
> Marcus

--
View this message in context: http://www.nabble.com/spellcheck.onlyMorePopular-tp21975735p22761717.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: spellcheck.onlyMorePopular
Shalin Shekhar Mangar wrote:
> The implementation is a bit more complicated.
>
> 1. Read all tokens from the specified field in the Solr index.
> 2. Create n-grams of the terms read in #1 and index them into a separate Lucene index (the spellcheck index).
> 3. When asked for suggestions, create n-grams of the query terms, search the spellcheck index and collect the top (by Lucene score) 10*spellcheck.count results.
> 4. If onlyMorePopular=true, determine the frequency of each result in the Solr index and remove terms which have a lower frequency.
> 5. Compute the edit distance between the result and the query token.
> 6. Return the top spellcheck.count results (sorted by edit distance descending) which are greater than the specified accuracy.

Thanks, I think this makes things clear(er) now. I do agree that the documentation needs improvement on this point, as you said later in this thread. :)

> Your primary use-case is not spellcheck at all but this might work with some hacking. Fuzzy queries may be a better solution as Walter said. Storing all successful search queries may be hard to scale.

This is certainly true. The drawback of fuzzy searching is that you get back exact and fuzzy hits together in one result set (correct me if I'm wrong). One could filter out the exact/fuzzy hits but this would make paging impossible.

The approach using KeywordTokenizer that you suggested before seems more promising to me. Unfortunately there seems to be no documentation for this (at least in conjunction with spell checking). If I understand this correctly, the tokenizer must be applied to the field in the search index (not the spell checking index). Is that correct?

Thanks,
Marcus
Re: spellcheck.onlyMorePopular
On Sun, Feb 15, 2009 at 8:56 AM, Mark Miller markrmil...@gmail.com wrote:
> I think that's the problem with it. People do think of it this way, and it ends up being very confusing.
>
> If you don't use onlyMorePopular, and you ask for suggestions for a word that happens to be in the index, you get the word back. So if I ask for corrections to Lucene, and it's in the index, it suggests Lucene. This is nice for multi term suggestions, because for mrk lucene it might suggest mark lucene.
>
> Now say I want to toggle onlyMorePopular to add frequency into the mix - my expectation is that, perhaps now I will get the suggestion mork lucene if mork has a higher freq than mark.
>
> But I will get maybe mork luke instead, because I am guaranteed not to get Lucene as a suggestion if onlyMorePopular is on.

onlyMorePopular=true considers tokens whose frequency is greater than or equal to the frequency of the original token. So you may still get Lucene as a suggestion.

> Personally I think it all ends up being pretty counter intuitive, especially when asking for suggestions for multiple terms. You start getting suggestions for alternate spellings no matter what - Lucene could be in the index a billion times, it will still suggest something else. But with onlyMorePopular off, it will throw back Lucene. You can deal with it if you know what's up, but as we have seen from all the questions on this, it's not easy to understand why things change like that.

I agree that it is confusing. Do you have any suggestions on ways to fix this? More/better documentation, changes in behavior, changing the name of the 'onlyMorePopular' parameter, etc.?

--
Regards,
Shalin Shekhar Mangar.
Re: spellcheck.onlyMorePopular
Shalin Shekhar Mangar wrote:
> On Sun, Feb 15, 2009 at 8:56 AM, Mark Miller markrmil...@gmail.com wrote:
>> I think that's the problem with it. People do think of it this way, and it ends up being very confusing.
>>
>> If you don't use onlyMorePopular, and you ask for suggestions for a word that happens to be in the index, you get the word back. So if I ask for corrections to Lucene, and it's in the index, it suggests Lucene. This is nice for multi term suggestions, because for mrk lucene it might suggest mark lucene.
>>
>> Now say I want to toggle onlyMorePopular to add frequency into the mix - my expectation is that, perhaps now I will get the suggestion mork lucene if mork has a higher freq than mark.
>>
>> But I will get maybe mork luke instead, because I am guaranteed not to get Lucene as a suggestion if onlyMorePopular is on.
>
> onlyMorePopular=true considers tokens whose frequency is greater than or equal to the frequency of the original token. So you may still get Lucene as a suggestion.

Is that the only difference? When I look at the code (I'm new to this area of the code, so I certainly could be wrong, wouldn't be the first time, or less than the 100,000th probably), I see:

    // if the word exists in the real index and we don't care for word frequency, return the word itself
    if (!morePopular && freq > 0) {
      return new String[] { word };
    }

So if you have onlyMorePopular=false, Lucene will get Lucene if it's in the index.

But if we make it past that line (onlyMorePopular=true), later there is:

    // don't suggest a word for itself, that would be silly
    if (sugWord.string.equals(word)) {
      continue;
    }

So you end up getting all of the suggestions *but* Lucene, right? You had to already know the word was misspelled, and now you're asking for a better one. With onlyMorePopular=false, you only get a correction if the word is misspelled.

It seems to me, if you are trying to use the suggested query that's built up, you change the behavior beyond just:

> onlyMorePopular=true considers tokens whose frequency is greater than or equal to the frequency of the original token.

- Mark
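The interaction of the two checks Mark quotes can be sketched in isolation. This is a minimal sketch, not the Lucene source: the class name, the `indexFreq` map and the pre-computed `candidates` list are assumptions made for illustration, and the "greater than or equal" frequency rule is taken from Shalin's description in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MorePopularBranches {
    /**
     * Hypothetical simplification of the two checks quoted in the thread.
     * indexFreq maps each indexed term to its frequency; candidates stands
     * in for whatever the spellcheck index lookup returned.
     */
    static List<String> suggest(String word, boolean morePopular,
                                Map<String, Integer> indexFreq,
                                List<String> candidates) {
        int freq = indexFreq.getOrDefault(word, 0);
        // The word exists and frequency is ignored: return the word itself.
        if (!morePopular && freq > 0) {
            return List.of(word);
        }
        List<String> out = new ArrayList<>();
        for (String cand : candidates) {
            // Never suggest a word for itself.
            if (cand.equals(word)) {
                continue;
            }
            // Assumption from the thread: with morePopular on, keep only
            // candidates at least as frequent as the query word.
            if (morePopular && indexFreq.getOrDefault(cand, 0) < freq) {
                continue;
            }
            out.add(cand);
        }
        return out;
    }
}
```

With gran at frequency 13 and grand at 32 (the values from Marcus's response), `morePopular=true` yields grand, while `morePopular=false` hands back gran itself, so there is nothing left to report as a correction.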
Re: spellcheck.onlyMorePopular
On Sun, Feb 15, 2009 at 10:00 PM, Mark Miller markrmil...@gmail.com wrote:
> But if we make it past that line (onlyMorePopular=true), later there is:
>
>     // don't suggest a word for itself, that would be silly
>     if (sugWord.string.equals(word)) {
>       continue;
>     }
>
> So you end up getting all of the suggestions *but* Lucene, right? You had to already know the word was misspelled, and now you're asking for a better one. With onlyMorePopular=false, you only get a correction if the word is misspelled.

Yes of course, you are right, one would never get Lucene back if onlyMorePopular=true.

> It seems to me, if you are trying to use the suggested query that's built up, you change the behavior beyond just:
>
> onlyMorePopular=true considers tokens whose frequency is greater than or equal to the frequency of the original token.

We definitely need better documentation for this option.

--
Regards,
Shalin Shekhar Mangar.
Re: spellcheck.onlyMorePopular
Grant Ingersoll wrote:
> I believe the reason is b/c when onlyMP is false, if the word itself is already in the index, it short circuits out. When onlyMP is true, it checks to see if there are more frequently occurring variations.

This would mean that onlyMorePopular=false isn't useful at all. If the word is in the index, it would not find less frequent words, and if it is not in the index, onlyMorePopular=false isn't useful since there are no less popular words.

So if you are right, this is a bug, isn't it?

Thanks,
Marcus
Re: spellcheck.onlyMorePopular
Shalin Shekhar Mangar wrote:
> The end goal is to give spelling suggestions. Even if it gave less frequently occurring spelling suggestions, what would you do with it?

To give you an example: We have an index for computer games. One title is gran turismo. The word gran is less frequent in the index than grand. So if someone searches for grand turismo, gran will not be suggested.

And to come back to my last question: There seems to be no case in which onlyMorePopular=false makes sense (provided Grant's assumption is correct). Do you see one?

Thanks,
Marcus
Re: spellcheck.onlyMorePopular
On Fri, Feb 13, 2009 at 2:51 PM, Marcus Stratmann stratm...@gmx.de wrote:
> Shalin Shekhar Mangar wrote:
>> The end goal is to give spelling suggestions. Even if it gave less frequently occurring spelling suggestions, what would you do with it?
>
> To give you an example: We have an index for computer games. One title is gran turismo. The word gran is less frequent in the index than grand. So if someone searches for grand turismo, gran will not be suggested.

Unless I'm misunderstanding something, you need phrase suggestions and not individual suggestions. I mean that you need suggestions for gran turismo and not for gran and turismo separately. Did you try using KeywordTokenizer for this spell check field?

> And to come back to my last question: There seems to be no case in which onlyMorePopular=false makes sense (provided Grant's assumption is correct). Do you see one?

Here's a use-case -- you provide a mis-spelled word and you want the closest suggestion by edit distance (frequency does not matter).

--
Regards,
Shalin Shekhar Mangar.
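The difference between per-word and whole-phrase dictionary entries can be illustrated without any Lucene classes. The sketch below only mimics the effect of the two tokenization styles; the class and method names are hypothetical and this is not the KeywordTokenizer API.

```java
import java.util.Arrays;
import java.util.List;

class TokenizationSketch {
    /** Splits a title into separate terms, so each word is spell checked on its own. */
    static List<String> whitespaceTokens(String title) {
        return Arrays.asList(title.trim().split("\\s+"));
    }

    /** Keeps the whole title as one term, which is the effect a keyword-style tokenizer has. */
    static List<String> keywordToken(String title) {
        return List.of(title.trim());
    }
}
```

With whole-title terms, the dictionary would contain gran turismo as a single entry, so a query such as grand turismo is compared against the phrase rather than against grand and turismo individually.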
Re: spellcheck.onlyMorePopular
Shalin Shekhar Mangar wrote:
>> And to come back to my last question: There seems to be no case in which onlyMorePopular=false makes sense (provided Grant's assumption is correct). Do you see one?
>
> Here's a use-case -- you provide a mis-spelled word and you want the closest suggestion by edit distance (frequency does not matter).

Hm, when I try searching for grand using onlyMorePopular=false I do not get any results. Same when trying gran. It seems that there will be no results at all when using onlyMorePopular=false. Without onlyMorePopular there are suggestions for both terms, so there are suggestions close enough to the original word(s). Have you tested your example case?

Anyway, if you look at it from the user's point of view: The wiki says

  spellcheck.onlyMorePopular -- Only return suggestions that result in more hits for the query than the existing query.

This implies that if onlyMorePopular=false I will also get results with fewer hits. So when I'm checking grand I would expect to get the suggestion gran, which is less frequent in the index. But it seems this is not the case.

But even if just the documentation is wrong or unclear:

1) I could not find a case in which onlyMorePopular=false works at all.
2) It would be nice if one could get suggestions with lower frequency than the checked word (which is, to me, what onlyMorePopular=false implies).

Thanks,
Marcus
Re: spellcheck.onlyMorePopular
On Fri, Feb 13, 2009 at 5:05 PM, Marcus Stratmann stratm...@gmx.de wrote:
> Hm, when I try searching for grand using onlyMorePopular=false I do not get any results. Same when trying gran. It seems that there will be no results at all when using onlyMorePopular=false.

When onlyMorePopular is false and the word you searched for exists in the index, it is returned as-is. Therefore if gran and grand are both present in the index, they will be returned as-is.

> Without onlyMorePopular there are suggestions for both terms, so there are suggestions close enough to the original word(s). Have you tested your example case?

I am confused by this. Did you mean "With onlyMorePopular=true there are suggestions for both terms"?

> Anyway, if you look at it from the user's point of view: The wiki says
>
>   spellcheck.onlyMorePopular -- Only return suggestions that result in more hits for the query than the existing query.
>
> This implies that if onlyMorePopular=false I will also get results with fewer hits. So when I'm checking grand I would expect to get the suggestion gran, which is less frequent in the index. But it seems this is not the case.

If onlyMorePopular=true, then the algorithm finds tokens which have greater frequency than the searched term. Among these terms, the one which is closest (by edit distance) is returned.

I think I now understand the source of the confusion. onlyMorePopular=true is a special behavior which uses *only* those tokens which have higher frequency than the searched term. onlyMorePopular=false just switches off this special behavior. It does *not* limit suggestions to tokens which have lesser frequency than the searched term. In fact, onlyMorePopular=false does not use the frequency of tokens at all. We should document this clearly to avoid such confusion in the future.

> 2) It would be nice if one could get suggestions with lower frequency than the checked word (which is, to me, what onlyMorePopular=false implies).

We could enhance the spell checker to do that. But can you please explain your use-case for limiting suggestions to tokens which have lesser frequency? The goal of the spell checker is to give suggestions for wrongly spelled words. It was neither designed nor intended to give any other sort of query suggestions.

--
Regards,
Shalin Shekhar Mangar.
Re: spellcheck.onlyMorePopular
Shalin Shekhar Mangar wrote:
> If onlyMorePopular=true, then the algorithm finds tokens which have greater frequency than the searched term. Among these terms, the one which is closest (by edit distance) is returned.

Okay, this is a bit weird, but I think I got it now. Let me try to explain it using my example. When I search for gran (frequency 10) I get the suggestion grand (frequency 17) when using onlyMorePopular=true. When I use onlyMorePopular=false there are no suggestions at all. This is because there are some (rare) terms which are closer to gran than grand, but none of them are considered, because their frequency is below 10. Is that correct? But then, why isn't grand promoted to first place and returned as a valid suggestion?

> I think I now understand the source of the confusion. onlyMorePopular=true is a special behavior which uses *only* those tokens which have higher frequency than the searched term. onlyMorePopular=false just switches off this special behavior. It does *not* limit suggestions to tokens which have lesser frequency than the searched term. In fact, onlyMorePopular=false does not use the frequency of tokens at all. We should document this clearly to avoid such confusion in the future.

I'm still missing the two parameters accuracy and spellcheck.count. Let me try to explain how I (now) think the algorithm works:

1) Take all terms from the index as a basic set.
2) If onlyMorePopular=true, remove all terms from the basic set which have a frequency below the frequency of the search term.
3) Sort the basic set by distance to the search term and keep the spellcheck.count terms with the smallest distance which are within accuracy.
4) Remove terms which have a lower frequency than the search term in the case onlyMorePopular=false.
5) Return the remaining terms as suggestions.

Point 3 would explain why I do not get any suggestions for gran with onlyMorePopular=false. Nevertheless I think this is a bug, since point 3 should take the frequency into account as well and promote suggestions with high enough frequency if suggestions with low frequency are deleted.

But this is just my assumption about how the algorithm works, which would explain why there are no suggestions using onlyMorePopular=false. Maybe I am wrong, but somewhere in the process grand is deleted from the result set.

>> 2) It would be nice if one could get suggestions with lower frequency than the checked word (which is, to me, what onlyMorePopular=false implies).
>
> We could enhance the spell checker to do that. But can you please explain your use-case for limiting suggestions to tokens which have lesser frequency? The goal of the spell checker is to give suggestions for wrongly spelled words. It was neither designed nor intended to give any other sort of query suggestions.

An example would be the mentioned grand turismo (note that in the example above I was searching for gran whereas now I am searching for grand). gran would not be returned as a suggestion because grand is more frequent in the index.

And yes, I know, returning a suggestion in this case will only be useful if there is more than one word in the search term. You proposed using KeywordTokenizer for this case, but a) I (again) was not able to find any documentation for this and b) we are working on a different solution for this case using stored search queries. If you are interested, it works like this: For every word in the query, get some spell checking suggestions. Combine these and find out if any of these combinations has been searched for (successfully) before. Propose the one with the highest (search) frequency. It looks promising so far, but the gran turismo example won't work, since there are too many grands in the index.

Thanks,
Marcus
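The stored-query approach Marcus outlines (per-word suggestions, combined, then checked against previously successful searches) might look roughly like the following. This is a minimal sketch under stated assumptions: the class name, the shape of the inputs and the query log as a simple map from query string to search count are all hypothetical, and the per-word alternatives are assumed to have been collected already.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class StoredQuerySuggester {
    /** Builds every phrase that picks one alternative per query word. */
    static List<String> combinations(List<List<String>> alternatives) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (List<String> options : alternatives) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                for (String option : options) {
                    next.add(prefix.isEmpty() ? option : prefix + " " + option);
                }
            }
            result = next;
        }
        return result;
    }

    /** Proposes the combination that was searched successfully most often, if any. */
    static Optional<String> propose(List<List<String>> alternatives,
                                    Map<String, Integer> successfulQueries) {
        String best = null;
        int bestCount = 0;
        for (String candidate : combinations(alternatives)) {
            int count = successfulQueries.getOrDefault(candidate, 0);
            if (count > bestCount) {
                best = candidate;
                bestCount = count;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Note that the number of combinations grows multiplicatively with the number of words and alternatives, and that the sketch only helps for grand turismo if gran is among the alternatives for grand in the first place, which is exactly the gap described above.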
Re: spellcheck.onlyMorePopular
Fuzzy search should match grand turismo to gran turismo without using spelling suggestions.

At Netflix, the first hit for the query grand turismo is the movie Gran Torino and we use fuzzy with Solr.

wunder

On 2/13/09 3:35 AM, Marcus Stratmann stratm...@gmx.de wrote:
> Shalin Shekhar Mangar wrote:
>>> And to come back to my last question: There seems to be no case in which onlyMorePopular=false makes sense (provided Grant's assumption is correct). Do you see one?
>>
>> Here's a use-case -- you provide a mis-spelled word and you want the closest suggestion by edit distance (frequency does not matter).
>
> Hm, when I try searching for grand using onlyMorePopular=false I do not get any results. Same when trying gran. It seems that there will be no results at all when using onlyMorePopular=false. Without onlyMorePopular there are suggestions for both terms, so there are suggestions close enough to the original word(s). Have you tested your example case?
>
> Anyway, if you look at it from the user's point of view: The wiki says
>
>   spellcheck.onlyMorePopular -- Only return suggestions that result in more hits for the query than the existing query.
>
> This implies that if onlyMorePopular=false I will also get results with fewer hits. So when I'm checking grand I would expect to get the suggestion gran, which is less frequent in the index. But it seems this is not the case.
>
> But even if just the documentation is wrong or unclear:
>
> 1) I could not find a case in which onlyMorePopular=false works at all.
> 2) It would be nice if one could get suggestions with lower frequency than the checked word (which is, to me, what onlyMorePopular=false implies).
>
> Thanks,
> Marcus
Re: spellcheck.onlyMorePopular
On Fri, Feb 13, 2009 at 8:46 PM, Marcus Stratmann stratm...@gmx.de wrote:
> Okay, this is a bit weird, but I think I got it now. Let me try to explain it using my example. When I search for gran (frequency 10) I get the suggestion grand (frequency 17) when using onlyMorePopular=true. When I use onlyMorePopular=false there are no suggestions at all. This is because there are some (rare) terms which are closer to gran than grand, but none of them are considered, because their frequency is below 10. Is that correct?

No. Think of onlyMorePopular as a toggle for whether to consider frequency or not. When you say onlyMorePopular=true, higher frequency terms are considered. When you say onlyMorePopular=false, frequency plays no role at all and gran is returned because, according to the spell checker, it exists in the index and is therefore a correctly spelled term.

> I'm still missing the two parameters accuracy and spellcheck.count. Let me try to explain how I (now) think the algorithm works:
>
> 1) Take all terms from the index as a basic set.
> 2) If onlyMorePopular=true, remove all terms from the basic set which have a frequency below the frequency of the search term.
> 3) Sort the basic set by distance to the search term and keep the spellcheck.count terms with the smallest distance which are within accuracy.
> 4) Remove terms which have a lower frequency than the search term in the case onlyMorePopular=false.
> 5) Return the remaining terms as suggestions.
>
> Point 3 would explain why I do not get any suggestions for gran with onlyMorePopular=false. Nevertheless I think this is a bug, since point 3 should take the frequency into account as well and promote suggestions with high enough frequency if suggestions with low frequency are deleted.
>
> But this is just my assumption about how the algorithm works, which would explain why there are no suggestions using onlyMorePopular=false. Maybe I am wrong, but somewhere in the process grand is deleted from the result set.

Point #4 is incorrect. As I said earlier, when onlyMorePopular=false, frequency information is not used and there is no filtering of tokens with respect to frequency.

The implementation is a bit more complicated.

1. Read all tokens from the specified field in the Solr index.
2. Create n-grams of the terms read in #1 and index them into a separate Lucene index (the spellcheck index).
3. When asked for suggestions, create n-grams of the query terms, search the spellcheck index and collect the top (by Lucene score) 10*spellcheck.count results.
4. If onlyMorePopular=true, determine the frequency of each result in the Solr index and remove terms which have a lower frequency.
5. Compute the edit distance between the result and the query token.
6. Return the top spellcheck.count results (sorted by edit distance descending) which are greater than the specified accuracy.

> An example would be the mentioned grand turismo (note that in the example above I was searching for gran whereas now I am searching for grand). gran would not be returned as a suggestion because grand is more frequent in the index.
>
> And yes, I know, returning a suggestion in this case will only be useful if there is more than one word in the search term. You proposed using KeywordTokenizer for this case, but a) I (again) was not able to find any documentation for this and b) we are working on a different solution for this case using stored search queries. If you are interested, it works like this: For every word in the query, get some spell checking suggestions. Combine these and find out if any of these combinations has been searched for (successfully) before. Propose the one with the highest (search) frequency. It looks promising so far, but the gran turismo example won't work, since there are too many grands in the index.

Your primary use-case is not spellcheck at all but this might work with some hacking. Fuzzy queries may be a better solution as Walter said. Storing all successful search queries may be hard to scale.

--
Regards,
Shalin Shekhar Mangar.
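The six steps listed in this message can be approximated in plain Java, which may help in seeing where spellcheck.count and accuracy enter. This is a rough sketch under several assumptions, not the Lucene or Solr implementation: bigram overlap stands in for the Lucene score of the n-gram search, the "edit distance" is treated as a similarity in [0, 1] where higher means closer (which is how "greater than the specified accuracy" reads), the query word itself is skipped, and all class, method and parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SuggestPipelineSketch {
    /** Character n-grams of a term (steps 2 and 3). */
    static Set<String> ngrams(String term, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    /** Number of bigrams a term shares with the query; a stand-in for the Lucene score. */
    static int overlap(Set<String> queryGrams, String term) {
        Set<String> shared = new HashSet<>(ngrams(term, 2));
        shared.retainAll(queryGrams);
        return shared.size();
    }

    /** Plain Levenshtein edit distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; 1.0 means identical strings (step 5, as assumed here). */
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / max;
    }

    static List<String> suggest(String word, Map<String, Integer> termFreq,
                                boolean onlyMorePopular, int count, double accuracy) {
        Set<String> queryGrams = ngrams(word, 2);
        // Step 3: collect terms sharing at least one bigram, best overlap first.
        List<String> candidates = new ArrayList<>();
        for (String term : termFreq.keySet()) {
            if (!term.equals(word) && overlap(queryGrams, term) > 0) candidates.add(term);
        }
        candidates.sort((x, y) -> {
            int byOverlap = Integer.compare(overlap(queryGrams, y), overlap(queryGrams, x));
            return byOverlap != 0 ? byOverlap : x.compareTo(y);
        });
        List<String> top = candidates.subList(0, Math.min(candidates.size(), 10 * count));

        int wordFreq = termFreq.getOrDefault(word, 0);
        List<String> kept = new ArrayList<>();
        for (String cand : top) {
            // Step 4: frequency is consulted only when onlyMorePopular is set.
            if (onlyMorePopular && termFreq.get(cand) < wordFreq) continue;
            // Steps 5 and 6: keep candidates whose similarity exceeds the accuracy.
            if (similarity(word, cand) <= accuracy) continue;
            kept.add(cand);
        }
        // Step 6: closest first, ties broken alphabetically, then cut to count.
        kept.sort((x, y) -> {
            int bySim = Double.compare(similarity(word, y), similarity(word, x));
            return bySim != 0 ? bySim : x.compareTo(y);
        });
        return new ArrayList<>(kept.subList(0, Math.min(kept.size(), count)));
    }
}
```

The sketch deliberately leaves out the early return for words that already exist in the index, so it shows only the candidate pipeline; in the behavior discussed in this thread, that early return is what makes an indexed word such as gran produce no suggestions when onlyMorePopular=false.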
Re: spellcheck.onlyMorePopular
I believe the reason is b/c when onlyMP is false, if the word itself is already in the index, it short circuits out. When onlyMP is true, it checks to see if there are more frequently occurring variations. However, I don't have the code in front of me at the moment, so I can't verify.

-Grant

On Feb 12, 2009, at 8:07 AM, Marcus Stratmann wrote:
> Hello,
>
> I have another question concerning the spell checking mechanism.
> Setting onlyMorePopular=true and using the parameters
>
>   spellcheck=true&spellcheck.q=gran&q=gran&spellcheck.onlyMorePopular=true
>
> I get the result
>
>   <lst name="spellcheck">
>     <lst name="suggestions">
>       <lst name="gran">
>         <int name="numFound">1</int>
>         <int name="startOffset">0</int>
>         <int name="endOffset">4</int>
>         <int name="origFreq">13</int>
>         <lst name="suggestion">
>           <int name="frequency">32</int>
>           <str name="word">grand</str>
>         </lst>
>       </lst>
>       <bool name="correctlySpelled">true</bool>
>     </lst>
>   </lst>
>
> which is okay. But when I turn off onlyMorePopular
>
>   spellcheck=true&spellcheck.q=gran&q=gran&spellcheck.onlyMorePopular=false
>
> the output is
>
>   <lst name="spellcheck">
>     <lst name="suggestions"/>
>   </lst>
>
> I was expecting to get *more* results when I turn off onlyMorePopular and to get all of the results contained in the result without onlyMorePopular (grand) plus some more. Instead I get no spell check results at all. Why is that?
>
> Thanks,
> Marcus
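For anyone reproducing the two requests, the parameters are ordinary URL query parameters. The helper below is a small sketch that assembles them with `&` separators; the class and method names are made up for illustration, and the host, core and request handler path are left out because the thread does not state them.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class SpellcheckParams {
    /** Builds the query string for a spellcheck request on a single term. */
    static String queryString(String term, boolean onlyMorePopular) {
        // LinkedHashMap keeps the parameters in the order used in the thread.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("spellcheck", "true");
        params.put("spellcheck.q", term);
        params.put("q", term);
        params.put("spellcheck.onlyMorePopular", String.valueOf(onlyMorePopular));

        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            // Encode values so terms containing spaces or symbols stay valid.
            joined.add(entry.getKey() + "=" + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }
}
```

Calling `queryString("gran", true)` and `queryString("gran", false)` gives the two parameter strings compared in the original question.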