Re: spellcheck.onlyMorePopular

2009-03-28 Thread David Smiley @MITRE.org

I know your issue has already been addressed, but you may want to consider
treating "gran" as a synonym for "grand" and analyzing it as such.
~ David Smiley
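
The synonym idea can be illustrated with a minimal sketch. It is only a stand-in for a synonym filter in the analysis chain, not Solr's own implementation; the class name `SynonymSketch` and the mapping used in the example are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SynonymSketch {
    // Replaces each token with its canonical form when a synonym mapping exists;
    // tokens without a mapping pass through unchanged.
    static List<String> normalize(List<String> tokens, Map<String, String> synonyms) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(synonyms.getOrDefault(token, token));
        }
        return out;
    }
}
```

With a mapping from "gran" to "grand" applied at both index and query time, "gran turismo" and "grand turismo" would normalize to the same tokens, so no spelling suggestion would be needed to match them.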


Marcus Stratmann wrote:
 
 Hello,
 
 I have another question concerning the spell checking mechanism.
 Setting onlyMorePopular=true and using the parameters
 
 spellcheck=true&spellcheck.q=gran&q=gran&spellcheck.onlyMorePopular=true
 
 I get the result
 
 <lst name="spellcheck">
   <lst name="suggestions">
     <lst name="gran">
       <int name="numFound">1</int>
       <int name="startOffset">0</int>
       <int name="endOffset">4</int>
       <int name="origFreq">13</int>
       <lst name="suggestion">
         <int name="frequency">32</int>
         <str name="word">grand</str>
       </lst>
     </lst>
     <bool name="correctlySpelled">true</bool>
   </lst>
 </lst>
 
 which is okay.
 But when I turn off onlyMorePopular
 
 spellcheck=true&spellcheck.q=gran&q=gran&spellcheck.onlyMorePopular=false
 
 the output is
 
 <lst name="spellcheck">
   <lst name="suggestions"/>
 </lst>
 
 I was expecting to get *more* results when I turn off onlyMorePopular 
 and to get all of the results contained in the result without 
 onlyMorePopular (grand) plus some more. Instead I get no spell check 
 results at all. Why is that?
 
 Thanks,
 Marcus
 
 

-- 
View this message in context: 
http://www.nabble.com/spellcheck.onlyMorePopular-tp21975735p22761717.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: spellcheck.onlyMorePopular

2009-02-16 Thread Marcus Stratmann

Shalin Shekhar Mangar wrote:

The implementation is a bit more complicated.

1. Read all tokens from the specified field in the Solr index.
2. Create n-grams of the terms read in #1 and index them into a separate
Lucene index (spellcheck index).
3. When asked for suggestions, create n-grams of the query terms, search the
spellcheck index and collect the top (by Lucene score) 10*spellcheck.count
results.
4. If onlyMorePopular=true, determine the frequency of each result in the Solr
index and remove terms which have a lower frequency than the query token.
5. Compute the edit distance between the result and the query token.
6. Return the top spellcheck.count results (sorted by edit distance
descending) which are greater than the specified accuracy.


Thanks, I think this makes things clear(er) now. I do agree that the 
documentation needs improvement on this point, as you said later in this 
thread. :)




Your primary use-case is not spellcheck at all, but this might work with some
hacking. Fuzzy queries may be a better solution, as Walter said. Storing all
successful search queries may be hard to scale.


This is certainly true.

The drawback of fuzzy searching is that you get back exact and fuzzy 
hits together in one result set (correct me if I'm wrong). One could 
filter out the exact/fuzzy hits but this would make paging impossible.


The approach using KeywordTokenizer as you suggested before seems to be 
more promising to me. Unfortunately there seems to be no documentation 
for this (at least in conjunction with spell checking). If I understand 
this rightly, the tokenizer must be applied to the field in the search 
index (not the spell checking index). Is that correct?


Thanks,
Marcus


Re: spellcheck.onlyMorePopular

2009-02-15 Thread Shalin Shekhar Mangar
On Sun, Feb 15, 2009 at 8:56 AM, Mark Miller markrmil...@gmail.com wrote:

 I think that's the problem with it. People do think of it this way, and it
 ends up being very confusing.

 If you don't use onlyMorePopular, and you ask for suggestions for a word
 that happens to be in the index, you get the word back.

 So if I ask for corrections to "Lucene", and it's in the index, it suggests
 "Lucene". This is nice for multi-term suggestions, because for "mrk lucene" it
 might suggest "mark lucene".

 Now say I want to toggle onlyMorePopular to add frequency into the mix - my
 expectation is that, perhaps now I will get the suggestion "mork lucene" if
 "mork" has a higher freq than "mark".

 But I will get maybe "mork luke" instead, because I am guaranteed not to
 get "Lucene" as a suggestion if onlyMorePopular is on.


onlyMorePopular=true considers tokens of frequency greater than or equal to
the frequency of the original token. So you may still get "Lucene" as a suggestion.


 Personally I think it all ends up being pretty counterintuitive,
 especially when asking for suggestions for multiple terms. You start getting
 suggestions for alternate spellings no matter what - "Lucene" could be in the
 index a billion times, it will still suggest something else. But with
 onlyMorePopular off, it will throw back "Lucene". You can deal with it if you
 know what's up, but as we have seen from all the questions on this, it's not
 easy to understand why things change like that.


I agree that it is confusing. Do you have any suggestions on ways to fix
this? More/better documentation, changes in behavior, change
'onlyMorePopular' parameter's name, etc.?
-- 
Regards,
Shalin Shekhar Mangar.


Re: spellcheck.onlyMorePopular

2009-02-15 Thread Mark Miller

Shalin Shekhar Mangar wrote:

On Sun, Feb 15, 2009 at 8:56 AM, Mark Miller markrmil...@gmail.com wrote:

  

I think that's the problem with it. People do think of it this way, and it
ends up being very confusing.

If you don't use onlyMorePopular, and you ask for suggestions for a word
that happens to be in the index, you get the word back.

So if I ask for corrections to "Lucene", and it's in the index, it suggests
"Lucene". This is nice for multi-term suggestions, because for "mrk lucene" it
might suggest "mark lucene".

Now say I want to toggle onlyMorePopular to add frequency into the mix - my
expectation is that, perhaps now I will get the suggestion "mork lucene" if
"mork" has a higher freq than "mark".

But I will get maybe "mork luke" instead, because I am guaranteed not to
get "Lucene" as a suggestion if onlyMorePopular is on.




onlyMorePopular=true considers tokens of frequency greater than or equal to
the frequency of the original token. So you may still get "Lucene" as a suggestion.

  
Is that the only difference? When I look at the code (I'm new to this 
area of the code, so I certainly could be wrong, wouldn't be the first 
time, or less than the 100,000th probably), I see:


   // if the word exists in the real index and we don't care for word frequency, return the word itself
   if (!morePopular && freq > 0) {
     return new String[] { word };
   }

So if you have onlyMorePopular=false, "Lucene" will get "Lucene" if it's in 
the index. But if we make it past that line (onlyMorePopular=true), 
later there is:


 // don't suggest a word for itself, that would be silly
 if (sugWord.string.equals(word)) {
   continue;
 }

So you end up getting all of the suggestions *but* "Lucene", right? 
You had to already know the word was misspelled, and now you're asking for 
a better one. With onlyMorePopular=false, you only get a correction 
if the word is misspelled.
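
The interaction of the two quoted branches can be sketched in isolation. This is a simplified stand-in and not the actual Lucene SpellChecker: the class `ShortCircuitSketch`, the candidate list, and the exact frequency comparison are assumptions made for illustration. The frequencies in the usage note are the ones reported in the original question (gran 13, grand 32).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class ShortCircuitSketch {
    // Simplified stand-in for the suggestion loop discussed above; not the real Lucene code.
    static List<String> suggest(String word, boolean morePopular,
                                Map<String, Integer> indexFreq, List<String> candidates) {
        int freq = indexFreq.getOrDefault(word, 0);
        // First quoted branch: the word exists and frequency is ignored,
        // so the word itself is returned and no alternatives are examined.
        if (!morePopular && freq > 0) {
            return Collections.singletonList(word);
        }
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            // Second quoted branch: never suggest the word for itself.
            if (candidate.equals(word)) {
                continue;
            }
            // Assumed filter: with morePopular, drop candidates rarer than the original word.
            if (morePopular && indexFreq.getOrDefault(candidate, 0) < freq) {
                continue;
            }
            result.add(candidate);
        }
        return result;
    }
}
```

Under these assumptions, asking for "gran" with morePopular=true yields "grand", while morePopular=false yields only "gran" itself, which would be consistent with the empty suggestion list reported at the start of the thread.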


It seems to me, if you are trying to use the suggested query that's built 
up, you change the behavior beyond just:


onlyMorePopular=true considers tokens of frequency greater than or equal to
the frequency of the original token.

- Mark





Re: spellcheck.onlyMorePopular

2009-02-15 Thread Shalin Shekhar Mangar
On Sun, Feb 15, 2009 at 10:00 PM, Mark Miller markrmil...@gmail.com wrote:

 But if we make it past that line (onlyMorePopular=true), later there is:

 // don't suggest a word for itself, that would be silly
 if (sugWord.string.equals(word)) {
   continue;
 }

 So you end up getting all of the suggestions *but* "Lucene", right? You
 had to already know the word was misspelled, and now you're asking for a
 better one. With onlyMorePopular=false, you only get a correction if the
 word is misspelled.


Yes, of course, you are right: one would never get "Lucene" back if
onlyMorePopular=true.




 It seems to me, if you are trying to use the suggested query that's built
 up, you change the behavior beyond just:


 onlyMorePopular=true considers tokens of frequency greater than or equal to
 the frequency of the original token.


We definitely need better documentation for this option.

-- 
Regards,
Shalin Shekhar Mangar.


Re: spellcheck.onlyMorePopular

2009-02-13 Thread Marcus Stratmann

Grant Ingersoll wrote:

I believe the reason is b/c when onlyMP is false, if the word itself is 
already in the index, it short circuits out.  When onlyMP is true, it 
checks to see if there are more frequently occurring variations.

This would mean that onlyMorePopular=false isn't useful at all. If the 
word is in the index it would not find less frequent words, and if it is 
not in the index onlyMorePopular=false isn't useful since there are no 
less popular words.

So if you are right this is a bug, isn't it?

Thanks,
Marcus


Re: spellcheck.onlyMorePopular

2009-02-13 Thread Marcus Stratmann

Shalin Shekhar Mangar wrote:

The end goal is to give spelling suggestions. Even if it gave less
frequently occurring spelling suggestions, what would you do with it?

To give you an example:
We have an index for computer games. One title is "gran turismo". The 
word "gran" is less frequent in the index than "grand". So if someone 
searches for "grand turismo" there will be no suggestion "gran".


And to come back to my last question: There seems to be no case in which 
onlyMorePopular=false makes sense (provided Grant's assumption is 
correct). Do you see one?


Thanks,
Marcus


Re: spellcheck.onlyMorePopular

2009-02-13 Thread Shalin Shekhar Mangar
On Fri, Feb 13, 2009 at 2:51 PM, Marcus Stratmann stratm...@gmx.de wrote:

 Shalin Shekhar Mangar wrote:

 The end goal is to give spelling suggestions. Even if it gave less
 frequently occurring spelling suggestions, what would you do with it?

 To give you an example:
 We have an index for computer games. One title is "gran turismo". The word
 "gran" is less frequent in the index than "grand". So if someone searches
 for "grand turismo" there will be no suggestion "gran".


Unless I'm misunderstanding something, you need phrase suggestions and not
individual suggestions. I mean that you need suggestions for "gran turismo"
and not "gran" and "turismo" separately. Did you try using KeywordTokenizer
for this spell check field?
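
The difference between per-word and whole-field tokenization can be sketched as follows. This only mimics the effect of the two tokenization styles on a field value; it does not use Solr's tokenizer classes, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TokenizationSketch {
    // Whitespace-style tokenization: each word becomes its own spellcheck term.
    static List<String> perWordTokens(String fieldValue) {
        return Arrays.asList(fieldValue.trim().split("\\s+"));
    }

    // Keyword-style tokenization: the whole field value is kept as a single term,
    // so suggestions are made for the phrase as a unit.
    static List<String> singleToken(String fieldValue) {
        return Collections.singletonList(fieldValue);
    }
}
```

With the second style, the spellcheck dictionary would contain "gran turismo" as one entry, so a query for "grand turismo" could be compared against the whole title rather than against "grand" alone.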



 And to come back to my last question: There seems to be no case in which
 onlyMorePopular=false makes sense (provided Grant's assumption is
 correct). Do you see one?


Here's a use-case -- you provide a mis-spelled word and you want the closest
suggestion by edit distance (frequency does not matter).
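
As a minimal sketch of "closest suggestion by edit distance", assuming plain Levenshtein distance (the distance measure actually configured in a given Solr setup may differ) and a hypothetical candidate list:

```java
import java.util.List;

class EditDistanceSketch {
    // Classic Levenshtein distance: insertions, deletions and substitutions each cost 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the candidate with the smallest distance to the word; frequency plays no role.
    static String closest(String word, List<String> candidates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int distance = levenshtein(word, candidate);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }
}
```

For a misspelled input such as "grnd", the candidate "grand" is one edit away and "gran" is two edits away, so "grand" would be chosen regardless of how often either term occurs.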

-- 
Regards,
Shalin Shekhar Mangar.


Re: spellcheck.onlyMorePopular

2009-02-13 Thread Marcus Stratmann

Shalin Shekhar Mangar wrote:

And to come back to my last question: There seems to be no case in which
onlyMorePopular=false makes sense (provided Grant's assumption is
correct). Do you see one?


Here's a use-case -- you provide a mis-spelled word and you want the closest
suggestion by edit distance (frequency does not matter).


Hm, when I try searching for grand using onlyMorePopular=false I do 
not get any results. Same when trying gran. It seems that there will 
be no results at all when using onlyMorePopular=false. Without 
onlyMorePopular there are suggestions for both terms, so there are 
suggestions close enough to the original word(s). Have you tested your 
example case?


Anyway, if you look at it from the user's point of view: The wiki says 
"spellcheck.onlyMorePopular -- Only return suggestions that result in 
more hits for the query than the existing query." This implies that if 
onlyMorePopular=false I will even get results with fewer hits. So when 
I'm checking "grand" I would expect to get the suggestion "gran", which 
is less frequent in the index. But it seems this is not the case.


But even if just the documentation is wrong or unclear:
1) I could not find a case in which onlyMorePopular=false works at all.
2) It would be nice if one could get suggestions with lower frequency 
than the checked word (which is, to me, what onlyMorePopular=false implies).


Thanks,
Marcus



Re: spellcheck.onlyMorePopular

2009-02-13 Thread Shalin Shekhar Mangar
On Fri, Feb 13, 2009 at 5:05 PM, Marcus Stratmann stratm...@gmx.de wrote:

 Hm, when I try searching for grand using onlyMorePopular=false I do not
 get any results. Same when trying gran. It seems that there will be no
 results at all when using onlyMorePopular=false.


When onlyMorePopular is false and the word you searched exists in the index,
it is returned as-is. Therefore if gran and grand are both present in
the index, they will be returned as is.


 Without onlyMorePopular there are suggestions for both terms, so there are
 suggestions close enough to the original word(s). Have you tested your
 example case?


I am confused by this. Did you mean "With onlyMorePopular=true there are
suggestions for both terms"?


 Anyway, if you look at it from the user's point of view: The wiki says
 "spellcheck.onlyMorePopular -- Only return suggestions that result in more
 hits for the query than the existing query." This implies that if
 onlyMorePopular=false I will even get results with fewer hits. So when I'm
 checking "grand" I would expect to get the suggestion "gran", which is less
 frequent in the index. But it seems this is not the case.


If onlyMorePopular=true, then the algorithm finds tokens which have greater
frequency than the searched term. Among these terms, the one which is
closest (by edit distance) is returned.

I think I now understand the source of the confusion. onlyMorePopular=true
is a special behavior which uses *only* those tokens which have higher
frequency than the searched term. onlyMorePopular=false just switches off
this special behavior. It does *not* limit suggestions to tokens which have
lesser frequency than the searched term. In fact, onlyMorePopular=false does
not use frequency of tokens at all. We should document this clearly to avoid
such confusions in the future.


 2) It would be nice if one could get suggestions with lower frequency than
 the checked word (which is, to me, what onlyMorePopular=false implies).


We could enhance spell checker to do that. But can you please explain your
use-case for limiting suggestions to tokens which have lesser frequency? The
goal of spell checker is to give suggestions of wrongly spelled words. It
was neither designed nor intended to give any other sort of query
suggestions.

-- 
Regards,
Shalin Shekhar Mangar.


Re: spellcheck.onlyMorePopular

2009-02-13 Thread Marcus Stratmann

Shalin Shekhar Mangar wrote:

If onlyMorePopular=true, then the algorithm finds tokens which have greater
frequency than the searched term. Among these terms, the one which is
closest (by edit distance) is returned.


Okay, this is a bit weird, but I think I got it now. Let me try to 
explain it using my example. When I search for "gran" (frequency 10) I 
get the suggestion "grand" (frequency 17) when using 
onlyMorePopular=true. When I use onlyMorePopular=false there are no 
suggestions at all. This is because there are some (rare) terms which 
are closer to "gran" than "grand", but none of them are considered, 
because their frequency is below 10. Is that correct?
But then, why isn't "grand" promoted to first place and returned as a 
valid suggestion?




I think I now understand the source of the confusion. onlyMorePopular=true
is a special behavior which uses *only* those tokens which have higher
frequency than the searched term. onlyMorePopular=false just switches off
this special behavior. It does *not* limit suggestions to tokens which have
lesser frequency than the searched term. In fact, onlyMorePopular=false does
not use frequency of tokens at all. We should document this clearly to avoid
such confusions in the future.


I'm still missing the two parameters accuracy and spellcheck.count. Let 
me try to explain how I (now) think the algorithm works:


1) Take all terms from the index as a basic set.
2) If onlyMorePopular=true, remove all terms from the basic set which 
have a frequency below the frequency of the search term.
3) Sort the basic set with respect to distance to the search term and keep 
the spellcheck.count terms with the smallest distance which are 
within accuracy.
4) Remove terms which have a lower frequency than the search term in 
the case onlyMorePopular=false.
5) Return the remaining terms as suggestions.

Point 3 would explain why I do not get any suggestions for "gran" with
onlyMorePopular=false. Nevertheless I think this is a bug, since point 3 
should take the frequency into account as well and promote suggestions 
with high enough frequency if suggestions with low frequency are deleted.


But this is just my assumption on how the algorithm works which explains 
why there are no suggestions using onlyMorePopular=false. Maybe I am 
wrong, but somewhere in the process grand is deleted from the result set.




2) It would be nice if one could get suggestions with lower frequency than
the checked word (which is, to me, what onlyMorePopular=false implies).


We could enhance spell checker to do that. But can you please explain your
use-case for limiting suggestions to tokens which have lesser frequency? The
goal of spell checker is to give suggestions of wrongly spelled words. It
was neither designed nor intended to give any other sort of query
suggestions.


An example would be the mentioned "grand turismo" (note that in the 
example above I was searching for "gran" whereas now I am searching for 
"grand"). "gran" would not be returned as a suggestion because "grand" 
is more frequent in the index. And yes, I know, returning a suggestion 
in this case will only be useful if there is more than one word in the 
search term. You proposed to use KeywordTokenizer for this case but a) I 
(again) was not able to find any documentation for this and b) we are 
working on a different solution for this case using stored search 
queries. If you are interested, it works like this: For every word in 
the query get some spell checking suggestions. Combine these and find 
out if any of these combinations has been searched for (successfully) 
before. Propose the one with the highest (search) frequency. Looks 
promising so far, but the "gran turismo" example won't work, since there 
are too many "grand"s in the index.
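
The stored-query approach described above might be sketched like this. It is a guess at the general shape, not the poster's actual code: the class `StoredQuerySketch`, the per-word alternative lists, and the map of past successful queries with their counts are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class StoredQuerySketch {
    // Builds every phrase that picks exactly one alternative per query word.
    static List<String> combine(List<List<String>> alternativesPerWord) {
        List<String> phrases = new ArrayList<>();
        phrases.add("");
        for (List<String> alternatives : alternativesPerWord) {
            List<String> next = new ArrayList<>();
            for (String prefix : phrases) {
                for (String alternative : alternatives) {
                    next.add(prefix.isEmpty() ? alternative : prefix + " " + alternative);
                }
            }
            phrases = next;
        }
        return phrases;
    }

    // Returns the combination that was searched successfully most often,
    // or null if none of the combinations was ever searched.
    static String propose(List<List<String>> alternativesPerWord,
                          Map<String, Integer> pastQueryCounts) {
        String best = null;
        int bestCount = 0;
        for (String phrase : combine(alternativesPerWord)) {
            int count = pastQueryCounts.getOrDefault(phrase, 0);
            if (count > bestCount) {
                bestCount = count;
                best = phrase;
            }
        }
        return best;
    }
}
```

The number of combinations grows multiplicatively with the number of words and suggestions per word, so a real implementation would probably need to cap both.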


Thanks,
Marcus


Re: spellcheck.onlyMorePopular

2009-02-13 Thread Walter Underwood
Fuzzy search should match grand turismo to gran turismo without
using spelling suggestions. At Netflix, the first hit for the
query grand turismo is the movie Gran Torino and we use fuzzy
with Solr.
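
As a rough illustration of the fuzzy alternative, the sketch below turns a user query into a fuzzy query string by appending the `~` operator of the Lucene query parser syntax to each term. The helper name `toFuzzy` is hypothetical, escaping of special query characters is deliberately left out, and this is not a description of how any particular site builds its queries.

```java
class FuzzyQuerySketch {
    // Appends "~" to every whitespace-separated term so each term is matched fuzzily.
    // Assumes the input contains no query-syntax characters that would need escaping.
    static String toFuzzy(String userQuery) {
        StringBuilder sb = new StringBuilder();
        for (String term : userQuery.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(term).append('~');
        }
        return sb.toString();
    }
}
```

For the query "grand turismo" this produces "grand~ turismo~", which would let documents containing "gran turismo" match without any spelling suggestion step.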

wunder

On 2/13/09 3:35 AM, Marcus Stratmann stratm...@gmx.de wrote:

 Shalin Shekhar Mangar wrote:
 And to come back to my last question: There seems to be no case in which
 onlyMorePopular=false makes sense (provided Grant's assumption is
 correct). Do you see one?
 
 Here's a use-case -- you provide a mis-spelled word and you want the closest
 suggestion by edit distance (frequency does not matter).
 
 Hm, when I try searching for grand using onlyMorePopular=false I do
 not get any results. Same when trying gran. It seems that there will
 be no results at all when using onlyMorePopular=false. Without
 onlyMorePopular there are suggestions for both terms, so there are
 suggestions close enough to the original word(s). Have you tested your
 example case?
 
 Anyway, if you look at it from the user's point of view: The wiki says
 "spellcheck.onlyMorePopular -- Only return suggestions that result in
 more hits for the query than the existing query." This implies that if
 onlyMorePopular=false I will even get results with fewer hits. So when
 I'm checking "grand" I would expect to get the suggestion "gran", which
 is less frequent in the index. But it seems this is not the case.
 
 But even if just the documentation is wrong or unclear:
 1) I could not find a case in which onlyMorePopular=false works at all.
 2) It would be nice if one could get suggestions with lower frequency
 than the checked word (which is, to me, what onlyMorePopular=false implies).
 
 Thanks,
 Marcus
 



Re: spellcheck.onlyMorePopular

2009-02-13 Thread Shalin Shekhar Mangar
On Fri, Feb 13, 2009 at 8:46 PM, Marcus Stratmann stratm...@gmx.de wrote:


 Okay, this is a bit weird, but I think I got it now. Let me try to explain
 it using my example. When I search for "gran" (frequency 10) I get the
 suggestion "grand" (frequency 17) when using onlyMorePopular=true. When I
 use onlyMorePopular=false there are no suggestions at all. This is because
 there are some (rare) terms which are closer to "gran" than "grand", but
 none of them are considered, because their frequency is below 10. Is that
 correct?


No. Think of onlyMorePopular as a toggle between whether to consider
frequency or not. When you say onlyMorePopular=true, higher frequency terms
are considered. When you say onlyMorePopular=false, frequency plays no role
at all and "gran" is returned because, according to the spell checker, it
exists in the index and is therefore a correctly spelled term.


 I'm still missing the two parameters accuracy and spellcheck.count. Let me
 try to explain how I (now) think the algorithm works:

 1) Take all terms from the index as a basic set.
 2) If onlyMorePopular=true, remove all terms from the basic set which have a
 frequency below the frequency of the search term.
 3) Sort the basic set with respect to distance to the search term and keep
 the spellcheck.count terms with the smallest distance which are
 within accuracy.
 4) Remove terms which have a lower frequency than the search term in the
 case onlyMorePopular=false.
 5) Return the remaining terms as suggestions.

 Point 3 would explain why I do not get any suggestions for "gran" with
 onlyMorePopular=false. Nevertheless I think this is a bug, since point 3
 should take the frequency into account as well and promote suggestions with
 high enough frequency if suggestions with low frequency are deleted.

 But this is just my assumption on how the algorithm works which explains
 why there are no suggestions using onlyMorePopular=false. Maybe I am wrong,
 but somewhere in the process grand is deleted from the result set.


Point #4 is incorrect. As I said earlier, when onlyMorePopular=false,
frequency information is not used and there is no filtering of tokens with
respect to frequency.

The implementation is a bit more complicated.

1. Read all tokens from the specified field in the Solr index.
2. Create n-grams of the terms read in #1 and index them into a separate
Lucene index (spellcheck index).
3. When asked for suggestions, create n-grams of the query terms, search the
spellcheck index and collect the top (by Lucene score) 10*spellcheck.count
results.
4. If onlyMorePopular=true, determine the frequency of each result in the Solr
index and remove terms which have a lower frequency than the query token.
5. Compute the edit distance between the result and the query token.
6. Return the top spellcheck.count results (sorted by edit distance
descending) which are greater than the specified accuracy.
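
The six steps can be approximated with an in-memory sketch. Several simplifications are assumptions rather than facts about Solr: the separate n-gram index is replaced by a direct n-gram overlap check, the cap of 10*spellcheck.count candidates is omitted, "edit distance" is treated as a similarity score in [0, 1] where higher means closer (which is how "sorted descending" and "greater than accuracy" are read here), and the short circuit for known words mentioned elsewhere in this thread is included. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SpellcheckPipelineSketch {

    // Steps 2 and 3: character n-grams of a term, e.g. "gran" gives gr, ra, an for n = 2.
    static Set<String> ngrams(String term, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    // Plain Levenshtein distance; insertions, deletions and substitutions each cost 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Step 5, expressed as a similarity in [0, 1]; 1.0 means the strings are identical.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }

    static List<String> suggest(String word, Map<String, Integer> termFreq,
                                int count, double accuracy, boolean onlyMorePopular) {
        int origFreq = termFreq.getOrDefault(word, 0);
        // Short circuit described elsewhere in the thread: a known word is returned as-is.
        if (!onlyMorePopular && origFreq > 0) {
            return Collections.singletonList(word);
        }
        Set<String> queryGrams = ngrams(word, 2);
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : termFreq.entrySet()) {
            String candidate = entry.getKey();
            // Step 3 stand-in: a candidate must share at least one n-gram with the query term.
            Set<String> shared = ngrams(candidate, 2);
            shared.retainAll(queryGrams);
            if (shared.isEmpty() || candidate.equals(word)) {
                continue;
            }
            // Step 4: with onlyMorePopular, drop candidates rarer than the original term.
            if (onlyMorePopular && entry.getValue() < origFreq) {
                continue;
            }
            // Steps 5 and 6: keep candidates whose similarity exceeds the accuracy threshold.
            if (similarity(word, candidate) > accuracy) {
                kept.add(candidate);
            }
        }
        // Step 6: closest candidates first, then cut to the requested count.
        kept.sort((x, y) -> Double.compare(similarity(word, y), similarity(word, x)));
        return new ArrayList<>(kept.subList(0, Math.min(count, kept.size())));
    }
}
```

With the frequencies from the original question (gran 13, grand 32), this sketch returns "grand" for "gran" when onlyMorePopular is true and only "gran" itself when it is false.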


 An example would be the mentioned "grand turismo" (note that in the
 example above I was searching for "gran" whereas now I am searching for
 "grand"). "gran" would not be returned as a suggestion because "grand" is
 more frequent in the index. And yes, I know, returning a suggestion in this
 case will only be useful if there is more than one word in the search term.
 You proposed to use KeywordTokenizer for this case but a) I (again) was not
 able to find any documentation for this and b) we are working on a different
 solution for this case using stored search queries. If you are interested,
 it works like this: For every word in the query get some spell checking
 suggestions. Combine these and find out if any of these combinations has
 been searched for (successfully) before. Propose the one with the highest
 (search) frequency. Looks promising so far, but the "gran turismo" example
 won't work, since there are too many "grand"s in the index.


Your primary use-case is not spellcheck at all, but this might work with some
hacking. Fuzzy queries may be a better solution, as Walter said. Storing all
successful search queries may be hard to scale.

-- 
Regards,
Shalin Shekhar Mangar.


Re: spellcheck.onlyMorePopular

2009-02-12 Thread Grant Ingersoll
I believe the reason is b/c when onlyMP is false, if the word itself  
is already in the index, it short circuits out.  When onlyMP is true,  
it checks to see if there are more frequently occurring variations.


However, I don't have the code in front of me at the moment, so I  
can't verify.


-Grant

On Feb 12, 2009, at 8:07 AM, Marcus Stratmann wrote:


Hello,

I have another question concerning the spell checking mechanism.
Setting onlyMorePopular=true and using the parameters

spellcheck=true&spellcheck.q=gran&q=gran&spellcheck.onlyMorePopular=true


I get the result

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="gran">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <int name="origFreq">13</int>
      <lst name="suggestion">
        <int name="frequency">32</int>
        <str name="word">grand</str>
      </lst>
    </lst>
    <bool name="correctlySpelled">true</bool>
  </lst>
</lst>

which is okay.
But when I turn off onlyMorePopular

spellcheck=true&spellcheck.q=gran&q=gran&spellcheck.onlyMorePopular=false


the output is

<lst name="spellcheck">
  <lst name="suggestions"/>
</lst>

I was expecting to get *more* results when I turn off  
onlyMorePopular and to get all of the results contained in the  
result without onlyMorePopular (grand) plus some more. Instead I  
get no spell check results at all. Why is that?


Thanks,
Marcus