[jira] [Commented] (SOLR-2585) Context-Sensitive Spelling Suggestions & Collations

James Dyer (JIRA) Fri, 02 Sep 2011 10:33:34 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096153#comment-13096153
 ]


James Dyer commented on SOLR-2585:
----------------------------------

Robert,

Thank you for your comments.  You're right that "onlyMorePopular" will return 
suggestions for terms in the index.  This issue is all about coming up with an 
alternative to the "onlyMorePopular" feature, which I feel is flawed (for what 
I've tried to use it for).  There are three related problems I'm trying to 
solve here:

1. If you specify "onlyMorePopular=true" in Solr, any Collations it tries to 
build will exclude *all* of the user's original terms that generated suggestions.  
In other words, it assumes that any words with more-popular alternatives must 
be misspelled.  This makes the "onlyMorePopular" option less-than-useful when 
used with Solr.  For example, in my index of book titles, we have a few titles 
with the words "mist" and "life" and a whole bunch with "most" and "life".  But 
if I fat-finger my query and search for (mist AND lifr), 
"onlyMorePopular=false" returns no collations while "onlyMorePopular=true" 
tries correcting all my words and suggests (most AND life).  It would be better 
if we could get either or both of these collations returned.  What is even more 
maddening is that sometimes the collation possibilities with "onlyMorePopular" 
all return 0 hits because it is throwing out the valid terms in the user's 
query in favor of "more popular" terms.

2. Sometimes a less-frequent correction is desirable.  If a Lucene user is 
trying to find the book titled "..in the midst of divorce" but queries on "mist 
divorce", with my data we won't get "midst" as a suggestion at all even when 
using "onlyMorePopular".  The problem, of course, is that "mist" occurs with a 
higher frequency than "midst".

3. When doing dismax queries, it would be nice to have a "master" dictionary 
that contains a conglomeration of all the terms in all the fields that dismax 
is set to search across.  But when the spellchecker limits itself either to 
query terms that are not in the dictionary or to suggestions that are more 
popular, it sometimes misses the terms it needs to get the appropriate correction.
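
To make the three problems above concrete, here is a minimal toy sketch in 
Python (not Solr or Lucene code).  The mode names, the "suggest" function and 
the term frequencies are all hypothetical, made up only to mirror the 
mist/most/midst example; real spellcheckers rank candidates differently.

```python
from enum import Enum


class SuggestMode(Enum):
    """Hypothetical mode names, used only for this illustration."""
    WHEN_NOT_IN_INDEX = 1   # only correct terms with frequency 0
    MORE_POPULAR = 2        # only offer terms more frequent than the original
    ALWAYS = 3              # offer close terms regardless of frequency


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(term, freqs, mode, max_edits=2):
    """Return candidate corrections for one term, closest and most frequent first."""
    term_freq = freqs.get(term, 0)
    if mode is SuggestMode.WHEN_NOT_IN_INDEX and term_freq > 0:
        return []  # the term exists, so it is assumed to be spelled correctly
    candidates = []
    for word, freq in freqs.items():
        if word == term:
            continue
        distance = edit_distance(term, word)
        if distance > max_edits:
            continue
        if mode is SuggestMode.MORE_POPULAR and freq <= term_freq:
            continue  # less popular alternatives are thrown away
        candidates.append((distance, -freq, word))
    return [word for _, _, word in sorted(candidates)]


# Made-up document frequencies for an index of book titles.
FREQS = {"mist": 3, "most": 40, "midst": 1, "life": 25, "divorce": 5}

print(suggest("mist", FREQS, SuggestMode.WHEN_NOT_IN_INDEX))  # []
print(suggest("mist", FREQS, SuggestMode.MORE_POPULAR))       # ['most']
print(suggest("mist", FREQS, SuggestMode.ALWAYS))             # ['most', 'midst']
```

With these numbers, "midst" only shows up in the third mode, which is the 
behavior problem #2 asks for; "lifr" gets "life" in every mode because its 
frequency is 0.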

I am all in favor of breaking this up into 2 issues as #1 is a Solr-only 
problem and #2 involves Lucene also.  (#3 would be solved if we did #1 & #2.)  
I also fully agree that if we create 3 "suggest modes", they should apply to 
all the spellcheckers, not just DirectSpellChecker.  I was actually thinking 
that creating a common interface (or abstract class) the spellcheckers all 
could implement (or extend) would be a nice follow-up to this, should this ever 
be committed.  (It's difficult for me to maintain multiple uncommitted patches 
with dependencies on each other so I try to do these sorts of things one issue 
at a time...)

Do you agree with this division?  I could split #2 off into a separate 
"LUCENE-" issue, and this issue can be about #1.  (#3 solves itself when the 
other 2 are worked out.)

> Context-Sensitive Spelling Suggestions & Collations
> ---------------------------------------------------
>
>                 Key: SOLR-2585
>                 URL: https://issues.apache.org/jira/browse/SOLR-2585
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>    Affects Versions: 4.0
>            Reporter: James Dyer
>            Priority: Minor
>         Attachments: SOLR-2585.patch, SOLR-2585.patch
>
>
> Solr currently cannot offer what I'm calling here a "context-sensitive" 
> spelling suggestion.  That is, if a user enters one or more words that have 
> docFrequency > 0, but nevertheless are misspelled, then no suggestions are 
> offered.  Currently, Solr will always consider a word "correctly spelled" if 
> it is in the index and/or dictionary, regardless of context.  This issue & 
> patch add support for context-sensitive spelling suggestions. 
> See SpellCheckCollatorTest.testContextSensitiveCollate() for the typical 
> use case for this functionality.  This tests both using 
> IndexBasedSpellChecker and DirectSolrSpellChecker. 
> Two new Spelling Parameters were added:
>   - spellcheck.alternativeTermCount - The count of suggestions to return for 
> each query term existing in the index and/or dictionary.  Presumably, users 
> will want fewer suggestions for words with docFrequency>0.  Also setting this 
> value turns "on" context-sensitive spell suggestions. 
>   - spellcheck.maxResultsForSuggest - The maximum number of hits the request 
> can return in order to both generate spelling suggestions and set the 
> "correctlySpelled" element to "false".  For example, if this is set to 5 and 
> the user's query returns 5 or fewer results, the spellchecker will report 
> "correctlySpelled=false" and also offer suggestions (and collations if 
> requested).  Setting this greater than zero is useful for creating 
> "did-you-mean" suggestions for queries that return a low number of hits.
> I have also included a test using shards.  See additions to 
> DistributedSpellCheckComponentTest. 
> In Lucene, SpellChecker.java can already support this functionality (by 
> passing a null IndexReader and field-name).  The DirectSpellChecker, however, 
> needs a minor enhancement.  This gives the option to allow DirectSpellChecker 
> to return suggestions for all query terms regardless of frequency.
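
As a rough illustration of the two parameters described in the issue, the 
following Python sketch builds a query string that sets them and models the 
"maxResultsForSuggest" threshold.  The function names and default values are 
assumptions made for this example, and the threshold function only restates 
the "5 or fewer results" rule from the description; it is not the patch's 
actual logic.

```python
from urllib.parse import urlencode


def build_spellcheck_query(q, alternative_term_count=2, max_results_for_suggest=5):
    """Encode a query string carrying the two new spelling parameters.

    The defaults are arbitrary values chosen for illustration.
    """
    params = {
        "q": q,
        "spellcheck": "true",
        "spellcheck.collate": "true",
        # Suggestions to return for terms that already exist in the index.
        "spellcheck.alternativeTermCount": alternative_term_count,
        # Hit-count ceiling under which suggestions are still generated.
        "spellcheck.maxResultsForSuggest": max_results_for_suggest,
    }
    return urlencode(params)


def should_offer_suggestions(hits, max_results_for_suggest):
    """Model of the rule in the description: suggest when the query
    returns max_results_for_suggest hits or fewer."""
    return hits <= max_results_for_suggest


print(build_spellcheck_query("mist AND lifr"))
print(should_offer_suggestions(hits=5, max_results_for_suggest=5))   # True
print(should_offer_suggestions(hits=6, max_results_for_suggest=5))   # False
```

The resulting string would be appended to whatever search handler URL the 
Solr installation exposes; that URL is deliberately left out here.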

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
