James,

Thank you, but I'm not sure that will work for my needs. I'm very interested in contextual spell checking. Take, for example, the author "stephenie meyer". "stephenie" is a far less popular spelling than "stephanie", but in this context it's the correct option. I feel like shingles with an untokenized query string would be able to catch this, but I can't find many examples of people attempting this.
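To make the shingle idea concrete, here is a minimal sketch in plain Python. It is not Solr's ShingleFilterFactory (no stop words, positions, or offsets); the `shingles` helper and the toy name list are made up purely for illustration. It shows why a two-word shingle can carry context that the single word cannot.

```python
from collections import Counter


def shingles(text, max_shingle_size=2, output_unigrams=True):
    """Return lowercased word n-grams (shingles) of `text`.

    Hypothetical helper for illustration only; it mimics the general idea of
    maxShingleSize / outputUnigrams, not Solr's actual token stream.
    """
    words = text.lower().split()
    min_size = 1 if output_unigrams else 2
    tokens = []
    for start in range(len(words)):
        for size in range(min_size, max_shingle_size + 1):
            if start + size <= len(words):
                tokens.append(" ".join(words[start:start + size]))
    return tokens


# Made-up sample data standing in for a real spelling index.
toy_names = ["Stephenie Meyer", "Stephanie Smith", "Stephanie Jones"]
counts = Counter(token for name in toy_names for token in shingles(name))

# As single words, "stephanie" is more frequent than "stephenie" ...
print(counts["stephanie"], counts["stephenie"])
# ... but only the "stephenie meyer" shingle exists as a whole phrase.
print(counts["stephenie meyer"], counts["stephanie meyer"])
```

If the whole query were matched against such shingles as one token, the phrase-level entry could win over the more popular single-word spelling. Whether the spell checker can be made to do that is the open question here.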
On Mon, Jan 17, 2011 at 2:19 PM, Dyer, James <james.d...@ingrambook.com> wrote:

> Camden,
>
> You may also want to be aware that there is a new feature added to Spell
> Check's "collate" functionality that will guarantee the collations will
> return hits. It also is able to return more than one collation and tell you
> how many hits each one would result in if re-queried. This might do the
> same thing you're trying to do using shingles, but with more accuracy and
> less work.
>
> For info, look at "spellcheck.collate", "spellcheck.maxCollations",
> "spellcheck.maxCollationTries" & "spellcheck.collateExtendedResults" on the
> component's wiki page:
> http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate
>
> This feature is committed to 3.x and 4.x and is available as a patch for
> 1.4.1 (here: https://issues.apache.org/jira/browse/SOLR-2010).
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
>
> -----Original Message-----
> From: Camden Daily [mailto:cam...@jaunter.com]
> Sent: Monday, January 17, 2011 1:01 PM
> To: solr-user@lucene.apache.org
> Subject: Spell Checking a multi word phrase
>
> Hello all,
>
> I'm pretty new to Solr, and trying to set up a spell checker that can
> handle entire phrases. My goal would be to have something that could offer a
> suggestion of "united states" for a query of "untied stats".
>
> I have a very large index, and I've worked a bit with creating shingles for
> the spelling index. The problem I'm running into now is that the
> SpellCheckComponent is always tokenizing the query that I pass to it.
>
> For example, a query like this
>
> http://localhost:8080/solr/spell?q=untied\stats&spellcheck=true&debugQuery=on
>
> The debug information shows me that the parsed query is:
> PhraseQuery(text:"untied stats")
>
> But I receive the spelling suggestions for "untied" and "stats" separately.
> From what I understand, this is not a case where I would want to collate; I
> simply want the entire phrase treated as one token.
>
> I found the following post after much searching that suggests setting up a
> custom QueryConverter:
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200810.mbox/%3c1224516331.3820.119.ca...@localhost.localdomain.tld%3E
>
> Does anyone know if that would be required? I had hoped to avoid Java code
> entirely with Solr (I haven't used Java in a very long time), but if I do
> need to set up the 'MultiWordSpellingQueryConvert' class, would anyone be
> able to give me some tips of exactly how I would add that functionality to
> Solr?
>
> Relevant configs below:
>
> solrconfig.xml:
>
> <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
>   <lst name="spellchecker">
>     <str name="name">default</str>
>     <str name="field">spellShingle</str>
>     <str name="spellcheckIndexDir">./spellShingle</str>
>     <str name="queryAnalyzerFieldType">textSpellShingle</str>
>     <str name="buildOnOptimize">true</str>
>   </lst>
> </searchComponent>
>
> schema.xml:
>
> <fieldType name="textSpellShingle" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"/>
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>             outputUnigrams="true"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> (I had thought setting the KeywordTokenizer for the query analyzer would
> keep it from being tokenized, but it doesn't seem to make any difference)
>
> -Camden Daily
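
For anyone following the thread, the collate parameters James mentions could be attached to the request roughly as sketched below. The host, port, and /spell handler are taken from the example query in the quoted message. The numeric values for maxCollations and maxCollationTries are assumptions chosen only for illustration, and the parameters are only available where the feature James describes is present.

```python
from urllib.parse import urlencode

# Host, port, and handler path come from the example query in the thread.
base_url = "http://localhost:8080/solr/spell"

params = {
    "q": "untied stats",
    "spellcheck": "true",
    "spellcheck.collate": "true",
    "spellcheck.maxCollations": 3,        # assumed value, tune as needed
    "spellcheck.maxCollationTries": 10,   # assumed value, tune as needed
    "spellcheck.collateExtendedResults": "true",
}

# urlencode escapes the space in the query and joins the parameters with "&".
url = base_url + "?" + urlencode(params)
print(url)
```

This only builds the request URL; it does not send it or parse the response.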