You could try experimenting with CommonGramsFilterFactory and CommonGramsQueryFilterFactory (they are slightly different). There are actually a lot of cool analyzers bundled with Solr. You can find the full list on my site at: http://www.solr-start.com/info/analyzers
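
In case it helps, here is a minimal sketch of how the two could be wired up. It is an untested illustration, not a drop-in configuration. The field type name text_commongrams and the file commonwords.txt are made up for this example; the file would need to list the small words you care about ("of", "to", and so on), one per line. As far as I understand it, the index-side filter keeps the single words and also adds pairs such as knowledge_of and of_science, while the query-side variant prefers those pairs when a common word is present:

```xml
<!-- Hypothetical field type; the name and the words file are assumptions for illustration. -->
<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Index time: keep single terms and also emit bigrams for words listed in commonwords.txt -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Query time: the query variant of the same filter, reading the same words file -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Pasting a phrase like "knowledge of science" into the Admin UI's Analysis screen against a field type like this should show which tokens each side produces, again without reindexing.
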
Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On Tue, Jul 15, 2014 at 8:42 AM, Teague James <teag...@insystechinc.com> wrote:
> Alex,
>
> Thanks! Great suggestion. I figured out that it was the
> EdgeNGramFilterFactory. Taking that out of the mix did it.
>
> -Teague
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Monday, July 14, 2014 9:14 PM
> To: solr-user
> Subject: Re: Of, To, and Other Small Words
>
> Have you tried the Admin UI's Analysis screen? It will show you what
> happens to the text as it progresses through the tokenizers and filters.
> No need to reindex.
>
> Regards,
>    Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On Tue, Jul 15, 2014 at 8:10 AM, Teague James <teag...@insystechinc.com> wrote:
>> Hi Anshum,
>>
>> Thanks for replying and suggesting this, but the field type I am using (a
>> modified text_general) in my schema has the file set to 'stopwords.txt'.
>>
>> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>     <!-- in this example, we will only use synonyms at query time
>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>-->
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <!-- CHANGE: The NGramFilterFactory was added to provide partial word search.
>>          This can be changed to EdgeNGramFilterFactory side="front" to only match
>>          front sided partial searches if matching any part of a word is undesirable.-->
>>     <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10" />
>>     <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for
>>          'cat' and 'cats' by searching for 'cat' -->
>>     <filter class="solr.PorterStemFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for
>>          'cat' and 'cats' by searching for 'cat' -->
>>     <filter class="solr.PorterStemFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> Just to be doubly sure, I cleared the list in stopwords_en.txt, restarted
>> Solr, re-indexed, and searched, still with zero results. Any other
>> suggestions on where I might be able to control this behavior?
>>
>> -Teague
>>
>>
>> -----Original Message-----
>> From: Anshum Gupta [mailto:ans...@anshumgupta.net]
>> Sent: Monday, July 14, 2014 4:04 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Of, To, and Other Small Words
>>
>> Hi Teague,
>>
>> The StopFilterFactory (which I think you're using) by default uses
>> lang/stopwords_en.txt (which wouldn't be empty if you check).
>> What you're looking at is stopwords.txt. You could either empty that file
>> out or change the field type for your field.
>>
>>
>> On Mon, Jul 14, 2014 at 12:53 PM, Teague James <teag...@insystechinc.com> wrote:
>>> Hello all,
>>>
>>> I am working with Solr 4.9.0 and am searching for phrases that
>>> contain words like "of" or "to" that Solr seems to be ignoring at index
>>> time.
>>> Here's what I tried:
>>>
>>> curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml" --data-binary '<add><doc><field name="id">100</field><field name="content">blah blah blah knowledge of science blah blah blah</field></doc></add>'
>>>
>>> Then, using a browser:
>>>
>>> http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100
>>>
>>> I get zero hits. Search for "knowledge" or "science" and I'll get hits.
>>> "knowledge of" or "of science" and I get zero hits. I don't want to
>>> use proximity if I can avoid it, as this may introduce too many
>>> undesirable results. Stopwords.txt is blank, yet clearly Solr is ignoring
>>> "of" and "to" and possibly more words that I have not discovered through
>>> testing yet. Is there some other configuration file that contains these
>>> small words? Is there any way to force Solr to pay attention to them and
>>> not drop them from the phrase? Any advice is appreciated! Thanks!
>>>
>>> -Teague
>>>
>>>
>>
>>
>>
>> --
>>
>> Anshum Gupta
>> http://www.anshumgupta.net
>