You could try experimenting with CommonGramsFilterFactory and
CommonGramsQueryFilterFactory (they differ slightly). There are actually a lot
of cool analyzers bundled with Solr. You can find the full list on my site
at: http://www.solr-start.com/info/analyzers
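
A minimal sketch of how the two CommonGrams filters might be wired into a
field type, in case it helps. The field type name "text_commongrams" and the
word list file "commonwords.txt" are made-up names for illustration (the
file would list words such as "of" and "to", one per line); they are not
taken from your schema. The index-side filter keeps the single words and
also emits joined pairs like "knowledge_of" and "of_science", while the
query-side filter prefers the joined pairs, so a phrase query can match on
them without stop words being dropped.

```xml
<!-- Hypothetical field type; names and file paths are assumptions. -->
<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index time: emit single tokens plus word pairs that include a common word -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query time: use the word pairs in place of the common single words -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Note there is no StopFilterFactory in this sketch, since the point is to keep
the small words searchable. Running a sample phrase through the Admin UI's
analysis page is a quick way to confirm what tokens each side produces.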
Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
On Tue, Jul 15, 2014 at 8:42 AM, Teague James <teag...@insystechinc.com>
wrote:
> Alex,
>
> Thanks! Great suggestion. I figured out that it was the
> EdgeNGramFilterFactory. Taking that out of the mix did it.
>
> -Teague
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Monday, July 14, 2014 9:14 PM
> To: solr-user
> Subject: Re: Of, To, and Other Small Words
>
> Have you tried the Admin UI's Analysis screen? It will show you
> what happens to the text as it progresses through the tokenizers and
> filters. No need to reindex.
>
> Regards,
> Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On Tue, Jul 15, 2014 at 8:10 AM, Teague James
> <teag...@insystechinc.com> wrote:
>> Hi Anshum,
>>
>> Thanks for replying and suggesting this, but the field type I am using
>> (a modified text_general) in my schema has the stopwords file set to
>> 'stopwords.txt'.
>>
>> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>     <!-- in this example, we will only use synonyms at query time
>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>     -->
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <!-- CHANGE: The NGramFilterFactory was added to provide partial word search.
>>          This can be changed to EdgeNGramFilterFactory side="front" to only match
>>          front sided partial searches if matching any part of a word is undesirable. -->
>>     <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10" />
>>     <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat'
>>          and 'cats' by searching for 'cat' -->
>>     <filter class="solr.PorterStemFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat'
>>          and 'cats' by searching for 'cat' -->
>>     <filter class="solr.PorterStemFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> Just to be doubly sure, I cleared the list in stopwords_en.txt,
>> restarted Solr, re-indexed, and searched, still with zero results. Any
>> other suggestions on where I might be able to control this behavior?
>>
>> -Teague
>>
>>
>> -----Original Message-----
>> From: Anshum Gupta [mailto:ans...@anshumgupta.net]
>> Sent: Monday, July 14, 2014 4:04 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Of, To, and Other Small Words
>>
>> Hi Teague,
>>
>> The StopFilterFactory (which I think you're using) by default uses
>> lang/stopwords_en.txt (which wouldn't be empty if you check).
>> What you're looking at is stopwords.txt. You could either empty
>> that file out or change the field type for your field.
>>
>>
>> On Mon, Jul 14, 2014 at 12:53 PM, Teague James
>> <teag...@insystechinc.com> wrote:
>>> Hello all,
>>>
>>> I am working with Solr 4.9.0 and am searching for phrases that
>>> contain words like "of" or "to" that Solr seems to be ignoring at
>>> index time. Here's what I tried:
>>>
>>> curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml" \
>>>   --data-binary '<add><doc><field name="id">100</field><field name="content">blah blah blah knowledge of science blah blah blah</field></doc></add>'
>>>
>>> Then, using a browser:
>>>
>>> http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100
>>>
>>> I get zero hits. Search for "knowledge" or "science" and I'll get
>>> hits. Search for "knowledge of" or "of science" and I get zero hits.
>>> I don't want to use proximity if I can avoid it, as this may introduce
>>> too many undesirable results. Stopwords.txt is blank, yet clearly Solr
>>> is ignoring "of" and "to" and possibly more words that I have not
>>> discovered through testing yet. Is there some other configuration file
>>> that contains these small words? Is there any way to force Solr to pay
>>> attention to them and not drop them from the phrase? Any advice is
>>> appreciated! Thanks!
>>>
>>> -Teague
>>>
>>>
>>
>>
>>
>> --
>>
>> Anshum Gupta
>> http://www.anshumgupta.net
>>
>