Re: Of, To, and Other Small Words

Alexandre Rafalovitch Mon, 14 Jul 2014 18:15:52 -0700

Have you tried the Admin UI's Analysis screen? It will show you
what happens to the text as it progresses through the tokenizers and
filters. No need to reindex.
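
As a rough illustration of what that screen may reveal, the sketch below
mimics the index-time chain of the quoted text_general field type in plain
Python. It is not Solr code: the function names are made up for
illustration, the tokenizer is a crude regex stand-in for
StandardTokenizerFactory, the Porter stemmer is left out, and it assumes
that NGramFilterFactory emits nothing for tokens shorter than minGramSize.

```python
import re


def standard_tokenize(text):
    """Crude stand-in for StandardTokenizerFactory: split on non-word characters."""
    return re.findall(r"\w+", text)


def stop_filter(tokens, stopwords=()):
    """Drop tokens found in the stopword list, ignoring case (ignoreCase="true")."""
    stop = {w.lower() for w in stopwords}
    return [t for t in tokens if t.lower() not in stop]


def lowercase_filter(tokens):
    """Lowercase every token, as LowerCaseFilterFactory would."""
    return [t.lower() for t in tokens]


def ngram_filter(tokens, min_gram=3, max_gram=10):
    """Emit every substring of length min_gram..max_gram for each token.

    ASSUMPTION: tokens shorter than min_gram produce no output at all.
    """
    grams = []
    for token in tokens:
        for size in range(min_gram, max_gram + 1):
            for start in range(len(token) - size + 1):
                grams.append(token[start:start + size])
    return grams


def index_chain(text, stopwords=()):
    """Hypothetical helper: run the index-time filters in schema order (no stemming)."""
    tokens = standard_tokenize(text)
    tokens = stop_filter(tokens, stopwords)
    tokens = lowercase_filter(tokens)
    return ngram_filter(tokens)


if __name__ == "__main__":
    # Empty stopword list, matching the blank stopwords.txt described in the thread.
    grams = index_chain("knowledge of science")
    print("knowledge" in grams)  # whole word survives as a 9-character gram
    print("of" in grams)         # two-letter token is below min_gram
    print(ngram_filter(["of"]))  # nothing emitted under the stated assumption
```

If that assumption holds for your Solr version, two-letter words such as
"of" would never reach the index regardless of the stopword file. The
Analysis screen should confirm or refute this for the real field type.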


Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Tue, Jul 15, 2014 at 8:10 AM, Teague James <teag...@insystechinc.com> wrote:
> Hi Anshum,
>
> Thanks for replying and suggesting this, but the field type I am using (a 
> modified text_general) in my schema has the file set to 'stopwords.txt'.
>
>         <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>                 <analyzer type="index">
>                         <tokenizer class="solr.StandardTokenizerFactory"/>
>                         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>                         <!-- in this example, we will only use synonyms at query time
>                         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>-->
>                         <filter class="solr.LowerCaseFilterFactory"/>
>                         <!-- CHANGE: The NGramFilterFactory was added to provide partial word search. This can be changed to
>                         EdgeNGramFilterFactory side="front" to only match front sided partial searches if matching any
>                         part of a word is undesirable.-->
>                         <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10" />
>                         <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>                         <filter class="solr.PorterStemFilterFactory"/>
>                 </analyzer>
>                 <analyzer type="query">
>                         <tokenizer class="solr.StandardTokenizerFactory"/>
>                         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>                         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>                         <filter class="solr.LowerCaseFilterFactory"/>
>                         <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>                         <filter class="solr.PorterStemFilterFactory"/>
>                 </analyzer>
>         </fieldType>
>
> Just to be doubly sure, I cleared the list in stopwords_en.txt, restarted 
> Solr, re-indexed, and searched, still with zero results. Any other suggestions 
> on where I might be able to control this behavior?
>
> -Teague
>
>
> -----Original Message-----
> From: Anshum Gupta [mailto:ans...@anshumgupta.net]
> Sent: Monday, July 14, 2014 4:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Of, To, and Other Small Words
>
> Hi Teague,
>
> The StopFilterFactory (which I think you're using) by default uses 
> lang/stopwords_en.txt (which wouldn't be empty if you check).
> What you're looking at is stopwords.txt. You could either empty that file 
> out or change the field type for your field.
>
>
> On Mon, Jul 14, 2014 at 12:53 PM, Teague James <teag...@insystechinc.com> wrote:
>> Hello all,
>>
>> I am working with Solr 4.9.0 and am searching for phrases that contain
>> words like "of" or "to" that Solr seems to be ignoring at index time.
>> Here's what I tried:
>>
>> curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
>> --data-binary '<add><doc><field name="id">100</field><field
>> name="content">blah blah blah knowledge of science blah blah
>> blah</field></doc></add>'
>>
>> Then, using a browser:
>>
>> http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100
>>
>> I get zero hits. Search for "knowledge" or "science" and I'll get hits.
>> "knowledge of" or "of science" and I get zero hits. I don't want to
>> use proximity if I can avoid it, as this may introduce too many
>> undesirable results. Stopwords.txt is blank, yet clearly Solr is ignoring 
>> "of" and "to"
>> and possibly more words that I have not discovered through testing
>> yet. Is there some other configuration file that contains these small
>> words? Is there any way to force Solr to pay attention to them and not
>> drop them from the phrase? Any advice is appreciated! Thanks!
>>
>> -Teague
>>
>>
>
>
>
> --
>
> Anshum Gupta
> http://www.anshumgupta.net
>
