Re: Of, To, and Other Small Words

Alexandre Rafalovitch Mon, 14 Jul 2014 18:52:59 -0700

You could try experimenting with CommonGramsFilterFactory and
CommonGramsQueryFilter (slightly different). There are actually a lot
of cool analyzers bundled with Solr. You can find the full list on my site
at: http://www.solr-start.com/info/analyzers
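
As a rough sketch of what that could look like in schema.xml (not a tested
configuration): the field type name "text_commongrams" and the word list
"commonwords.txt" are hypothetical names chosen for illustration, and the
file is assumed to contain the small words to keep, one per line (e.g. "of"
and "to"). The index side uses CommonGramsFilterFactory and the query side
uses the matching CommonGramsQueryFilterFactory. The n-gram and stemming
filters from the field type quoted below are left out for brevity.

```xml
<!-- Hypothetical field type; the name and word file are assumptions. -->
<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits the single terms plus bigrams that pair a common word with its neighbour -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query-side counterpart, so phrase queries can match the indexed bigrams -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

A field using a type like this would need to be reindexed, and the Analysis
screen should show whether "knowledge of science" keeps the small words at
both index and query time.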


Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Tue, Jul 15, 2014 at 8:42 AM, Teague James <teag...@insystechinc.com> wrote:
> Alex,
>
> Thanks! Great suggestion. I figured out that it was the 
> EdgeNGramFilterFactory. Taking that out of the mix did it.
>
> -Teague
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Monday, July 14, 2014 9:14 PM
> To: solr-user
> Subject: Re: Of, To, and Other Small Words
>
> Have you tried the Admin UI's Analysis screen? It will show you what 
> happens to the text as it progresses through the tokenizers and filters. No 
> need to reindex.
>
> Regards,
>    Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On Tue, Jul 15, 2014 at 8:10 AM, Teague James <teag...@insystechinc.com> wrote:
>> Hi Anshum,
>>
>> Thanks for replying and suggesting this, but the field type I am using (a 
>> modified text_general) in my schema has the file set to 'stopwords.txt'.
>>
>>         <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>>                 <analyzer type="index">
>>                         <tokenizer class="solr.StandardTokenizerFactory"/>
>>                         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>                         <!-- in this example, we will only use synonyms at query time
>>                         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>-->
>>                         <filter class="solr.LowerCaseFilterFactory"/>
>>                         <!-- CHANGE: The NGramFilterFactory was added to provide partial word search. This can be changed to
>>                         EdgeNGramFilterFactory side="front" to only match front sided partial searches if matching any
>>                         part of a word is undesirable.-->
>>                         <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10" />
>>                         <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>>                         <filter class="solr.PorterStemFilterFactory"/>
>>                 </analyzer>
>>                 <analyzer type="query">
>>                         <tokenizer class="solr.StandardTokenizerFactory"/>
>>                         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>                         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>                         <filter class="solr.LowerCaseFilterFactory"/>
>>                         <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>>                         <filter class="solr.PorterStemFilterFactory"/>
>>                 </analyzer>
>>         </fieldType>
>>
>> Just to be doubly sure, I cleared the list in stopwords_en.txt, restarted 
>> Solr, re-indexed, and searched, still with zero results. Any other 
>> suggestions on where I might be able to control this behavior?
>>
>> -Teague
>>
>>
>> -----Original Message-----
>> From: Anshum Gupta [mailto:ans...@anshumgupta.net]
>> Sent: Monday, July 14, 2014 4:04 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Of, To, and Other Small Words
>>
>> Hi Teague,
>>
>> The StopFilterFactory (which I think you're using) by default uses 
>> lang/stopwords_en.txt (which wouldn't be empty if you check).
>> What you're looking at is stopwords.txt. You could either empty that file 
>> out or change the field type for your field.
>>
>>
>> On Mon, Jul 14, 2014 at 12:53 PM, Teague James <teag...@insystechinc.com> wrote:
>>> Hello all,
>>>
>>> I am working with Solr 4.9.0 and am searching for phrases that
>>> contain words like "of" or "to" that Solr seems to be ignoring at index 
>>> time.
>>> Here's what I tried:
>>>
>>> curl "http://localhost/solr/update?commit=true" -H "Content-Type: text/xml" \
>>>   --data-binary '<add><doc><field name="id">100</field><field name="content">blah blah blah knowledge of science blah blah blah</field></doc></add>'
>>>
>>> Then, using a browser:
>>>
>>> http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100
>>>
>>> I get zero hits. If I search for "knowledge" or "science", I get hits. For
>>> "knowledge of" or "of science", I get zero hits. I don't want to
>>> use proximity if I can avoid it, as this may introduce too many
>>> undesirable results. Stopwords.txt is blank, yet clearly Solr is
>>> ignoring "of" and "to"
>>> and possibly more words that I have not discovered through testing
>>> yet. Is there some other configuration file that contains these small
>>> words? Is there any way to force Solr to pay attention to them and
>>> not drop them from the phrase? Any advice is appreciated! Thanks!
>>>
>>> -Teague
>>>
>>>
>>
>>
>>
>> --
>>
>> Anshum Gupta
>> http://www.anshumgupta.net
>>
>
