Re: Of, To, and Other Small Words

Jack Krupansky Tue, 15 Jul 2014 04:43:23 -0700

Yeah, this is another one of those places where the behavior of Solr is defined, but way down in the Lucene Javadoc, where no Solr user should ever have to go!


It's also the kind of detail documented in my Solr Deep Dive e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html


-- Jack Krupansky

-----Original Message-----
From: Alexandre Rafalovitch
Sent: Tuesday, July 15, 2014 4:36 AM
To: solr-user
Subject: Re: Of, To, and Other Small Words

https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_0/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopAnalyzer.java#L51

If you don't set the attribute in XML file, it falls back to the
default definitions.
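
As a minimal sketch of the two cases (the field type names below are hypothetical and not taken from anyone's schema in this thread): the first filter has no words attribute and so falls back to the built-in English stop set defined in the StopAnalyzer source linked above; the second reads its list from a file in the core's conf directory.

```xml
<!-- Hypothetical field type: no "words" attribute, so StopFilterFactory
     falls back to Lucene's built-in English stop set (includes "of" and "to") -->
<fieldType name="text_default_stops" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field type: "words" names a file in the conf directory,
     so only the entries listed in that file are removed -->
<fieldType name="text_file_stops" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

With the second form and an empty stopwords.txt, the stop filter itself should not be removing anything, so any dropped words would have to come from another filter in the chain.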
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On Tue, Jul 15, 2014 at 3:16 PM, Aman Tandon <[email protected]> wrote:

Hi Jack,

> it will use the internal *Lucene hardwired list* of stop words

I am unaware of this. Could you please provide more information about
it?


With Regards
Aman Tandon

On Tue, Jul 15, 2014 at 7:21 AM, Alexandre Rafalovitch <[email protected]> wrote:

You could try experimenting with CommonGramsFilterFactory and
CommonGramsQueryFilterFactory (slightly different). There are actually a
lot of cool analyzers bundled with Solr. You can find the full list on my
site at: http://www.solr-start.com/info/analyzers
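
A rough sketch of how the two factories are usually paired, in case it helps. The field type name and the commonwords.txt file are assumptions for illustration only; the file would list the words (such as "of" and "to") that should be joined with their neighbours into word pairs instead of being dropped.

```xml
<!-- Hypothetical field type; "commonwords.txt" is an assumed file in the conf directory -->
<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index side: keeps the single words and adds pairs containing a common word -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query side: the companion factory, meant to be used with the index-side filter above -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

The Analysis screen in the Admin UI is the easiest way to check what tokens a phrase like "knowledge of science" actually produces with a setup like this.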

Regards,
   Alex.

On Tue, Jul 15, 2014 at 8:42 AM, Teague James <[email protected]>
wrote:
> Alex,
>
> Thanks! Great suggestion. I figured out that it was the
> EdgeNGramFilterFactory. Taking that out of the mix did it.
>
> -Teague
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:[email protected]]
> Sent: Monday, July 14, 2014 9:14 PM
> To: solr-user
> Subject: Re: Of, To, and Other Small Words
>
> Have you tried the Admin UI's Analysis screen? It will show you
> what happens to the text as it progresses through the tokenizers and
> filters. No need to reindex.
>
> Regards,
>    Alex.
>
>

> On Tue, Jul 15, 2014 at 8:10 AM, Teague James <[email protected]> wrote:
>> Hi Anshum,
>>
>> Thanks for replying and suggesting this, but the field type I am using
>> (a modified text_general) in my schema has the file set to 'stopwords.txt'.
>>
>>         <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>>                 <analyzer type="index">
>>                         <tokenizer class="solr.StandardTokenizerFactory"/>
>>                         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>                         <!-- in this example, we will only use synonyms at query time
>>                         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>                         -->
>>                         <filter class="solr.LowerCaseFilterFactory"/>
>>                         <!-- CHANGE: The NGramFilterFactory was added to provide partial word search.
>>                         This can be changed to EdgeNGramFilterFactory side="front" to only match
>>                         front-sided partial searches if matching any part of a word is undesirable. -->
>>                         <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10" />
>>                         <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for
>>                         'cat' and 'cats' by searching for 'cat' -->
>>                         <filter class="solr.PorterStemFilterFactory"/>
>>                 </analyzer>
>>                 <analyzer type="query">
>>                         <tokenizer class="solr.StandardTokenizerFactory"/>
>>                         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>                         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>                         <filter class="solr.LowerCaseFilterFactory"/>
>>                         <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for
>>                         'cat' and 'cats' by searching for 'cat' -->
>>                         <filter class="solr.PorterStemFilterFactory"/>
>>                 </analyzer>
>>         </fieldType>
>>
>> Just to be double sure, I cleared the list in stopwords_en.txt,
>> restarted Solr, re-indexed, and searched, still with zero results. Any other
>> suggestions on where I might be able to control this behavior?
>>
>> -Teague
>>
>>
>> -----Original Message-----
>> From: Anshum Gupta [mailto:[email protected]]
>> Sent: Monday, July 14, 2014 4:04 PM
>> To: [email protected]
>> Subject: Re: Of, To, and Other Small Words
>>
>> Hi Teague,
>>
>> The StopFilterFactory (which I think you're using) by default uses
>> lang/stopwords_en.txt (which wouldn't be empty if you check).
>> What you're looking at is stopwords.txt. You could either empty that
>> file out or change the field type for your field.
>>
>>
>> On Mon, Jul 14, 2014 at 12:53 PM, Teague James <[email protected]> wrote:
>>> Hello all,
>>>
>>> I am working with Solr 4.9.0 and am searching for phrases that
>>> contain words like "of" or "to" that Solr seems to be ignoring at
>>> index time.
>>> Here's what I tried:
>>>
>>> curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml" \
>>>   --data-binary '<add><doc><field name="id">100</field><field name="content">blah blah blah knowledge of science blah blah blah</field></doc></add>'
>>>
>>> Then, using a browser:
>>>
>>> http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100
>>>

>>> I get zero hits. Search for "knowledge" or "science" and I'll get hits.
>>> "knowledge of" or "of science" and I get zero hits. I don't want to
>>> use proximity if I can avoid it, as this may introduce too many
>>> undesirable results. Stopwords.txt is blank, yet clearly Solr is
>>> ignoring "of" and "to"
>>> and possibly more words that I have not discovered through testing
>>> yet. Is there some other configuration file that contains these small
>>> words? Is there any way to force Solr to pay attention to them and
>>> not drop them from the phrase? Any advice is appreciated! Thanks!
>>>
>>> -Teague
>>>
>>>
>>
>>
>>
>> --
>>
>> Anshum Gupta
>> http://www.anshumgupta.net
>>
>
