On 9/22/2011 3:54 AM, Pranav Prakash wrote:
Hi List,

I included StopFilterFactory and I can see it taking action in the Analyzer
Interface. However, when I go to Schema Analyzer, I see those stop words in
the top 10 terms. Is this normal?

<fieldType name="text_commongrams" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
  </analyzer>
</fieldType>


You've got CommonGramsFilterFactory and StopFilterFactory both using stopwords.txt, which is a confusing configuration. Normally you'd want one or the other, not both ... but if you did legitimately have both, you'd want them to each use a different wordlist.
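
If you did keep both, a rough sketch of the two filter lines might look like the following. Note that commongrams.txt is a made-up file name for illustration only; it is not in your schema, and you would have to create it yourself with the high-frequency words you want paired up. The rest of the analyzer chain would stay as it is.

```xml
<!-- Sketch only: commongrams.txt is a hypothetical file, not part of the original schema -->
<filter class="solr.CommonGramsFilterFactory" words="commongrams.txt" ignoreCase="true"/>
<!-- The stop filter keeps its own, separate list -->
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
```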

The CommonGrams filter takes each occurrence of a word from the file and produces two combined tokens: one joined to the token before it, and one joined to the token after it. If the word is the first or last term in a field, only one combined token is produced. By the time these reach the stop filter, the combined terms no longer match anything in stopwords.txt, so it takes no action on them.

If I had to guess, what you are seeing in the top 10 terms is the concatenation of your most common stopword with another word. If it were English, I would guess that to be "of_the" or something similar. If my guess is wrong, then I'm not sure what's going on, and some cut/paste of what you're actually seeing might be in order. Did you delete the index and do a full reindex after you changed your schema?

Thanks,
Shawn
