RE: Russian stopwords

Steven A Rowe Fri, 05 Dec 2008 08:55:20 -0800

Hi Tushar,

On 12/05/2008 at 5:18 AM, tushar kapoor wrote:
> I am trying to filter Russian stopwords but have not been
> successful with that.
[...]
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>              words="stopwords.txt"/>
>      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>              ignoreCase="true" expand="false"/>
[...]
> Interestingly, Russian synonyms are working fine. English and Russian
> synonyms get searched correctly.
>
> Also, if I add an English language word to stopwords.txt it
> gets filtered correctly. It's the Russian words that are not
> getting filtered as stopwords.


It might be an encoding issue - StopFilterFactory delegates stopword file 
reading to SolrResourceLoader.getLines(), which uses an InputStreamReader 
instantiated with the UTF-8 charset.  Is your stopwords.txt encoded as UTF-8?
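
One way to answer that question is to decode the file strictly and see
whether it fails.  The following is only a minimal sketch, not anything
taken from Solr itself: the class name Utf8Check and the helper
isValidUtf8() are names I made up for illustration, and the default path
"stopwords.txt" assumes you run it from your conf directory.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Solr.
public class Utf8Check {

    // Returns true only if the bytes are well-formed UTF-8.
    // REPORT makes the decoder throw on bad input instead of
    // silently substituting replacement characters.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed default location; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "stopwords.txt";
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        System.out.println(path
            + (isValidUtf8(bytes) ? " is valid UTF-8" : " is NOT valid UTF-8"));
    }
}
```

A file containing only ASCII will also pass this check, so it is only
informative for a file that actually contains Russian words.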

It's strange that synonyms are working fine, though, since
SynonymFilterFactory reads in the synonyms file using the same mechanism as
StopFilterFactory.  Is it possible that your synonyms file is encoded as
UTF-8, but your stopwords file uses a different encoding, perhaps KOI8-R?
Like UTF-8, KOI8-R includes the entirety of 7-bit ASCII, so English words in
a KOI8-R file would still be decoded properly under UTF-8, while the Russian
words would not, and so would never match the tokens in your text.
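
To illustrate the KOI8-R guess, here is a small sketch of what a UTF-8
reader does with KOI8-R bytes.  It is an assumption-laden demo, not Solr
code: Koi8Demo and readAsUtf8() are made-up names, the sample words are
mine, and it assumes your JVM ships the KOI8-R charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Solr.
public class Koi8Demo {

    // Encodes the text as KOI8-R, then decodes those bytes as UTF-8,
    // imitating a UTF-8 reader pointed at a KOI8-R file.
    // new String(bytes, charset) replaces malformed input with U+FFFD.
    static String readAsUtf8(String text) {
        byte[] koi8 = text.getBytes(Charset.forName("KOI8-R"));
        return new String(koi8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String english = "the";
        String russian = "\u0438";  // the Russian word for "and"

        System.out.println("English survives: "
            + readAsUtf8(english).equals(english));
        System.out.println("Russian survives: "
            + readAsUtf8(russian).equals(russian));
    }
}
```

If that is what is happening, re-saving stopwords.txt as UTF-8 should be
enough.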

Steve
