Hi Tushar,

On 12/05/2008 at 5:18 AM, tushar kapoor wrote:
> I am trying to filter russian stopwords but have not been
> successful with that.
[...]
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false"/>
[...]
> Intrestingly, Russian synonyms are working fine. English and russian
> synonyms get searched correctly.
>
> Also,If I add an English language word to stopwords.txt it
> gets filtered correctly. Its the russian words that are not
> getting filtered as stopwords.

It might be an encoding issue. StopFilterFactory delegates stopword file reading to SolrResourceLoader.getLines(), which uses an InputStreamReader instantiated with the UTF-8 charset. Is your stopwords.txt encoded as UTF-8?

It's strange that synonyms are working fine, though, since SynonymFilterFactory reads the synonyms file using the same mechanism as StopFilterFactory. Is it possible that your synonyms file is encoded as UTF-8, but your stopwords file uses a different encoding, perhaps KOI8-R? Like UTF-8, KOI8-R includes the entirety of 7-bit ASCII, so English words in a KOI8-R file would still be decoded properly as UTF-8, while the Russian words would not. That would match the behavior you describe.

Steve
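
P.S. If it helps, here is a minimal sketch of one way to check whether a file is well-formed UTF-8. It is a standalone program, not part of Solr; the class name Utf8Check and the method isValidUtf8 are made up for illustration. It relies on the fact that KOI8-R encodes Cyrillic letters as single bytes in the range 0xC0-0xFF, which do not form valid UTF-8 sequences on their own, so a strict decode should fail on a KOI8-R file containing Russian words.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Solr.
class Utf8Check {

    // Returns true only if the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java Utf8Check <file>...");
            return;
        }
        for (String name : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(name));
            System.out.println(name + ": "
                    + (isValidUtf8(bytes) ? "valid UTF-8" : "NOT valid UTF-8"));
        }
    }
}
```

Running it against both stopwords.txt and synonyms.txt should show whether the two files differ. A pure-ASCII file passes this check regardless of which of these encodings it was saved in, so the result is only informative for files that contain the Russian entries.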