Re: Russian stopwords
Just so everyone knows: it turns out my stopwords.txt was OK after all. It works correctly on Linux (Ubuntu) and, strangely, on a colleague's Windows 7 machine. My computer also runs Windows 7. The only difference between the two Windows machines is the interface language (French on mine, English on my colleague's). Strange... very, very strange. I hope someone from Microsoft reads this someday.
--
View this message in context: http://lucene.472066.n3.nabble.com/Russian-stopwords-tp491490p4065910.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Russian stopwords
Sounds like it may be an encoding (UTF-8) issue that occurs when the file is _read in_. See if you can change the default locale before starting the Java process (I think it is an environment variable) and check whether that makes a difference. If you have a very easy test case, I would be happy to check it on Mac and Windows. I know Russian (and UTF-8 issues).
Regards, Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
On Fri, May 24, 2013 at 9:01 AM, igiguere igigu...@opentext.com wrote: Just so everyone knows: it turns out my stopwords.txt was OK after all. [...]
Re: Russian stopwords
A colleague stumbled upon this: http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding
The second answer, setting the environment variable JAVA_TOOL_OPTIONS, did the job:
JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
Happy stop-wording!
--
View this message in context: http://lucene.472066.n3.nabble.com/Russian-stopwords-tp491490p4065976.html
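For anyone wondering why that variable matters: the JVM picks up JAVA_TOOL_OPTIONS at startup, and -Dfile.encoding sets the charset that Java falls back to whenever code does not name one. Here is a minimal sketch of the effect. The class name DefaultCharsetDemo and the helper decode are made up for illustration, and ISO-8859-1 is only a stand-in for a single-byte Windows default; which charset was actually in play on the machines in this thread is not known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {

    // Hypothetical helper: decode raw bytes the way a reader using charset cs would.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        // What the JVM falls back to when code does not name a charset.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // A Russian stopword, U+0438 (Cyrillic small letter i), written as an
        // escape so the source file's own encoding cannot interfere.
        String word = "\u0438";
        byte[] utf8Bytes = word.getBytes(StandardCharsets.UTF_8); // two bytes: 0xD0 0xB8

        String good = decode(utf8Bytes, StandardCharsets.UTF_8);
        // ISO-8859-1 is only a stand-in for a single-byte default (assumption).
        String bad = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println("decoded as UTF-8 matches:      " + good.equals(word));
        System.out.println("decoded as ISO-8859-1 matches: " + bad.equals(word));
    }
}
```

Running it once with and once without JAVA_TOOL_OPTIONS set should show the first two lines change. The mis-decoded word no longer equals the token produced from the indexed text, so a stopword list loaded that way would never match.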
Re: Russian stopwords
I'm encountering the same issue, but my Russian stopwords.txt IS encoded in UTF-8. I verified the encoding using EmEditor (I've used it for years, and I use it for the existing English, French, Spanish, Portuguese and German Solr configurations without issues). Just to make extra sure, I downloaded EditPlus, as mentioned in this thread, and verified the encoding again: UTF-8. I realize this may pass for a stupid question, but could there be any issue other than encoding?
Thanks;
--
View this message in context: http://lucene.472066.n3.nabble.com/Russian-stopwords-tp491490p4065440.html
RE: Russian stopwords
Hi Steve,
You were right, it turned out to be an encoding issue, but a really weird one. I was using Windows Notepad to save the stopwords file in UTF-8 encoding, while I was using EditPlus to save the synonyms file. That was the only difference. The moment I switched to EditPlus for saving the stopwords file, it started working for Russian, German and all kinds of languages. Anyway, thanks for suggesting a valid direction.
Regards, Tushar.
Steven A Rowe wrote: [...] It might be an encoding issue [...] is it possible that your synonyms file is encoded as UTF-8, but your stopwords file is encoded with a different encoding, perhaps KOI8-R? [...]
--
View this message in context: http://www.nabble.com/Russian-stopwords-tp20851093p20868126.html
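One possible explanation for the Notepad vs. EditPlus difference, not confirmed anywhere in this thread: Windows Notepad has traditionally written a byte order mark (bytes EF BB BF) at the start of files it saves as UTF-8, and Java's UTF-8 decoder passes it through as the character U+FEFF instead of dropping it. That would glue an invisible character onto the first entry of the file only, so it may well not be the whole story. A minimal sketch for checking a file; the class name BomCheck and its helpers are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class BomCheck {

    // True if the data starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Returns the data without a leading UTF-8 BOM; unchanged if there is none.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java BomCheck path/to/stopwords.txt
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("UTF-8 BOM present: " + hasUtf8Bom(data));

        // Compare the decoded length with and without the BOM.
        String text = new String(data, StandardCharsets.UTF_8);
        String clean = new String(stripUtf8Bom(data), StandardCharsets.UTF_8);
        System.out.println("chars before/after stripping: " + text.length() + "/" + clean.length());
    }
}
```

If the check reports a BOM, re-saving the file without one (most editors call this "UTF-8 without BOM" or similar) is the simplest thing to try.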
RE: Russian stopwords
The default encoding on Windows is not UTF-8, which causes various kinds of weirdness when you develop on Windows. On the plus side, it has helped me find all the places in string handling that need the encoding name parameter, so it's not all bad.
Lance
-----Original Message-----
From: tushar kapoor [mailto:[EMAIL PROTECTED]
Sent: Saturday, December 06, 2008 1:17 AM
To: solr-user@lucene.apache.org
Subject: RE: Russian stopwords
Hi Steve, You were right, it turned out to be an encoding issue, but a really weird one. [...]
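To illustrate the point about passing the encoding name explicitly, here is a small sketch that reads a word list through an InputStreamReader with a named charset, so the result does not depend on the platform default. The class ExplicitCharsetRead and the method readWords are hypothetical names; this is not Solr's code, only the general pattern.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ExplicitCharsetRead {

    // Reads one word per line, always decoding as UTF-8 regardless of the JVM default.
    // Blank lines are skipped.
    static List<String> readWords(InputStream in) {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Two Russian stopwords (U+0438 and U+0432) as UTF-8 bytes, one per line.
        byte[] fileBytes = "\u0438\n\u0432\n".getBytes(StandardCharsets.UTF_8);
        List<String> words = readWords(new ByteArrayInputStream(fileBytes));
        System.out.println("words read: " + words.size());
        System.out.println("first word intact: " + words.get(0).equals("\u0438"));
    }
}
```

The one-argument form, new InputStreamReader(in), is the kind of call that silently uses the platform default and behaves differently from one machine to the next.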
Russian stopwords
I am trying to filter Russian stopwords but have not been successful. I am using the following schema entry:
[...]
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
[...]
Interestingly, Russian synonyms are working fine. English and Russian synonyms get searched correctly. Also, if I add an English word to stopwords.txt, it gets filtered correctly. It's only the Russian words that are not getting filtered as stopwords. Can someone explain this behaviour?
Thanks, Tushar.
--
View this message in context: http://www.nabble.com/Russian-stopwords-tp20851093p20851093.html
RE: Russian stopwords
Hi Tushar,
On 12/05/2008 at 5:18 AM, tushar kapoor wrote:
I am trying to filter Russian stopwords but have not been successful with that. [...]
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
[...] Interestingly, Russian synonyms are working fine. English and Russian synonyms get searched correctly. Also, if I add an English word to stopwords.txt, it gets filtered correctly. It's only the Russian words that are not getting filtered as stopwords.
It might be an encoding issue: StopFilterFactory delegates stopword file reading to SolrResourceLoader.getLines(), which uses an InputStreamReader instantiated with the UTF-8 charset. Is your stopwords.txt encoded as UTF-8?
It's strange that synonyms are working fine, though, because SynonymFilterFactory reads in the synonyms file using the same mechanism as StopFilterFactory. Is it possible that your synonyms file is encoded as UTF-8, but your stopwords file is encoded with a different encoding, perhaps KOI8-R? Like UTF-8, KOI8-R includes the entirety of 7-bit ASCII, so English words would be properly decoded under UTF-8 while the Russian ones would not.
Steve
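A quick way to answer the "is it really UTF-8?" question without trusting an editor's status bar is to decode the bytes strictly and see whether the decoder complains. A sketch follows, with Utf8Validator and isValidUtf8 as made-up names; the demo in main assumes the JDK in use provides the KOI8-R charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Validator {

    // True only if the bytes form well-formed UTF-8 (pure ASCII counts as well-formed).
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String russian = "\u0438\n"; // Cyrillic small letter i, then a newline
        String english = "the\n";

        // Assumption: this JDK provides KOI8-R; Charset.forName throws otherwise.
        Charset koi8r = Charset.forName("KOI8-R");

        System.out.println("Russian as UTF-8:  " + isValidUtf8(russian.getBytes(StandardCharsets.UTF_8)));
        System.out.println("Russian as KOI8-R: " + isValidUtf8(russian.getBytes(koi8r)));
        System.out.println("English as KOI8-R: " + isValidUtf8(english.getBytes(koi8r)));
    }
}
```

English text passes either way because its bytes are plain ASCII, which fits the symptom described above: English stopwords work and Russian ones do not. Keep in mind that a failed check proves the file is not UTF-8, while a passed check only shows the bytes are well-formed, not that they mean what you intended.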