Re: Russian stopwords

2013-05-24 Thread igiguere
Just so everyone knows:

It turns out my stopwords.txt was OK after all.  It works correctly on
Linux (Ubuntu) and, strangely, on a colleague's Windows 7 machine.  My
computer also runs Windows 7.  The only difference between the two Windows
machines is the interface language (French for mine, English for my
colleague's).

Strange... Very very strange.  I hope someone from Microsoft reads this
someday.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Russian-stopwords-tp491490p4065910.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Russian stopwords

2013-05-24 Thread Alexandre Rafalovitch
Sounds like it may be a UTF-8-related issue when you are _reading it in_.
See if you can change the default locale before starting the Java process
(I think it is an environment variable) and check whether that makes a
difference.

If you have a very easy test-case, I would be happy to check it on Mac
and Windows. I know Russian (and UTF-8 issues).

Regards,
   Alex.




Re: Russian stopwords

2013-05-24 Thread igiguere
A colleague stumbled upon this :

http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding

The second answer, setting the environment variable JAVA_TOOL_OPTIONS, did
the job.

JAVA_TOOL_OPTIONS : -Dfile.encoding=UTF8

Happy stop-wording !
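
For anyone wondering why that setting matters, here is a minimal sketch (not taken from Solr) of what happens when UTF-8 bytes are decoded with a different charset. ISO-8859-1 is used only as a stand-in for a Western European platform default; that is an assumption, since the thread does not say what the default was on the affected machine. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {

    // Decodes raw file bytes with the given charset, the way a reader
    // created for that charset would.
    static String decode(byte[] fileBytes, Charset charset) {
        return new String(fileBytes, charset);
    }

    public static void main(String[] args) {
        // The charset the JVM falls back to when code does not name one.
        // On older JDKs it comes from the OS locale unless file.encoding
        // is set; JDK 18 and later default to UTF-8.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // A two-letter Russian stopword (written with escapes so the source
        // file's own encoding does not matter) and a plain ASCII word.
        String russianWord = "\u043d\u0430";
        byte[] russian = russianWord.getBytes(StandardCharsets.UTF_8);
        byte[] english = "the".getBytes(StandardCharsets.UTF_8);

        // Decoded as UTF-8, the Russian word round-trips.
        System.out.println(decode(russian, StandardCharsets.UTF_8).equals(russianWord));
        // Decoded as ISO-8859-1 (stand-in for a non-UTF-8 default), it turns
        // into four unrelated characters, one per byte.
        System.out.println(decode(russian, StandardCharsets.ISO_8859_1).length());
        // The ASCII word is unaffected either way.
        System.out.println(decode(english, StandardCharsets.ISO_8859_1).equals("the"));
    }
}
```

Running this with and without -Dfile.encoding=UTF8 is one way to confirm from the first printed line whether the JVM actually picked the option up.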






Re: Russian stopwords

2013-05-22 Thread igiguere
I'm encountering the same issue, but my Russian stopwords.txt IS encoded in
UTF-8.

I verified the encoding using EmEditor (I've used it for years, and I use it
for the existing English, French, Spanish, Portuguese and German Solr
configurations, without issues).
Just to make extra sure, I downloaded Edit Plus, as mentioned in this
thread, and verified the encoding again: UTF-8

I realize this may sound like a stupid question, but... could there be any
issue other than encoding?

Thanks;





RE: Russian stopwords

2008-12-06 Thread tushar kapoor

Hi Steve,

You were right, it turned out to be an encoding issue, but a really weird
one. I was using Windows Notepad to save the stopwords file in UTF-8
encoding, whereas I was using EditPlus to save the synonyms file. That
was the only difference. The moment I switched to EditPlus for saving the
stopwords file, it started working for Russian, German and all other
languages.

Anyway, thanks for suggesting a valid direction.

Regards,
Tushar.
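
The thread does not establish what Notepad wrote differently. One known difference worth ruling out is that Notepad versions of that era prepend a byte-order mark (bytes EF BB BF) when saving as UTF-8, and Java's UTF-8 decoder passes it through as the character U+FEFF. That alone would only affect the first entry in the file, so it is offered as something to check, not as the confirmed cause. A minimal sketch with hypothetical class and method names:

```java
import java.nio.charset.StandardCharsets;

public class BomCheckDemo {

    // True if the bytes start with the UTF-8 byte-order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] fileBytes) {
        return fileBytes.length >= 3
                && (fileBytes[0] & 0xFF) == 0xEF
                && (fileBytes[1] & 0xFF) == 0xBB
                && (fileBytes[2] & 0xFF) == 0xBF;
    }

    // Removes a leading U+FEFF left over after decoding, if there is one.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // A two-letter Russian stopword, written with escapes.
        String word = "\u043d\u0430";
        // Simulated file content: a byte-order mark followed by the word.
        byte[] withBom = ("\uFEFF" + word).getBytes(StandardCharsets.UTF_8);
        String decoded = new String(withBom, StandardCharsets.UTF_8);

        // The mark is present in the raw bytes.
        System.out.println(hasUtf8Bom(withBom));
        // The decoded text still carries U+FEFF, so it does not equal the word.
        System.out.println(decoded.equals(word));
        // After stripping the mark the comparison succeeds.
        System.out.println(stripBom(decoded).equals(word));
    }
}
```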






RE: Russian stopwords

2008-12-06 Thread Lance Norskog
The default encoding on Windows is not UTF-8, which causes various weirdness
when you develop on Windows. It has helped me find all the places in
string-handling code that need the encoding name parameter, so it's not all
bad.

Lance 





Russian stopwords

2008-12-05 Thread tushar kapoor

I am trying to filter Russian stopwords but have not been successful with
that. I am using the following schema entry:

...
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
...

Interestingly, Russian synonyms are working fine. English and Russian
synonyms get searched correctly.

Also, if I add an English-language word to stopwords.txt, it gets filtered
correctly. It's the Russian words that are not getting filtered as stopwords.

Can someone explain this behaviour?

Thanks,
Tushar.
-- 
View this message in context: 
http://www.nabble.com/Russian-stopwords-tp20851093p20851093.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Russian stopwords

2008-12-05 Thread Steven A Rowe
Hi Tushar,

On 12/05/2008 at 5:18 AM, tushar kapoor wrote:
> I am trying to filter Russian stopwords but have not been
> successful with that.
> [...]
> <filter class="solr.StopFilterFactory" ignoreCase="true"
>         words="stopwords.txt"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>         ignoreCase="true" expand="false"/>
> [...]
> Interestingly, Russian synonyms are working fine. English and Russian
> synonyms get searched correctly.
>
> Also, if I add an English-language word to stopwords.txt it
> gets filtered correctly. It's the Russian words that are not
> getting filtered as stopwords.
It might be an encoding issue - StopFilterFactory delegates stopword file 
reading to SolrResourceLoader.getLines(), which uses an InputStreamReader 
instantiated with the UTF-8 charset.  Is your stopwords.txt encoded as UTF-8?

It's strange that synonyms are working fine, though - SynonymFilterFactory 
reads in the synonyms file using the same mechanism as StopFilterFactory - is 
it possible that your synonyms file is encoded as UTF-8, but your stopwords 
file is encoded with a different encoding, perhaps KOI8-R?  Like UTF-8, KOI8-R 
includes the entirety of 7-bit ASCII, so English words would be properly 
decoded under UTF-8.

Steve
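
The scenario described above can be reproduced in a few lines. This is a minimal sketch, not Solr's actual code: it reads lines through an InputStreamReader fixed to UTF-8 and feeds it a small file whose bytes are KOI8-R. The class and method names are hypothetical, and the KOI8-R bytes are written out by hand so the example does not depend on that charset being installed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf8LinesDemo {

    // Reads all lines through an InputStreamReader created with the UTF-8
    // charset, roughly the mechanism described above.
    static List<String> readLinesAsUtf8(byte[] fileBytes) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // A two-line stopword file as KOI8-R bytes: "the", then a two-letter
        // Russian word (0xCE 0xC1 in KOI8-R).
        byte[] koi8File = {'t', 'h', 'e', '\n', (byte) 0xCE, (byte) 0xC1, '\n'};
        String intendedWord = "\u043d\u0430";

        List<String> lines = readLinesAsUtf8(koi8File);
        // The ASCII line survives because both encodings agree on 7-bit ASCII.
        System.out.println(lines.get(0).equals("the"));
        // The Russian line does not decode to the intended word, so a
        // stopword lookup for it would never match.
        System.out.println(lines.get(1).equals(intendedWord));
    }
}
```

This matches the reported symptom: English entries in the file keep working while Russian ones silently stop matching.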