problems with search on Russian content

2002-11-21 Thread Andrey Grishin
Hi All, 
I have a problem searching Russian content using Lucene 1.2.

I indexed the content using the Cp1251 charset:

text = new String(text.getBytes("Cp1251"));
doc.add(Field.Text(CONTENT_FIELD,text));


and I am searching using the same charset

String txt = "Анд";
txt = new String(txt.getBytes("Cp1251"));
PrefixQuery query = new PrefixQuery(new Term(PortalHTMLDocument.CONTENT_FIELD, txt));
hits = searcher.search(query);

or 

Analyzer analyzer = new StandardAnalyzer();
String txt = "Андрей";
txt = new String(txt.getBytes("Cp1251"));
Query query = QueryParser.parse(txt, PortalHTMLDocument.CONTENT_FIELD, analyzer);

hits = searcher.search(query);


and Lucene can't find anything.
I also checked for a DecodeInterceptor in my server.xml; there isn't one.

I tried UTF-8/16 and got the same result.

Also, if I list the index's contents by iterating over an IndexReader, I can
see that my Russian content is stored in the index.
Can you please help me? Do you have any more ideas about what else can be done here to 
fix this problem?

I would appreciate any help.
Thanks, Andrey.

P.S.
I am using Lucene 1.2, Tomcat 4.1.12 and JDK 1.4.1 on Win2000 AS.
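
A note on the conversion used in the post above: a Java String is always
Unicode internally, so new String(text.getBytes("Cp1251")) does not mark the
text as Cp1251. It encodes the string to Cp1251 bytes and then decodes those
bytes with the platform default charset, which changes nothing when the
default is Cp1251 and corrupts the text otherwise. The charset only matters
at the point where bytes become characters. A minimal sketch of the
difference, using only the JDK: the class name Cp1251RoundTrip and the sample
word are made up for illustration, and the Charset-based overloads are newer
than JDK 1.4 (there the charset-name String overloads would be the
equivalent).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1251RoundTrip {
    // "windows-1251" is the canonical name of the charset Java also calls Cp1251.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Decodes raw bytes with an explicitly named charset, never the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only; the word is "Андрей".
        String word = "\u0410\u043d\u0434\u0440\u0435\u0439";

        byte[] raw = word.getBytes(CP1251);                      // one byte per letter
        String right = decode(raw, CP1251);                      // same charset on both sides
        String wrong = decode(raw, StandardCharsets.ISO_8859_1); // mismatched charset

        System.out.println(raw.length);          // 6
        System.out.println(right.equals(word));  // true
        System.out.println(wrong.equals(word));  // false
    }
}
```

If the text is already a correct String at indexing time and at query time,
no getBytes round trip should be needed at all.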


problems with search on Russian content

2003-06-06 Thread Vladimir
Hi!

I have lucene-1.3-rc1 and jdk1.3.1.

What should I change in the demo example to search
HTML files encoded in Cp1251?

Thanks,
Vladimir.
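
One general approach, sketched under assumptions: wherever the indexing code
opens an HTML file with the platform default encoding, wrap the byte stream
in an InputStreamReader that names the charset explicitly and hand that
Reader on. Which class in the demo does the file reading is not stated in
this thread, so the sketch below uses only the JDK; Cp1251FileReading,
readAll and writeThenRead are hypothetical names, and the java.nio.file calls
are newer than JDK 1.3.1 (there, new InputStreamReader(new
FileInputStream(file), "Cp1251") would be the equivalent).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

final class Cp1251FileReading {
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Reads a whole file, decoding its bytes with the given charset
    // instead of the platform default.
    static String readAll(Path file, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Writes text to a temporary file in the given charset, reads it back, cleans up.
    static String writeThenRead(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("cp1251-demo", ".html");
            try {
                Files.write(tmp, text.getBytes(charset));
                return readAll(tmp, charset);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<p>\u043f\u043e\u0438\u0441\u043a</p>"; // the word is "поиск"
        System.out.println(writeThenRead(html, CP1251).equals(html)); // true
    }
}
```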
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: problems with search on Russian content

2002-11-21 Thread Otis Gospodnetic
Look at the CHANGES.txt document in CVS: there is some new stuff in the
org.apache.lucene.analysis.ru package that you will want to use.
Get Lucene from the nightly build.

Otis





Re: problems with search on Russian content

2002-11-22 Thread Karl Øie
Hi, I took a look at Andrey Grishin's Russian character problem and found 
something strange while we tried to debug it. It seems that he has 
avoided the usual "querying with a different encoding than the one used 
for indexing" problem, as he can dump out correctly encoded Russian at 
all points in his application.

Are the strings for terms treated differently from the text stored in 
text fields? The reason I ask is that his Russian words are correct in 
the stored text fields, but show up faulty in a terms() dump. If he 
had a character encoding problem in his application, the fields should 
show up faulty as well, I think. Even stranger is that I use Lucene 1.2 
successfully for UTF-8, ISO-8859-1, ISO-8859-5 and ISO-8859-7. Why does 
this problem show up in Russian (Cp1251) and not in the other encodings?

Strangeness number two: if the Russian word "Анд" were skewed to, say, 
"0d66539qw" upon indexing, and the problem were just a consistent 
encoding problem, wouldn't a query for "Анд" be skewed to "0d66539qw" 
too and be found anyway?

mvh karl øie


Begin forwarded message:

From: "Andrey Grishin" <[EMAIL PROTECTED]>
Date: Thu Nov 21, 2002  15:13:33 Europe/Oslo
To: "Karl Oie" <[EMAIL PROTECTED]>
Subject: Re: How to include strange characters??

Yes, you are right: there are no Russian words in the returned terms :(((
I've just executed the following:
--
IndexReader r =
IndexReader.open("C:\\j\\jakarta-tomcat-4.1.12\\index\\ukrenergo");
TermEnum e = r.terms();
while (e.next()) {
  Term term = (Term) e.term();
  System.out.println("term : " + term.text());
}
--
and got no Russian words in the result.
Some "strange" terms are returned instead of Russian ones:
term : 0d4xvp70w
term : 0d66539qw
term : 0d67les2o
term : 0d6eqgic0
etc.

So, I think we have a problem. This is great :)), thank you...
but how to fix it?
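
One hedged way to look at the "strange" terms listed above: they are all
nine characters of digits and lowercase letters, which is what a number
written in base 36 looks like. Read as base-36 milliseconds since the epoch,
the four sample terms come out as whole-second timestamps in August and
September 2002. That would fit values from an indexed date field (Lucene's
DateField uses a similar base-36 form, as far as I recall) rather than
mangled Russian words, but the thread does not confirm this, so treat it as
an assumption to check. TermAsTimestamp and decode are hypothetical names;
only the JDK is used.

```java
import java.time.Instant;

final class TermAsTimestamp {
    // Interprets a term as a base-36 count of milliseconds since the epoch.
    // That interpretation is an assumption, not something stated in the thread.
    static long millis(String term) {
        return Long.parseLong(term, Character.MAX_RADIX);
    }

    static Instant decode(String term) {
        return Instant.ofEpochMilli(millis(term));
    }

    public static void main(String[] args) {
        // The four sample terms from the terms() dump above.
        String[] terms = {"0d4xvp70w", "0d66539qw", "0d67les2o", "0d6eqgic0"};
        for (String term : terms) {
            System.out.println(term + " -> " + decode(term));
        }
    }
}
```

If this reading holds, the Russian words would have to be looked for
elsewhere in the terms() output, since terms are listed field by field.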




- Original Message -
From: "Karl Øie" <[EMAIL PROTECTED]>
To: "Andrey Grishin" <[EMAIL PROTECTED]>
Sent: Thursday, November 21, 2002 3:56 PM
Subject: Re: How to include strange characters??


Another thing to check is whether IndexReader.terms() actually
contains your term.

mvh karl oie

On Thursday, Nov 21, 2002, at 14:31 Europe/Oslo, Andrey Grishin wrote:


Karl,
I have the same problem with Lucene search on Russian content.
I tried all your advice, but Lucene still can't find anything.
I indexed the content using the Cp1251 charset:

text = new String(text.getBytes("Cp1251"));
doc.add(Field.Text(CONTENT_FIELD,text));

and I am searching using the same charset
String txt = "Анд";
txt = new String(txt.getBytes("Cp1251"));
PrefixQuery query = new PrefixQuery(new
Term(PortalHTMLDocument.CONTENT_FIELD, txt));
hits = searcher.search(query);

and Lucene can't find anything.
I also checked for a DecodeInterceptor in my server.xml; there
isn't one.
I tried UTF-8/16 and got the same result.
If I list the index's contents by iterating over an IndexReader, I can see
that my Russian content is stored in the index.
Can you please help me? Do you have any more ideas about what else can
be done here to fix this problem?

I would appreciate any help.
Thanks, Andrey.

P.S.
I am using lucene 1.2, tomcat 4.1.12, jdk 1.4.1 on Win2000 AS









Re: problems with search on Russian content

2002-11-22 Thread Karl Øie
Sorry, my bad! Didn't read this informative post :-)

mvh karl øie


On Thursday, Nov 21, 2002, at 16:35 Europe/Oslo, Otis Gospodnetic wrote:


Look at CHANGES.txt document in CVS - there is some new stuff in
org.apache.lucene.analysis.ru package that you will want to use.
Get the Lucene from the nightly build...

Otis





Re: problems with search on Russian content

2002-11-25 Thread Andrey Grishin
I got the nightly build from CVS.

When I try to use IndexWriter this way:
writer = new IndexWriter(indexDirectory, new
RussianAnalyzer("Cp1251".toCharArray()), true);
I get the following exception:

---
java.lang.ArrayIndexOutOfBoundsException: 7
at org.apache.lucene.analysis.ru.RussianAnalyzer.makeStopWords(RussianAnalyzer.java:521)
at org.apache.lucene.analysis.ru.RussianAnalyzer.<init>(RussianAnalyzer.java:473)

---


When I try to use it this way:
writer = new IndexWriter(indexDirectory, new
RussianAnalyzer("Cp1251".toCharArray(), new String[] {}), true);
I get the following exception:

---
2002-11-25 15:09:09,044 [ua.kiev.softline.services.searcher.index.PublishingIndexerImpl] INFO   - --Throwable in addArticle(): java.lang.ArrayIndexOutOfBoundsException: 8
java.lang.ArrayIndexOutOfBoundsException: 8
at org.apache.lucene.analysis.ru.RussianStemmer.isVowel(RussianStemmer.java:991)
at org.apache.lucene.analysis.ru.RussianStemmer.markPositions(RussianStemmer.java:909)
at org.apache.lucene.analysis.ru.RussianStemmer.stem(RussianStemmer.java:1551)
at org.apache.lucene.analysis.ru.RussianStemFilter.next(RussianStemFilter.java:189)
at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:209)
at ua.kiev.softline.services.searcher.index.PublishingIndexerImpl.addArticle(PublishingIndexerImpl.java:130)

---

When I comment out line 575 in RussianAnalyzer.java:
result = new RussianStemFilter(result, charset);
everything works fine: I can search for (and find :)) Russian words.

Am I doing something wrong?

Regards, Andrey
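
One hedged reading of both stack traces: the char[] argument looks like a
lookup table of letters that the analyzer and stemmer index by position, not
a charset name. If so, "Cp1251".toCharArray() hands them a 6-element table,
and any lookup at position 7 or 8 overruns it, which matches the two
exceptions. The nightly build appears to ship ready-made tables in
org.apache.lucene.analysis.ru.RussianCharsets (for example
RussianCharsets.CP1251); that name is from memory and is not confirmed in
this thread. The toy below uses no Lucene classes: TableLookupSketch,
letterAt and overruns are hypothetical stand-ins that only reproduce the
failure mode.

```java
final class TableLookupSketch {
    // Hypothetical stand-in for a helper that reads a letter by table position.
    static char letterAt(char[] table, int index) {
        return table[index];
    }

    // Returns true if reading the given position runs past the end of the table.
    static boolean overruns(char[] table, int index) {
        try {
            letterAt(table, index);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The charset name used as a table: 6 chars, so valid positions are 0..5.
        char[] nameAsTable = "Cp1251".toCharArray();

        System.out.println(nameAsTable.length);            // 6
        System.out.println(overruns(nameAsTable, 7));      // true
        System.out.println(overruns(nameAsTable, 8));      // true
        // A table with one slot per letter of the alphabet does not overrun.
        System.out.println(overruns(new char[33], 8));     // false
    }
}
```

That commenting out the RussianStemFilter line avoids the second exception
fits this reading, since the stemmer is where the table is indexed in that
trace.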


