RE: Posting unicode data to lucene not working during searching/retreival!

Uwe Schindler Thu, 21 May 2009 05:18:49 -0700

If you print the result, e.g. to a web page through the servlet API, the
output is written as ISO-8859-1 (which is the default for HTTP). If you want
to change this, you must tell the servlet layer the encoding before getting
a PrintWriter (response.setCharacterEncoding("UTF-8"),
response.setContentType("text/html; charset=UTF-8") or something like that).
Or just get the ServletOutputStream and wrap it in an OutputStreamWriter with
an explicit charset, just as before. Either way you have to tell the browser
the encoding (which is done through the Content-Type header). None of this is
Lucene-specific, so you should ask on the list for Tomcat, Jetty, or whatever
container you use.
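
A minimal sketch of the OutputStreamWriter route, assuming nothing beyond
the JDK. The class and method names are made up for illustration, and the
ByteArrayOutputStream only stands in for the stream a servlet would get from
response.getOutputStream(); in a real servlet the Content-Type header still
has to be set before the body is written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

class Utf8OutputSketch {
    // Encodes text through an OutputStreamWriter with an explicit charset,
    // so the platform default encoding never comes into play.
    // Assumption: the in-memory stream stands in for response.getOutputStream().
    static byte[] renderUtf8(String text) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        try {
            Writer writer = new OutputStreamWriter(body, Charset.forName("UTF-8"));
            writer.write(text);
            writer.flush(); // push buffered chars through the encoder
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown to keep the sketch simple.
            throw new IllegalStateException(e);
        }
        return body.toByteArray();
    }

    public static void main(String[] args) {
        // U+0915 (Devanagari letter KA) as a sample of non-Latin text.
        byte[] bytes = renderUtf8("\u0915");
        System.out.println(bytes.length + " bytes written");
    }
}
```

The explicit charset argument is what matters here: without it, the writer
falls back to the platform default, which is what makes the output differ
from machine to machine.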


Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -----Original Message-----
> From: KK [mailto:[email protected]]
> Sent: Thursday, May 21, 2009 7:01 PM
> To: [email protected]
> Subject: Re: Posting unicode data to lucene not working during
> searching/retreival!
> 
> I made all the changes but saw no improvement. The data is getting indexed
> properly, I think, because I can see the results through Luke, and Luke
> has options for viewing the results in both UTF-8 encoding and the default
> string encoding. I tried both, but there was no difference: in both cases I
> can see the regional text, but not through the browser. How do I decode
> when fetching the search results through the searcher?
> 
> Thanks
> KK
> 
> On Thu, May 21, 2009 at 1:05 PM, KK <[email protected]> wrote:
> 
> > Thanks @Uwe.
> > #To answer your last mail's query: textOnly is the output of the method
> > downloadPage(), the complete text including all HTML tags etc.
> > #Instead of doing the encode/decode later, what I should do when
> > downloading the page through the BufferedReader is set the charset to
> > UTF-8, as you mentioned in your last mail. So instead of
> >  BufferedReader reader =
> >                     new BufferedReader(new InputStreamReader(
> >                     pageUrl.openStream()));
> >
> > I should do this,
> > BufferedReader reader =
> >                     new BufferedReader(new InputStreamReader(
> >                     pageUrl.openStream(), Charset.forName("UTF-8")));
> >
> > right? and remove this conversion that I'm doing later ,
> >
> > byte [] utfEncodeByteArray = textOnly.getBytes();
> >  String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
> >
> > This will make sure I'm not depending on the platform encoding, right?
> > This seems to fix my indexing issue. Now, regarding searching: I don't
> > need to mention any charset there, since I'm using the StandardAnalyzer?
> > As far as I know, Lucene stores the chars as raw Unicode, so when I
> > present my query in the same Unicode format Lucene will give me proper
> > results. Currently I'm not using the encoding for HTTP parameters; I'll
> > use that and let you know.
> > Thank you very much.
> >
> > KK,
> >
> >
> > On Thu, May 21, 2009 at 12:50 PM, Uwe Schindler <[email protected]> wrote:
> >
> >> I forgot:
> >>
> >> > byte [] utfEncodeByteArray = textOnly.getBytes();
> >> > String utfString = new String(utfEncodeByteArray,
> >> > Charset.forName("UTF-8"));
> >> >
> >> > here textonly is the text extracted from the downloaded page
> >>
> >> What is textonly here? A String? If yes, why encode it and then decode
> >> it again? The important thing is:
> >> Strings in Java are always independent of charsets (internally they are
> >> UTF-16). So if you convert a byte array to a String you have to specify
> >> a charset (as you have done in the new String code). If you convert a
> >> String to a byte array, you must do the same.
> >>
> >> As mentioned in the mail before, the same is true when converting
> >> InputStreams to Readers and Writers to OutputStreams (this can be done
> >> using the converter classes InputStreamReader and OutputStreamWriter).
> >>
> >> And: if you get a String from somewhere that looks bad, you cannot
> >> convert the String to another encoding; it was already corrupted during
> >> the earlier conversion to a String.
> >>
> >> E.g. in a web application, use ServletRequest.setCharacterEncoding() to
> >> specify the input encoding of the HTTP parameters and so on.
> >>
> >> Uwe
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
> >
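
The quoted advice about naming the charset when turning an InputStream into
a Reader can be sketched as follows. This is only an illustration under
stated assumptions: the class and method names are invented, and an in-memory
byte array stands in for the stream returned by pageUrl.openStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class ExplicitCharsetRead {
    static final Charset UTF8 = Charset.forName("UTF-8");

    // Reads the whole stream into a String, decoding with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown to keep the sketch simple.
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Two Devanagari letters, encoded as UTF-8 the way a web page might send them.
        byte[] page = "\u0915\u0916".getBytes(UTF8);

        // Decoding with the charset the bytes were written in gives the text back.
        String good = readAll(new ByteArrayInputStream(page), UTF8);

        // Decoding with a different charset turns every byte into its own char.
        String bad = readAll(new ByteArrayInputStream(page), Charset.forName("ISO-8859-1"));

        System.out.println(good.length() + " chars vs " + bad.length() + " chars");
    }
}
```

The same bytes decode to two characters with UTF-8 and to six with
ISO-8859-1, which is the kind of garbling that shows up when the platform
default does not match the page's encoding.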


