If you print the result e.g. to a webpage through the servlet API, the output is done with ISO-8859-1 (which is the default for HTTP). If you want to change this, you must tell the servlet layer the encoding before getting a PrintWriter (response.setEncoding(), response.setContentTpe("text/html; charset=UTF-8") or something like that. Or just get the ServletOutputStream and convert using a OutputStreamWriter just as before. But you have to tell the browser the encoding... (which is done through the Content-Type header step). This all is not Lucene specific, so you should ask on a Tomcat/Jetty/whatever-container-you use list.
Uwe ----- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -----Original Message----- > From: KK [mailto:dioxide.softw...@gmail.com] > Sent: Thursday, May 21, 2009 7:01 PM > To: java-user@lucene.apache.org > Subject: Re: Posting unicode data to lucene not working during > searching/retreival! > > I did all the changes but no improvement. the data is getting indexed > properly, I think because I'm able to see the results through luke and > luke > has option for seeing the results in both utf-8 encoding and string > default > encoding. I tried to use both but no difference. In both the cases I'm > able > to see the regional text. but no through the browser . How to decoding > when > fetching the search results throught searcher? > > Thanks > KK > > On Thu, May 21, 2009 at 1:05 PM, KK <dioxide.softw...@gmail.com> wrote: > > > Thanks @Uwe. > > #To answer your last mails query, textOnly is the output of the method > > downloadPage(), complete text thing includeing all html tags etc... > > #Instead of doing the encode/decode later, what i should do is when > > downloading the page through buffered reader put the charset as utf-8 as > you > > mentioned in your last mail. so instead of > > BufferedReader reader = > > new BufferedReader(new InputStreamReader( > > pageUrl.openStream())); > > > > I should do this, > > BufferedReader reader = > > new BufferedReader(new InputStreamReader( > > pageUrl.openStream(), <mention the charset like > > Charset.forName("UTF-8")>)); > > > > right? and remove this conversion that I'm doing later , > > > > byte [] utfEncodeByteArray = textOnly.getBytes(); > > String utfString = new String(utfEncodeByteArray, Charset.forName("UTF- > > 8")); > > > > This will make sure I'm not depending on the platform encoding, right? > This > > seems to fix my indexing issue. Now regarding searching I dont need to > > mention any charset thing there, I'm using stardard anyalyzer? As I know > > lucene stores the chars as raw unicode so when I present my query in the > > same unicode format lucene will give me proper results. Currently I'm > not > > using the encoding for HTTP parameters, I'll use that and let you know. > > Thank you very much. > > > > KK, > > > > > > On Thu, May 21, 2009 at 12:50 PM, Uwe Schindler <u...@thetaphi.de> wrote: > > > >> I forgot: > >> > >> > byte [] utfEncodeByteArray = textOnly.getBytes(); > >> > String utfString = new String(utfEncodeByteArray, > Charset.forName("UTF- > >> > 8")); > >> > > >> > here textonly is the text extracted from the downloaded page > >> > >> What is textonly here? A String, if yes, why decode and then again > encode > >> it? The important thing is: > >> Strings in Java are always invariant to charsets (internally they are > >> UTF-16). So if you convert a byte array to a string you have to specify > a > >> charset (as you have done in new String code). If you convert a String > to > >> a > >> byte array, you must do the same. > >> > >> As mentioned in the mail before, the same is true, when converting > >> InputStreams to Readers and Writers to OutputStreams (this can be done > >> using > >> the converter). > >> > >> And: If you get a String from somewhere, that looks bad, you cannot > >> convert > >> the String to another encoding, it was corrupted during conversion to > >> string > >> before. > >> > >> E.g. in a WebAppclcation, use ServletRequest.setEncoding() to specify > the > >> input encoding of the HTTP parameters and so on. > >> > >> Uwe > >> > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > >> For additional commands, e-mail: java-user-h...@lucene.apache.org > >> > >> > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org