Thank you very much. As you suggested, I just added a single line to the JSP page specifying the charset as UTF-8, and it worked like a charm. Thank you.
KK

On Thu, May 21, 2009 at 5:47 PM, Uwe Schindler <u...@thetaphi.de> wrote:

> If you print the result e.g. to a webpage through the servlet API, the
> output is done with ISO-8859-1 (which is the default for HTTP). If you want
> to change this, you must tell the servlet layer the encoding before getting
> a PrintWriter (response.setCharacterEncoding(),
> response.setContentType("text/html; charset=UTF-8") or something like that).
> Or just get the ServletOutputStream and convert using an OutputStreamWriter
> just as before. But you have to tell the browser the encoding (which is
> done through the Content-Type header step). None of this is Lucene
> specific, so you should ask on a Tomcat/Jetty/whatever-container-you-use
> list.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: KK [mailto:dioxide.softw...@gmail.com]
> > Sent: Thursday, May 21, 2009 7:01 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Posting unicode data to lucene not working during
> > searching/retreival!
> >
> > I made all the changes but there is no improvement. The data is getting
> > indexed properly, I think, because I'm able to see the results through
> > Luke, and Luke has an option for seeing the results in both UTF-8
> > encoding and string default encoding. I tried both but there is no
> > difference. In both cases I'm able to see the regional text, but not
> > through the browser. How do I decode when fetching the search results
> > through the searcher?
> >
> > Thanks
> > KK
> >
> > On Thu, May 21, 2009 at 1:05 PM, KK <dioxide.softw...@gmail.com> wrote:
> >
> > > Thanks @Uwe.
> > > #To answer your last mail's query, textOnly is the output of the method
> > > downloadPage(), the complete text including all HTML tags etc.
> > > #Instead of doing the encode/decode later, what I should do is, when
> > > downloading the page through the buffered reader, set the charset to
> > > UTF-8 as you mentioned in your last mail. So instead of
> > >
> > > BufferedReader reader =
> > >     new BufferedReader(new InputStreamReader(
> > >         pageUrl.openStream()));
> > >
> > > I should do this,
> > >
> > > BufferedReader reader =
> > >     new BufferedReader(new InputStreamReader(
> > >         pageUrl.openStream(), <mention the charset like
> > >         Charset.forName("UTF-8")>));
> > >
> > > right? And remove this conversion that I'm doing later,
> > >
> > > byte [] utfEncodeByteArray = textOnly.getBytes();
> > > String utfString = new String(utfEncodeByteArray,
> > >     Charset.forName("UTF-8"));
> > >
> > > This will make sure I'm not depending on the platform encoding, right?
> > > This seems to fix my indexing issue. Now regarding searching, I don't
> > > need to mention any charset there, since I'm using the standard
> > > analyzer? As far as I know, Lucene stores the chars as raw Unicode, so
> > > when I present my query in the same Unicode format Lucene will give me
> > > proper results. Currently I'm not setting the encoding for HTTP
> > > parameters; I'll use that and let you know.
> > > Thank you very much.
> > >
> > > KK,
> > >
> > > On Thu, May 21, 2009 at 12:50 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> > >
> > >> I forgot:
> > >>
> > >> > byte [] utfEncodeByteArray = textOnly.getBytes();
> > >> > String utfString = new String(utfEncodeByteArray,
> > >> >     Charset.forName("UTF-8"));
> > >> >
> > >> > here textonly is the text extracted from the downloaded page
> > >>
> > >> What is textonly here? A String? If yes, why encode it and then
> > >> decode it again? The important thing is:
> > >> Strings in Java are always independent of charsets (internally they
> > >> are UTF-16). So if you convert a byte array to a String you have to
> > >> specify a charset (as you have done in the new String code). If you
> > >> convert a String to a byte array, you must do the same.
> > >>
> > >> As mentioned in the mail before, the same is true when converting
> > >> InputStreams to Readers and Writers to OutputStreams (this can be
> > >> done using the converter).
> > >>
> > >> And: if you get a String from somewhere and it looks bad, you cannot
> > >> convert the String to another encoding; it was already corrupted
> > >> during the earlier conversion to a String.
> > >>
> > >> E.g. in a web application, use ServletRequest.setCharacterEncoding()
> > >> to specify the input encoding of the HTTP parameters and so on.
> > >>
> > >> Uwe
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >> For additional commands, e-mail: java-user-h...@lucene.apache.org
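
For anyone finding this thread later, here is a minimal sketch of the rule discussed above: name the charset at every byte/char boundary and never rely on the platform default. The class and method names (CharsetRoundTrip, readAll, writeAll) are made up for illustration and are not from the thread or from Lucene. It also assumes Java 7 or later for StandardCharsets and try-with-resources; on older JVMs Charset.forName("UTF-8") does the same job.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class CharsetRoundTrip {

    // Bytes -> chars: decode the stream with an explicit charset
    // instead of the platform default.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Chars -> bytes: encode through an OutputStreamWriter that is
    // also told the charset explicitly.
    static byte[] writeAll(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Devanagari plus an accented Latin letter, written as escapes so
        // the source file's own encoding does not matter.
        String original = "\u0939\u093f\u0928\u094d\u0926\u0940 caf\u00e9";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String decoded = readAll(new ByteArrayInputStream(utf8));
        System.out.println(decoded.equals(original));                  // true
        System.out.println(Arrays.equals(writeAll(decoded), utf8));    // true

        // Decoding the same bytes with the wrong charset yields a
        // different (garbled) String.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(original));                    // false
    }
}
```

The servlet side is not exercised here because it needs a container. In the Servlet API the corresponding calls are response.setCharacterEncoding("UTF-8") or response.setContentType("text/html; charset=UTF-8"), made before response.getWriter(), and request.setCharacterEncoding("UTF-8") before reading any parameters. In a JSP, the single line is typically a page directive such as <%@ page contentType="text/html; charset=UTF-8" %>.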