I don't think I correctly understand your proposal. As a basis, I am using Demo3 with indexHTML, HTMLDocument and HTMLParser. Inside HTMLParser, I am calling getMetaTags (which calls addMetaData), which returns a Properties object. My issue comes from this definition: Properties are stored in ISO-8859-1 encoding, while all my data, inside and outside, is encoded in "UTF-8". I have not managed to get UTF-8 values from Parser.GetMetaTags() through any conversion. The data is extracted from an HTML page with UTF-8 encoding declared at the beginning of the file. I do not see how to call request.setEncoding("UTF-8"): I need the Parser to know about the UTF-8 encoding, and that does not seem possible when using a Properties object.
Any feedback?

-----Original Message-----
From: Praveen Peddi [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 15:12
To: Lucene Users List
Subject: Re: Problems indexing Japanese with CJKAnalyzer

If it's a web application, you have to call request.setEncoding("UTF-8") before reading any parameters. Also make sure the HTML page encoding is specified as "UTF-8" in the meta tag. Most web app servers decode the request parameters using the system's default encoding. If you call the above method, I think it will solve your problem.

Praveen

----- Original Message -----
From: "Bruno Tirel" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Thursday, July 15, 2004 6:15 AM
Subject: RE: Problems indexing Japanese with CJKAnalyzer

Hi All,

I am also trying to localize everything for a French application, using UTF-8 encoding. I have already applied what Jon described. I fully confirm his recommendation for the HTML Parser and HTML Document changes with UNICODE and "UTF-8" encoding specification.

In my case, one thing still does not work: using meta-data from the HTML document, as in the demo3 example. Whether I convert to "UTF-8" or to "ISO-8859-1", it is still not correctly encoded when I check with Luke. The word "Propriété" is seen either as "Propri?t?" with a square, or as "Propriã©tã©". My local codepage is Cp1252, so it should be viewed as ISO-8859-1. I get the same result when I use the local FileEncoding parameter.

All the other fields are correctly encoded in UTF-8, tokenized and successfully searched through a JSP page.

Has anybody else run into this issue? Is any help available?

Best regards,
Bruno

-----Original Message-----
From: Jon Schuster [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 14, 2004 22:51
To: 'Lucene Users List'
Subject: RE: Problems indexing Japanese with CJKAnalyzer

Hi all,

Thanks for the help on indexing Japanese documents. I eventually got things working, and here's an update so that other folks might have an easier time in similar situations.
The problem I had was indeed with the encoding, but it was more than just the encoding on the initial creation of the HTMLParser (from the Lucene demo package). In HTMLDocument, doing this:

    InputStreamReader reader = new InputStreamReader(
        new FileInputStream(f), "SJIS");
    HTMLParser parser = new HTMLParser( reader );

creates the parser and feeds it Unicode decoded from the original Shift-JIS document, but then when the document contents are fetched using this line:

    Field fld = Field.Text("contents", parser.getReader() );

HTMLParser.getReader creates an InputStreamReader and OutputStreamWriter using the default encoding, which in my case was Windows 1252 (essentially Latin-1). That was bad.

In the HTMLParser.jj grammar file, adding an explicit encoding of "UTF8" on both the Reader and Writer got things mostly working. The one missing piece was in the "options" section of the HTMLParser.jj file. The original grammar file generates an input character stream class that treats the input as a stream of 1-byte characters. To have JavaCC generate a stream class that handles double-byte characters, you need the option UNICODE_INPUT=true.

So, there were essentially three changes in two files:

HTMLParser.jj - add UNICODE_INPUT=true to the options section; add an explicit "UTF8" encoding on Reader and Writer creation in getReader().

HTMLDocument - add an explicit encoding of "SJIS" when creating the Reader used to create the HTMLParser. (For western languages, I use an encoding of "ISO8859_1".)

As far as I can tell, these changes work fine for all of the languages I need to handle, which are English, French, German, and Japanese. And of course, use the right language tokenizer.
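
To illustrate why the explicit encoding on both the Reader and the Writer matters, here is a minimal sketch. It does not use HTMLParser at all: ExplicitEncodingDemo and its encode, decode and pipe methods are hypothetical names invented for this example, and ISO-8859-1 stands in for a Latin-1-like default such as Windows 1252. It only shows that text survives a writer/reader pair when both sides name a charset that can represent it, and is damaged when they do not.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical demo class; not part of the Lucene demo package.
class ExplicitEncodingDemo {

    // Encode text to bytes through an OutputStreamWriter with a named charset.
    static byte[] encode(String text, String charsetName) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Writer writer = new OutputStreamWriter(out, charsetName);
            writer.write(text);
            writer.close(); // flushes the encoder into 'out'
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode bytes to text through an InputStreamReader with a named charset.
    static String decode(byte[] bytes, String charsetName) {
        try {
            Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), charsetName);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            reader.close();
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Push text through a writer/reader pair that share one charset,
    // loosely mimicking an internal pipe between a parser and its consumer.
    static String pipe(String text, String charsetName) {
        return decode(encode(text, charsetName), charsetName);
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // three kanji, written as escapes

        // UTF-8 can represent the text, so it comes back unchanged.
        System.out.println(pipe(japanese, "UTF-8").equals(japanese));      // true

        // A single-byte Latin-1 charset cannot; each character becomes '?'.
        System.out.println(pipe(japanese, "ISO-8859-1").equals(japanese)); // false
    }
}
```

The same reasoning applies to the Reader that feeds the parser: the charset named there has to match the encoding of the file on disk ("SJIS" or "ISO8859_1" in the cases above).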
--Jon

<earlier responses snipped; see the list archive>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
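
The two symptoms reported earlier in the thread for the word "Propriété" (a "?" in place of each accented letter, or two odd characters per accented letter) match two common encoding mistakes, which the sketch below reproduces. MojibakeDemo and its method names are hypothetical, and java.nio.charset.StandardCharsets needs Java 7 or later, so this is an illustration rather than a fix for the demo3 meta-data code. The repair method only works under the assumption that the wrong decoding step was exactly ISO-8859-1 and that nothing lossy happened in between; whether that holds for the Properties returned by getMetaTags is not established in this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the Lucene demo package.
class MojibakeDemo {

    // Simulate the bug: UTF-8 bytes decoded as if they were ISO-8859-1.
    // Each accented letter turns into two Latin-1 characters.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo misdecode: recover the original bytes, then decode them as UTF-8.
    // Assumption: the garbling was a pure ISO-8859-1 mis-decoding.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    // Simulate the other symptom: encoding into a charset that lacks the
    // character. The encoder substitutes '?' and the letter is lost for good.
    static String lossy(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String word = "Propri\u00e9t\u00e9"; // "Propriété", written with escapes

        String garbled = misdecode(word);
        System.out.println(garbled.length());             // 11 (was 9)
        System.out.println(repair(garbled).equals(word)); // true
        System.out.println(lossy(word));                  // Propri?t?
    }
}
```

If the value seen in the index looks like the two-characters-per-letter form, the bytes are probably intact and only the decoding step is wrong. If it shows "?", the character was already lost at an earlier encoding step and no later conversion can bring it back.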