HTMLParser and Chinese

Jennifer May Fri, 14 Sep 2007 03:24:14 -0700

Hello!

I want to index an HTML document with the Lucene demo, but have problems parsing some Chinese files.

I changed the code in the HTMLDocument class so that I can define the encoding of the document to be parsed:

InputStreamReader fis = new InputStreamReader(new FileInputStream(f), IndexHTML.encoding);
HTMLParser parser = new HTMLParser(fis);
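
For illustration, the change described above can be sketched as a small standalone program that uses only the JDK. The class name EncodedReaderSketch and the helpers openWithEncoding and readAll are hypothetical names, not part of the Lucene demo, and the sketch stops at the point where the resulting Reader would be handed to a parser:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper class; not part of the Lucene demo.
class EncodedReaderSketch {

    // Opens the file as a character stream, decoding its bytes with the
    // named encoding (e.g. "GB2312", "Big5" or "UTF-8") instead of the
    // platform default.
    static Reader openWithEncoding(File f, String encoding) throws IOException {
        return new InputStreamReader(new FileInputStream(f), encoding);
    }

    // Reads the whole character stream into a String so that the decoded
    // text can be inspected before it is passed to a parser.
    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: EncodedReaderSketch <file> <encoding>");
            return;
        }
        try (Reader r = openWithEncoding(new File(args[0]), args[1])) {
            System.out.println(readAll(r));
        }
    }
}
```

Printing the decoded text this way is one means of checking that the chosen encoding matches the file before the parser sees it.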

It works fine for most of my files in GB, Big5 or UTF-8. However, I get the following exception for some of my files:

Parse Aborted: Lexical error at line 6, column 24. Encountered: "\u4f53"(20307), after : ""


The HTML document looks like this:

<HTML><HEAD><meta http-equiv="Content-Type" content="text/html; 
charset=GB2312"><TITLE>刘先生(阿成)</TITLE>
<META NAME="keywords" CONTENT="阿成 魂游天国 刘先生">...

Obviously, the Chinese in the meta tag is the problem. But why? And how can I solve it?

JTidy parses the same file without errors, but then I have problems with the indexing, as the JTidy parser takes only InputStreams without a specified encoding, not InputStreamReaders (at least as far as I could find out). Even if I convert my file from the original GB to UTF-8, I get only gibberish in the Lucene index when using JTidy for parsing.
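
As a rough sketch of the conversion step mentioned above, the following JDK-only code turns an already decoded Reader into an InputStream of UTF-8 bytes, for a parser that accepts only InputStreams. Utf8StreamSketch and toUtf8Stream are hypothetical names. Whether the consuming parser then actually reads those bytes as UTF-8 depends on its own configuration, which this sketch does not cover, and a charset declaration left in the document (such as charset=GB2312) would no longer match the bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; uses only the JDK.
class Utf8StreamSketch {

    // Drains the reader (already decoded from GB, Big5, ...) and re-encodes
    // the text as UTF-8 bytes held in memory. Suitable for documents small
    // enough to buffer completely.
    static InputStream toUtf8Stream(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        byte[] utf8 = sb.toString().getBytes(StandardCharsets.UTF_8);
        return new ByteArrayInputStream(utf8);
    }
}
```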

Thanks in advance for any suggestions, either to get around the HTMLParser problem or to get JTidy to handle different encodings,

Jenny

