Hello!

I want to index an HTML document with the lucene demo, but have problems parsing some Chinese files.

I changed code in the HTMLDocument class as to be able to define the encoding of the document to be parsed: InputStreamReader fis = new InputStreamReader(new FileInputStream(f), IndexHTML.encoding);
HTMLParser parser = new HTMLParser(fis);

It works fine for most of my files in GB, Big5 or UTF-8. However, I get the following exception for some of my files: Parse Aborted: Lexical error at line 6, column 24. Encountered: "\u4f53" (20307), after : ""

The HTML document looks like this:

<HTML><HEAD><meta http-equiv="Content-Type" content="text/html; 
charset=GB2312"><TITLE>刘先生(阿成)</TITLE>
<META NAME="keywords" CONTENT="阿成 魂游天国 刘先生">...

Obviously, the Chinese in the meta-tag is the problem. But why? And how to solve it?

JTidy parses the same file without errors, but than I have problems with the indexing as the JTidyparser takes only InputStreams without specified encoding, not InputStreamReaders (at least as far as I found out). Even if I convert my file from the original GB to UTF-8 I get only gibberish in the Lucene index when using JTidy for parsing.

Thanks in advance for any suggestions either to get around the HTMLParser problem or get JTidy to handle different encodings,
Jenny

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to