IndexHTML and UTF-8

Ognyan Kulev Fri, 30 Nov 2007 03:34:28 -0800

Hi,

I'm trying run org.apache.lucene.demo.IndexHTML on HTML files with
UTF-8. In JavaCC FAQ, it's said that just UTF-8 Reader should be given
to SimpleCharStream:
http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-moz.htm#tth_sEc3.21 .
I made the following change in HTMLDocument.java:


//old: FileInputStream fis = new FileInputStream(f);
InputStreamReader isr = new InputStreamReader(new FileInputStream(f),
"UTF-8");
HTMLParser parser = new HTMLParser(isr);

It doesn't help. I don't know if it's true but this text is in
SimpleCharStream:

"An implementation of interface CharStream, where the stream is assumed
to contain only ASCII characters (without unicode processing)."

The error is like that:

Parse Aborted: Lexical error at line 5, column 34.  Encountered:
"\u0411" (1041), after : ""

This is right after <meta name="Author" content=" . Interestingly, there
is <title> in cyrillic before that.

Any ideas? Or I have to look at alternatives like those in
http://wiki.apache.org/lucene-java/LuceneFAQ#head-e7d23f91df094d7baeceb46b04d518dc426d7d2e
(How can I index HTML documents?)

Regards,
Ognyan Kulev



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

IndexHTML and UTF-8

Reply via email to