Hi, I'm trying run org.apache.lucene.demo.IndexHTML on HTML files with UTF-8. In JavaCC FAQ, it's said that just UTF-8 Reader should be given to SimpleCharStream: http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-moz.htm#tth_sEc3.21 . I made the following change in HTMLDocument.java:
//old: FileInputStream fis = new FileInputStream(f); InputStreamReader isr = new InputStreamReader(new FileInputStream(f), "UTF-8"); HTMLParser parser = new HTMLParser(isr); It doesn't help. I don't know if it's true but this text is in SimpleCharStream: "An implementation of interface CharStream, where the stream is assumed to contain only ASCII characters (without unicode processing)." The error is like that: Parse Aborted: Lexical error at line 5, column 34. Encountered: "\u0411" (1041), after : "" This is right after <meta name="Author" content=" . Interestingly, there is <title> in cyrillic before that. Any ideas? Or I have to look at alternatives like those in http://wiki.apache.org/lucene-java/LuceneFAQ#head-e7d23f91df094d7baeceb46b04d518dc426d7d2e (How can I index HTML documents?) Regards, Ognyan Kulev --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]