Demo HTML parser doesn't work for international documents
---------------------------------------------------------
Key: LUCENE-589
URL: http://issues.apache.org/jira/browse/LUCENE-589
Project: Lucene - Java
Type: Bug
Components: Examples
Versions: 2.0.0
Reporter: Curtis d'Entremont
The JavaCC-generated parser assumes ASCII input, so it won't work with, say,
Japanese documents. Ideally it would read the charset from the HTML markup, but
that can be tricky. For now, assuming Unicode would do the trick.
Add the line marked with a + to the options block in HTMLParser.jj:
options {
    STATIC = false;
    OPTIMIZE_TOKEN_MANAGER = true;
    //DEBUG_LOOKAHEAD = true;
    //DEBUG_TOKEN_MANAGER = true;
+   UNICODE_INPUT = true;
}
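
For the longer-term idea of reading the charset from the HTML markup, a rough
sketch might look like the following. It is not part of the demo: the class
CharsetSniffer and its methods are hypothetical names, the 1024-byte prefix and
the UTF-8 fallback are arbitrary assumptions, and it assumes a JDK recent
enough to have java.nio.charset.StandardCharsets. It only looks for a
charset=... attribute inside a <meta> tag and ignores BOMs and HTTP headers.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Lucene demo.
final class CharsetSniffer {
    // Matches charset=... inside a <meta> tag, for example
    // <meta charset="utf-8"> or
    // <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Assumption: the declaration appears within the first 1024 bytes.
    private static final int PREFIX_LENGTH = 1024;

    /** Guesses the charset from a document prefix; falls back to UTF-8. */
    static Charset sniff(byte[] document) {
        byte[] head = Arrays.copyOf(document,
            Math.min(document.length, PREFIX_LENGTH));
        // ISO-8859-1 maps every byte to one char, so the ASCII-only
        // meta tag survives decoding whatever the real encoding is.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: use the fallback.
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes the whole document with the sniffed charset. */
    static String decode(byte[] document) {
        return new String(document, sniff(document));
    }

    /** Opens a Reader over the document using the sniffed charset. */
    static Reader open(byte[] document) {
        return new InputStreamReader(
            new ByteArrayInputStream(document), sniff(document));
    }
}
```

Assuming the generated HTMLParser has a constructor that accepts a
java.io.Reader, the Reader returned by open(...) could be handed to it once
UNICODE_INPUT is enabled, so that the parser receives already-decoded
characters instead of raw bytes.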