Hi,
I was hoping it wouldn't come to this:
I've got Unicode in my source HTML, in particular within meta tags, and it's getting broken by the indexer. Note that I'm not trying to query on any of this, just store and retrieve document titles that contain Unicode characters.
Has anyone else experienced this? I know this is just a demo, but it's been working really well and I hate to give it up!
Is this easily fixable? I'm a little worried by this comment in SimpleCharStream.java:
/**
 * An implementation of interface CharStream, where the stream is assumed to
 * contain only ASCII characters (without unicode processing).
 */
This is likely a show-stopper for me on this parser.
Can anyone recommend the shortest path to another HTML parser that is Unicode-friendly?
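In case it helps anyone reproduce this, here is a minimal sketch of the kind of check I have in mind. It is my own illustration, not code from the demo: it assumes the source bytes are UTF-8, uses only standard java.io classes, and the names CharsetCheck and decode are made up for the example. It shows that the same bytes come out intact or mangled depending on which charset the Reader is given, which may or may not be what is going wrong in the indexer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the demo code.
class CharsetCheck {

    // Decode raw bytes into a String, one char at a time, using the given charset.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a document title containing a non-ASCII character.
        String title = "Caf\u00e9";
        // Assumption: the HTML file on disk is encoded as UTF-8.
        byte[] utf8 = title.getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset preserves the title.
        System.out.println(title.equals(decode(utf8, StandardCharsets.UTF_8)));      // true
        // Decoding the same bytes as ISO-8859-1 turns the accented character into two.
        System.out.println(title.equals(decode(utf8, StandardCharsets.ISO_8859_1))); // false
    }
}
```

If the titles survive this kind of round trip but still come back broken from the index, then the problem is presumably somewhere after the Reader, not in how the bytes are decoded.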
Thanks for anything.
Fred