demo IndexHTML parser breaks unicode?

Fred Toth Fri, 24 Sep 2004 10:59:41 -0700

Hi,

I was hoping it wouldn't come to this:

I've got Unicode in my source HTML, in particular within meta tags,
and it's getting broken by the indexer. Note that I'm not trying to
query on any of this, just store and retrieve document titles that
contain Unicode characters.

Has anyone else experienced this? I know this is just a demo, but
it's been working really well and I hate to give it up!

Is this easily fixable? I'm a little worried by this comment in
SimpleCharStream.java:

/**
 * An implementation of interface CharStream, where the stream is assumed to
 * contain only ASCII characters (without unicode processing).
 */

This is likely a show-stopper for me on this parser.
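
For anyone reproducing this, one possible cause (an assumption, not something confirmed for the demo parser) is that the bytes of the HTML file are decoded with the wrong charset before the parser ever sees them. The sketch below is self-contained and does not use any Lucene classes; the class name CharsetDemo and the sample title are hypothetical. It only shows how a UTF-8 encoded title is mangled when read as ISO-8859-1, and survives when the Reader is given the right charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Decode a byte array into a String through a Reader that uses the
    // given charset, the same way a file would be read for parsing.
    public static String decode(byte[] bytes, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical document title containing one non-ASCII character.
        String title = "Caf\u00e9";
        byte[] utf8 = title.getBytes(StandardCharsets.UTF_8);

        // Correct charset: the title round-trips unchanged.
        String good = decode(utf8, StandardCharsets.UTF_8);
        // Wrong charset: each UTF-8 byte becomes its own character.
        String bad = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println("round-trips with UTF-8:      " + good.equals(title));
        System.out.println("round-trips with ISO-8859-1: " + bad.equals(title));
        System.out.println("lengths: " + good.length() + " vs " + bad.length());
    }
}
```

If the breakage looks like this (one accented character turning into two), it may be worth checking how the file is opened before replacing the parser. Whether the demo parser accepts a Reader opened this way is not something this sketch establishes.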

Can anyone recommend the shortest path to another HTML parser
that is unicode friendly?

Thanks for anything.

Fred


