As for alternative HTML parsers, there are a few notable ones:
NekoHTML - Nutch uses it
JTidy - My <index> Ant task in the sandbox uses it
and HTMLParser
All of the above are surely far more battle-tested in production than Lucene's demo parser, and I'd be surprised if they did not correctly handle Unicode.
Erik
On Sep 24, 2004, at 11:01 PM, Fred Toth wrote:
Hi,
Thanks for the tip, but that didn't work in my case. Presumably this patch, together with the changes in CVS, makes the parser work with UTF-16. I can't really tell, because the index now appears to be entirely UTF-16 and I can't search for anything.
My input is actually UTF-8 anyway, and if I patch all the streams to use UTF-8 instead of UTF-16, I get parser errors.
So I'm stuck.
Thanks for your help,
Fred
At 09:46 PM 9/24/2004, [EMAIL PROTECTED] wrote:
In org.apache.lucene.demo.HTMLDocument you need to change the input stream to use a different encoding. Replace the fis with this:
fis = new InputStreamReader(new FileInputStream(f), "UTF-16");
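[Editorial sketch, not part of the original message.] The encoding name passed to InputStreamReader has to match the bytes actually on disk. The standalone example below is a minimal illustration of that point, assuming a UTF-8 source such as the one Fred describes later in the thread. The class name EncodingSketch, the decode helper, and the sample title are hypothetical and are not part of the Lucene demo code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {

    /** Decodes raw bytes through an InputStreamReader using the named charset. */
    static String decode(byte[] bytes, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader =
                 new InputStreamReader(new ByteArrayInputStream(bytes), charsetName)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample: an HTML title with a non-ASCII character, stored as UTF-8.
        String title = "<title>Caf\u00e9</title>";
        byte[] utf8Bytes = title.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in recovers the text.
        System.out.println("UTF-8 round trip intact:  "
            + decode(utf8Bytes, "UTF-8").equals(title));

        // Decoding the same bytes as UTF-16 pairs up unrelated bytes into wrong characters.
        System.out.println("UTF-16 round trip intact: "
            + decode(utf8Bytes, "UTF-16").equals(title));
    }
}
```

For a file on disk, the same idea would read new InputStreamReader(new FileInputStream(f), "UTF-8"). Whether the demo parser then accepts the decoded stream is a separate question that this sketch does not address.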
-----Original Message-----
From: Fred Toth [mailto:[EMAIL PROTECTED]
Sent: Friday, September 24, 2004 9:25 PM
To: Lucene Users List
Subject: Re: demo IndexHTML parser breaks unicode?
Sorry, that didn't cure it.
Again, anyone want to point me to the quickest replacement HTML parser (that's unicode clean)?
Thanks,
Fred
At 03:17 PM 9/24/2004, you wrote:
>On Friday 24 September 2004 19:58, Fred Toth wrote:
>
> > I've got unicode in my source HTML. In particular, within meta tags,
> > and it's getting broken by the indexer. Note that I'm not trying to
> > query on any of this, just store and retrieve document titles with
> > unicode characters.
>
>Please try again with the code from CVS, Christoph Goller committed a fix
>for this problem (at least I think it was this problem) 1-3 weeks ago.
>
>Regards
> Daniel
>
>--
>http://www.danielnaber.de
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]