As for alternative HTML parsers, there are a few notable ones:

NekoHTML - Nutch uses it

JTidy - My <index> Ant task in the sandbox uses it

and HTMLParser

All of the above are surely far more battle-tested in production than Lucene's demo parser, and I'd be surprised if they did not correctly handle Unicode.

        Erik


On Sep 24, 2004, at 11:01 PM, Fred Toth wrote:

Hi,

Thanks for the tip, but that didn't work in my case. Presumably
this patch, together with the changes in CVS, makes the parser
work with UTF-16. I can't really tell, because the index now
appears to be entirely UTF-16 and I can't search for anything.

My input is actually UTF-8 anyway, and if I patch all the streams
to use UTF-8 instead of UTF-16, I get parser errors.

So I'm stuck.

Thanks for your help,

Fred

At 09:46 PM 9/24/2004, [EMAIL PROTECTED] wrote:
In org.apache.lucene.demo.HTMLDocument you need to change the input stream
so that it is read with an explicit encoding. Replace the fis assignment
with this:


fis = new InputStreamReader(new FileInputStream(f), "UTF-16");
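
The charset name passed to InputStreamReader has to match the bytes actually on disk. Elsewhere in this thread the input is reported to be UTF-8, so here is a minimal, self-contained sketch of that case. It assumes a modern JDK (StandardCharsets, try-with-resources), and EncodingSketch and decode are hypothetical names used only for illustration; none of this is taken from the Lucene demo code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of the Lucene demo.
class EncodingSketch {

    // Read the whole stream into a String, decoding bytes with the given charset.
    static String decode(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Convenience overload for bytes that are already in memory.
    static String decode(byte[] bytes, Charset charset) {
        return decode(new ByteArrayInputStream(bytes), charset);
    }

    public static void main(String[] args) {
        String title = "<title>Caf\u00e9 \u00fcber</title>";
        byte[] utf8Bytes = title.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in round-trips the text.
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8).equals(title));

        // Decoding the same bytes as UTF-16 yields different characters.
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_16).equals(title));
    }
}
```

For a file, the same idea would be decode(new FileInputStream(f), StandardCharsets.UTF_8). Whether the demo parser then accepts the resulting characters is a separate question that this sketch does not address.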

-----Original Message-----
From: Fred Toth [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 24, 2004 9:25 PM
To: Lucene Users List
Subject: Re: demo IndexHTML parser breaks unicode?


Sorry, that didn't cure it.

Again, anyone want to point me to the quickest replacement
HTML parser (one that's Unicode clean)?

Thanks,

Fred

At 03:17 PM 9/24/2004, you wrote:
>On Friday 24 September 2004 19:58, Fred Toth wrote:
>
> > I've got Unicode in my source HTML, in particular within meta tags,
> > and it's getting broken by the indexer. Note that I'm not trying to
> > query on any of this, just store and retrieve document titles with
> > Unicode characters.
>
>Please try again with the code from CVS; Christoph Goller committed a fix
>for this problem (at least I think it was this problem) 1-3 weeks ago.
>
>Regards
> Daniel
>
>--
>http://www.danielnaber.de
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]



