On Feb 1, 2004, at 6:19 AM, [EMAIL PROTECTED] wrote:
Hi!

Is there any HTMLDocument out there? The one in the demo package of Lucene
does not handle non-well-formed HTML files (what about NekoHTML?) and seems to
have some other limitations and bugs as well (and why isn't it part of the
distro but in a demo package?!).

Nutch uses NekoHTML, so you can browse that codebase and borrow its implementation. The sandbox also has a contribution/ant directory containing an HTMLDocument that parses HTML with JTidy, which does a pretty good job of handling bad HTML.


Why isn't it in the distribution? Parsing HTML and turning it into a Lucene document is not always done the same way, and doing so really sits on top of the core rather than being integral to it.
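
To make the "on top of the core" point concrete, here is a minimal sketch of the parse-then-map step. It is not the demo's HTMLDocument, nor the Nutch or sandbox code: it assumes the JDK's own lenient parser (javax.swing.text.html.parser.ParserDelegator) as a stand-in for NekoHTML or JTidy, and it returns a plain Map in place of a Lucene Document. The class name HtmlFieldExtractor and the field names "title" and "contents" are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: pulls indexable text out of (possibly sloppy) HTML. */
public class HtmlFieldExtractor {

    /**
     * Parses the given HTML and returns two fields, "title" and "contents".
     * The field names are an assumption; an application would choose its own
     * and copy the values into whatever document type its index expects.
     */
    public static Map<String, String> extract(String html) throws IOException {
        final StringBuilder title = new StringBuilder();
        final StringBuilder contents = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (HTML.Tag.TITLE.equals(tag)) {
                    inTitle = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (HTML.Tag.TITLE.equals(tag)) {
                    inTitle = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Route text to the title or the body buffer, space-separated.
                StringBuilder target = inTitle ? title : contents;
                target.append(' ').append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // Third argument: ignore any charset declared inside the markup,
            // since the input has already been decoded into a String.
            new ParserDelegator().parse(reader, callback, true);
        }

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", normalize(title));
        fields.put("contents", normalize(contents));
        return fields;
    }

    /** Collapses runs of whitespace and trims the ends. */
    private static String normalize(CharSequence text) {
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Calling HtmlFieldExtractor.extract(html) gives back the two strings, and the application then decides how each one is stored, indexed, or tokenized. Which fields to extract and how to treat them is exactly the part that differs from one application to the next.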

Erik

