Hi!
Is there a more robust HTMLDocument implementation out there? The one in Lucene's
demo package does not handle malformed HTML files (what about NekoHTML?) and seems
to have other limitations and bugs as well. And why is it in a demo package rather
than part of the distribution?
Nutch uses NekoHTML, so you can browse that codebase and borrow its implementation. The sandbox also has a contribution/ant directory containing an HTMLDocument that uses JTidy to parse HTML; it does a pretty good job of handling bad HTML.
Why isn't it in the distribution? There is no single way to parse HTML and turn it into a Lucene document, and doing so really sits on top of the core rather than being integral to it.
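To illustrate why this is application-specific, here is a minimal sketch of the "parse leniently, then pick your fields" step. It is not the demo, Nutch, or sandbox code: it uses the JDK's built-in Swing HTML parser as a stand-in for NekoHTML or JTidy, and a plain Map as a stand-in for a Lucene Document. The class name HtmlFields and the field names "title" and "contents" are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: extracts text fields from possibly malformed HTML.
final class HtmlFields {

    // Returns a map with "title" and "contents" entries (assumed field names).
    static Map<String, String> extract(String html) {
        final StringBuilder title = new StringBuilder();
        final StringBuilder contents = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Route each text chunk to the title or to the body text.
                StringBuilder target = inTitle ? title : contents;
                if (target.length() > 0) {
                    target.append(' ');
                }
                target.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The JDK parser is lenient: unclosed tags do not abort parsing.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", title.toString().trim());
        fields.put("contents", contents.toString().trim());
        return fields;
    }
}
```

Each map entry would then become a field on a Lucene Document. Which fields to create, and whether each is stored, indexed, or tokenized, is exactly the part that differs from one application to the next.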
Erik