[Nutch-dev] Re: Nutch Improvement - HTML Parser

Jérôme Charron Sat, 25 Feb 2006 01:05:03 -0800

> It's not a tool,
> IT IS stupidness of Nutch, it uses DOM just to extract plain text and
> Outlink[]...
> It's very easy to design specific routine to 'parse' byte[], we can
> improve everything 100 times... At Least!


Yes, sure. I think everybody has already done such things at school...
Building a DOM provides:
1. better parsing of malformed HTML documents (and there are a lot of
malformed docs on the web)
2. the ability to extract meta-information such as a Creative Commons
license
3. a high degree of extensibility (the HtmlParser extension point): a
simple plugin can extract specific information (for instance
Technorati-like tags, ...) without parsing the document many times.
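
To make point 3 concrete, here is a minimal sketch of the idea: the
document is parsed once, and several independent extractors share a single
walk over the resulting DOM. Everything in it is an assumption made for
illustration. NodeVisitor, TextCollector, OutlinkCollector and
LicenseCollector are hypothetical names, not the Nutch plugin API, and the
JDK XML parser is used only to keep the example self-contained, so the
input has to be well-formed XHTML. Coping with malformed pages (point 1)
needs a tolerant HTML parser instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomWalkSketch {

    /** Hypothetical extension interface; NOT the real Nutch plugin API. */
    interface NodeVisitor {
        void visit(Node node);
    }

    /** Collects plain text from text nodes, separated by single spaces. */
    static class TextCollector implements NodeVisitor {
        final StringBuilder text = new StringBuilder();
        public void visit(Node node) {
            if (node.getNodeType() != Node.TEXT_NODE) return;
            String s = node.getNodeValue().trim();
            if (s.isEmpty()) return;
            if (text.length() > 0) text.append(' ');
            text.append(s);
        }
    }

    /** Collects the href of every <a> element. */
    static class OutlinkCollector implements NodeVisitor {
        final List<String> outlinks = new ArrayList<>();
        public void visit(Node node) {
            if (node.getNodeType() != Node.ELEMENT_NODE) return;
            Element e = (Element) node;
            if ("a".equalsIgnoreCase(e.getTagName()) && e.hasAttribute("href")) {
                outlinks.add(e.getAttribute("href"));
            }
        }
    }

    /** An extra "plugin": collects links marked rel="license". */
    static class LicenseCollector implements NodeVisitor {
        final List<String> licenses = new ArrayList<>();
        public void visit(Node node) {
            if (node.getNodeType() != Node.ELEMENT_NODE) return;
            Element e = (Element) node;
            if ("license".equalsIgnoreCase(e.getAttribute("rel"))) {
                licenses.add(e.getAttribute("href"));
            }
        }
    }

    /** Single depth-first walk; every visitor sees every node once. */
    static void walk(Node node, List<NodeVisitor> visitors) {
        for (NodeVisitor v : visitors) v.visit(node);
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, visitors);
        }
    }

    /** Assumption: input is well-formed XHTML (JDK XML parser only). */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>Hello <a href=\"http://example.org/\">world</a></p>"
                + "<a rel=\"license\" href=\"http://creativecommons.org/licenses/by/2.5/\">CC</a>"
                + "</body></html>";
        TextCollector text = new TextCollector();
        OutlinkCollector links = new OutlinkCollector();
        LicenseCollector license = new LicenseCollector();
        walk(parse(page), Arrays.<NodeVisitor>asList(text, links, license));
        System.out.println(text.text);
        System.out.println(links.outlinks);
        System.out.println(license.licenses);
    }
}
```

Adding another extractor here means writing one more small visitor class.
The parsing step and the existing collectors stay untouched, which is the
kind of extensibility meant above.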

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
