> It's not a tool,
> IT IS stupidness of Nutch, it uses DOM just to extract plain text and
> Outlink[]...
> It's very easy to design specific routine to 'parse' byte[], we can
> improve everything 100 times... At Least!

Yes, sure. I think everybody has already done such things at school...

Building a DOM provides:
1. better parsing of malformed HTML documents (and there are a lot of
   malformed documents on the web);
2. the ability to extract meta-information, such as a Creative Commons
   license;
3. a high degree of extensibility (the HtmlParser extension point): specific
   information (for instance, Technorati-like tags) can be extracted by
   simply providing a plugin, without parsing the document several times.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
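
A minimal sketch of the "parse once, extract many things" idea from points 2 and 3. It is not Nutch code: the class `DomSinglePass` and its `Result` holder are hypothetical names, and the JDK XML parser stands in for the DOM builder, so the input here must be well-formed. A real crawler would build the DOM with a lenient HTML parser to cope with the malformed documents mentioned in point 1.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DomSinglePass {

    /** Hypothetical result holder; not Nutch's own parse or outlink classes. */
    public static final class Result {
        public final StringBuilder text = new StringBuilder();
        public final List<String> outlinks = new ArrayList<>();
        public String license; // stays null when no rel="license" link is found
    }

    /** Builds the DOM once, then collects everything in a single traversal. */
    public static Result extract(byte[] content) throws Exception {
        // Assumption: content is well-formed markup, because the JDK parser is strict.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(content));
        Result result = new Result();
        walk(doc.getDocumentElement(), result);
        return result;
    }

    private static void walk(Node node, Result result) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // Plain text: append trimmed text nodes, separated by single spaces.
            String t = node.getNodeValue().trim();
            if (!t.isEmpty()) {
                if (result.text.length() > 0) {
                    result.text.append(' ');
                }
                result.text.append(t);
            }
        } else if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element e = (Element) node;
            if ("a".equalsIgnoreCase(e.getTagName()) && e.hasAttribute("href")) {
                String href = e.getAttribute("href");
                result.outlinks.add(href); // outlink extraction
                if ("license".equalsIgnoreCase(e.getAttribute("rel"))) {
                    result.license = href; // meta-information, e.g. a license link
                }
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, result);
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>Hello <a href=\"http://example.org/\">world</a></p>"
                + "<a rel=\"license\" href=\"http://creativecommons.org/licenses/by/2.5/\">CC</a>"
                + "</body></html>";
        Result r = extract(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(r.text);
        System.out.println(r.outlinks);
        System.out.println(r.license);
    }
}
```

Adding another extractor (tags, for instance) would mean one more branch or visitor in `walk`, not another pass over the raw bytes.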
