First, calling something stupid without understanding all the pros and
cons is not helpful in the least. In addition to the benefits Jérôme
listed, using a DOM lets you use XSLT templates to extract information
in a more declarative way, not to mention a standard one.
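
To make the XSLT point concrete, here is a minimal sketch using only the
JDK's javax.xml.transform API. It is not Nutch code: the class name, the
stylesheet, and the sample input are made up for illustration, and the
input is assumed to be well-formed XHTML so that the JDK parser can build
the DOM. In Nutch the DOM would come from the HTML parser instead, and
element-name case may differ depending on how that parser is configured.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of Nutch.
public class XsltLinkExtractor {

    // Illustrative stylesheet: print the href of every <a href="..."> on its own line.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='//a[@href]'>"
      + "<xsl:value-of select='@href'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Apply the stylesheet to an already-built DOM and return the text output.
    public static String extractLinks(Document dom) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    // Assumption: the input is well-formed XHTML; real-world HTML needs a tolerant parser.
    public static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document dom = parse(
            "<html><body><a href='http://a.example/'>A</a>"
          + "<p><a href='/b'>B</a></p></body></html>");
        System.out.print(extractLinks(dom));
    }
}
```

The extraction rule lives entirely in the stylesheet, so changing what
gets extracted (say, meta tags instead of links) means editing the
template rather than writing another byte[] scanner.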

--Ragy

On 2/25/06, Jérôme Charron <[EMAIL PROTECTED]> wrote:
>
> > It's not a tool,
> > it IS stupidity on Nutch's part: it uses a DOM just to extract plain
> > text and Outlink[]...
> > It's very easy to design a specific routine to 'parse' a byte[]; we
> > could improve everything 100 times... at least!
>
> Yes, sure. I think everybody has already done such things at school...
> Building a DOM provides:
> 1. better parsing of malformed HTML documents (and there are a lot of
> malformed docs on the web)
> 2. the ability to extract meta-information such as a Creative Commons
> license
> 3. a high degree of extensibility (the HtmlParser extension point): you
> can extract specific information (for instance Technorati-like tags, ...)
> without parsing the document many times, just by providing a simple
> plugin.
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
>
>
