But we do not need 'better parsing of malformed HTML'; we only need to extract plain text... Yes, meta-information such as Creative Commons XML embedded in HTML comments is important too, and the plugin technique does that job very well.
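Purely as an illustration of the kind of routine I mean (class and method names are made up, and real code would also need charset detection, entity decoding, and handling of script/style bodies and of '>' inside attribute values), a single pass over the decoded page can collect both the visible text and the comment contents, with no DOM at all:

import java.util.ArrayList;
import java.util.List;

/** Single-pass text/comment extractor; illustrative sketch only. */
public class FlatHtmlExtractor {

    public static class Result {
        public final StringBuilder text = new StringBuilder();
        public final List<String> comments = new ArrayList<String>();
    }

    public static Result extract(String html) {
        Result r = new Result();
        int i = 0, n = html.length();
        while (i < n) {
            char c = html.charAt(i);
            if (c == '<') {
                if (html.startsWith("<!--", i)) {
                    int end = html.indexOf("-->", i + 4);
                    if (end < 0) break;                         // unterminated comment
                    r.comments.add(html.substring(i + 4, end)); // may hold CC RDF
                    i = end + 3;
                } else {
                    int end = html.indexOf('>', i);
                    if (end < 0) break;                         // unterminated tag
                    i = end + 1;
                }
                r.text.append(' ');                             // tag acts as a separator
            } else {
                r.text.append(c);
                i++;
            }
        }
        return r;
    }
}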
I am only trying to focus on specific tasks, such as removal of repeated tokens (menu items, options, ...), automatic web-tree building using anchors and some statistics, calculating a rank for repeated tokens, and indexing only sentences with a low rank (a rough sketch of the token-rank idea follows at the end of this message). I simply ignore DOM/SAX; I don't need it.

-----Original Message-----
From: Jérôme Charron [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 25, 2006 4:05 AM
To: [email protected]
Subject: Re: Nutch Improvement - HTML Parser

> It's not a tool,
> IT IS stupidness of Nutch, it uses DOM just to extract plain text and
> Outlink[]...
> It's very easy to design specific routine to 'parse' byte[], we can
> improve everything 100 times... At Least!

Yes, sure. I think everybody has already done such things at school...
Building a DOM provides:
1. better parsing of malformed HTML documents (and there are a lot of malformed docs on the web)
2. the ability to extract meta-information such as a Creative Commons license
3. a high degree of extensibility (the HtmlParser extension point), so that specific information (for instance Technorati-like tags, ...) can be extracted without parsing the document many times, just by providing a simple plugin

Regards

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
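Coming back to the token-rank idea from the top of this message, here is the promised rough sketch. It is simplified in every direction (hypothetical names; whole text blocks instead of tokens; plain document frequency instead of anchor statistics and the web-tree): count how many pages of the same site repeat each block, and keep only the blocks whose frequency stays low, which drops menus and option lists that appear on every page:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Ranks text blocks by cross-page repetition and keeps the rare ones; sketch only. */
public class RepeatedBlockFilter {

    /** Fraction of a site's pages a block may appear on before it counts as boilerplate. */
    private static final double MAX_DOC_FREQ = 0.5;

    /** pages: one list of text blocks per page of the same site. */
    public static List<List<String>> filter(List<List<String>> pages) {
        // Document frequency: in how many pages does each block occur?
        Map<String, Integer> docFreq = new HashMap<String, Integer>();
        for (List<String> page : pages) {
            for (String block : new HashSet<String>(page)) {  // count once per page
                Integer old = docFreq.get(block);
                docFreq.put(block, old == null ? 1 : old + 1);
            }
        }
        double limit = MAX_DOC_FREQ * pages.size();
        List<List<String>> kept = new ArrayList<List<String>>();
        for (List<String> page : pages) {
            List<String> keep = new ArrayList<String>();
            for (String block : page) {
                if (docFreq.get(block) <= limit) {            // low rank -> worth indexing
                    keep.add(block);
                }
            }
            kept.add(keep);
        }
        return kept;
    }
}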

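And for comparison, the DOM route from Jérôme's points 2 and 3 in its simplest form: once a Document has been built (by whatever tolerant HTML parser; this sketch uses only generic org.w3c.dom types and is not the actual Nutch HtmlParser extension interface), a plugin that wants comment-embedded metadata such as a Creative Commons license is just a short walk over the tree:

import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Walks an already-built DOM and collects comment bodies; sketch only. */
public class CommentScanner {

    /** Recursively gather the text of every comment node under root. */
    public static List<String> comments(Node root) {
        List<String> out = new ArrayList<String>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.COMMENT_NODE) {
            out.add(node.getNodeValue());  // e.g. embedded Creative Commons RDF
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }
}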