Let's do this: create (or reuse existing) low-level processing, meaning we work
with StartTag and EndTag events (which may not match up in malformed HTML) and
look at what lies between them.

This should improve both performance and functionality, because we are not
building a DOM and we are not trying to find and fix HTML errors. Of course
our Tag class will have Attributes, and we will have StartTag, EndTag, etc.
I call this low-level 'parsing'. Are we using a DOM to parse RTF, PDF, XLS, or
TXT? Even inside the existing parser we already use Perl5 to check some
metadata right before parsing.
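Roughly, the scanner I have in mind would look something like the sketch below.
The StartTag/EndTag class names follow the proposal above; everything else
(method names, keeping attributes as raw text) is only illustrative, not
existing Nutch code, and a real scanner would also hand the text between tags
to a callback instead of skipping it.

import java.util.ArrayList;
import java.util.List;

/** Minimal low-level tag scanner: emits StartTag/EndTag events without
 *  building a DOM or repairing malformed markup. Illustrative sketch only. */
public class TagScanner {

    public static abstract class Tag {
        public final String name;
        Tag(String name) { this.name = name; }
    }

    public static class StartTag extends Tag {
        public final String attributes;           // raw attribute text, parsed lazily if needed
        StartTag(String name, String attributes) { super(name); this.attributes = attributes; }
        public String toString() { return "<" + name + ">"; }
    }

    public static class EndTag extends Tag {
        EndTag(String name) { super(name); }
        public String toString() { return "</" + name + ">"; }
    }

    /** Scans the raw HTML and returns tags in document order.
     *  No balancing, no error correction: unmatched tags are simply reported as-is. */
    public static List<Tag> scan(String html) {
        List<Tag> tags = new ArrayList<Tag>();
        int pos = 0;
        while ((pos = html.indexOf('<', pos)) >= 0) {
            int close = html.indexOf('>', pos);
            if (close < 0) break;                  // truncated document: stop, don't "fix" it
            String body = html.substring(pos + 1, close).trim();
            if (body.startsWith("/")) {
                tags.add(new EndTag(body.substring(1).trim().toLowerCase()));
            } else if (!body.startsWith("!") && !body.startsWith("?")) {
                int sp = body.indexOf(' ');
                String name = (sp < 0 ? body : body.substring(0, sp)).toLowerCase();
                String attrs = (sp < 0 ? "" : body.substring(sp + 1));
                tags.add(new StartTag(name, attrs));
            }
            pos = close + 1;
        }
        return tags;
    }

    public static void main(String[] args) {
        // Malformed on purpose: <b> is never closed, and the scanner doesn't care.
        for (Tag t : scan("<html><body><b>hello <a href=\"x\">link</a></body></html>"))
            System.out.println(t);
    }
}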


=====
Yes, sure. I think everybody has done this kind of thing at school...
Building a DOM provides:
1. better parsing of malformed HTML documents (and there are a lot of
malformed docs on the web);
2. the ability to extract meta-information such as a Creative Commons
license;
3. a high degree of extensibility (the HtmlParser extension point) to
extract specific information without parsing the document many times
(for instance, extracting Technorati-like tags, ...) by just providing a
simple plugin; see the sketch after this list.
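As an example of points 2 and 3, once the DOM exists a small filter can pull
out one piece of meta-information, say the Creative Commons license link
(an <a rel="license"> anchor), without re-parsing the page. This is only a
sketch using the standard org.w3c.dom API; the class and method names are
illustrative and not the actual HtmlParser extension-point signature.

import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Illustrative plugin-style extractor working on an already-built DOM. */
public class LicenseExtractor {

    /** Returns the href of the first rel="license" anchor under root, or null if none. */
    public static String findLicense(Node root) {
        if (root instanceof Element) {
            Element e = (Element) root;
            if ("a".equalsIgnoreCase(e.getTagName())
                    && "license".equalsIgnoreCase(e.getAttribute("rel"))) {
                return e.getAttribute("href");
            }
        }
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = findLicense(children.item(i));
            if (found != null) return found;
        }
        return null;
    }
}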


