I wouldn't go so far as to call it stupid, but I wouldn't mind
having an HTML parser not built on DOM. Meta info can still
be extracted without a full DOM parse. Boosting phrases within
certain tags (H1,H2,...) would be nice, but it won't necessarily
be useful for everyone, and we aren't doing it right now anyway.
If you feel strongly about it, why don't you write another
parse filter, something like parse-html-lite? People can then
choose which to use.
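
For illustration only, here is a minimal sketch of what a DOM-free
extraction pass could look like. It is written in Python for brevity,
using the standard library's event-driven html.parser, and is not Nutch
plugin code. The names LiteExtractor and extract are hypothetical, and
the choice to skip script/style content is an assumption.

```python
from html.parser import HTMLParser


class LiteExtractor(HTMLParser):
    """Collects plain text, outlinks and meta tags in one streaming pass.

    No tree is built: each callback sees one tag or text run at a time.
    """

    SKIP = {"script", "style"}  # assumption: their content is never indexed

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.outlinks = []
        self.meta = {}
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a" and attrs.get("href"):
            self.outlinks.append(attrs["href"])
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep visible text only; whitespace-only runs are dropped.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (plain_text, outlinks, meta) for one HTML string."""
    parser = LiteExtractor()
    parser.feed(html)
    parser.close()  # flushes any text still buffered after the last tag
    return " ".join(parser.text_parts), parser.outlinks, parser.meta
```

Unclosed tags do not stop the pass, since nothing depends on a
well-formed tree, but this sketch makes no attempt to repair structure
the way a DOM builder would.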
By the way, how are you doing stuff like removing repeated
tokens? It's a problem that I'm interested in also.
Howie
Fuad Efendi wrote:
But we do not need 'better parsing of malformed html', we need only to
extract plain text... Yes, meta-information such as Creative Commons
XML embedded in HTML comments is important too, and the plugin technique
does the job very well.
I am only trying to focus on a specific task, such as removal of repeated
tokens (menu items, options, ...), automatic web-tree building using
anchors and some statistics, calculating rank for repeated tokens and
indexing only specific sentences with low rank. I simply ignore DOM/SAX,
I don't need it.
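
One possible reading of the repeated-token idea, as a minimal sketch
rather than the method described above: treat each page as a list of
text fragments, define a fragment's rank as the fraction of pages on
the same site that contain it, and keep only low-rank fragments. The
function names, the fragment granularity and the 0.5 threshold are all
assumptions made for illustration (Python).

```python
from collections import Counter


def fragment_ranks(pages):
    """Map each fragment to the fraction of pages it appears on."""
    doc_freq = Counter()
    for fragments in pages:
        doc_freq.update(set(fragments))  # count a fragment once per page
    return {frag: count / len(pages) for frag, count in doc_freq.items()}


def low_rank_fragments(pages, max_rank=0.5):
    """Drop fragments that repeat across pages (menus, options, footers)."""
    ranks = fragment_ranks(pages)
    return [[f for f in fragments if ranks[f] <= max_rank] for fragments in pages]


pages = [
    ["Home", "About", "Nutch uses a DOM parser."],
    ["Home", "About", "Lucene builds the index."],
    ["Home", "About", "Plugins extend parsing."],
]
kept = low_rank_fragments(pages)
```

Here "Home" and "About" appear on every page, so their rank is 1.0 and
they are dropped, while each page's unique sentence has rank 1/3 and is
kept for indexing.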
-----Original Message-----
From: Jérôme Charron [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 25, 2006 4:05 AM
To: [email protected]
Subject: Re: Nutch Improvement - HTML Parser
It's not a tool, IT IS stupidity of Nutch: it uses DOM just to extract
plain text and Outlink[]...
It's very easy to design a specific routine to 'parse' byte[]; we can
improve everything 100 times... At Least!
Yes sure. I think everybody has already done such things at school...
Building a DOM provides:
1. better parsing of malformed HTML documents (and there are a lot of
malformed docs on the web)
2. the ability to extract meta-information such as Creative Commons
licenses
3. a high degree of extensibility (HtmlParser extension point) to
extract specific information without parsing the document many times
(for instance, extracting Technorati-like tags, ...) by just providing a
simple plugin.
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers