The following module was proposed for inclusion in the Module List:
modid: WWW::HtmlUnit::Spidey
DSLIP: adpho
description: Web scraping library, scalable, JS support
userid: NINUZZO (Antonio Bonifati)
chapterid: 15 (World_Wide_Web_HTML_HTTP_CGI)
communities:
similar:
WWW::HtmlUnit::Sweet
rationale:
This module builds upon WWW::HtmlUnit to provide an easy-to-use
interface to the Java web scraping library HtmlUnit. Thus it is
appropriate to put it under the WWW::HtmlUnit namespace.
My approach was to use multiple programming paradigms (functional,
declarative and object-based) to devise a Domain-Specific Language
for writing scalable web crawlers with good JavaScript support,
which at the time of writing is lacking in every other Perl web
scraping toolkit except WWW::HtmlUnit::Sweet.
I have asked Brock Wilcox <[email protected]> for
permission to use his namespace prefix WWW::HtmlUnit and he agreed.
He reckons Spidey is different enough from WWW::HtmlUnit::Sweet to
be a welcome alternative.
In fact I departed from any Mechanize-like syntax for good reasons:
* a multi-paradigm DSL produces spiders that are easier to develop,
  maintain and debug
* mimicking the Mechanize interface would be restrictive unless one
  extended it to provide additional features, but then it would
  become incompatible
* HtmlUnit is quite different from Mechanize; fitting the interface
  of the former into the latter would be a contortion with no
  advantages
* interchangeability with Mechanize is not possible anyway, because
  spiders written with Spidey will usually rely on JavaScript
  support, something that Mechanize does not have and will not have
  in the near future. E.g. if JavaScript is needed to submit a form,
  Mechanize cannot handle it directly, while Spidey will, without
  requiring you to write additional code in your spider to emulate
  the JS behaviour.
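To illustrate the last point, here is a minimal sketch of driving a
JS-dependent form through WWW::HtmlUnit, the module Spidey builds on
(Spidey's own DSL is not shown, since its interface is not documented
here). The URL, form name and field names are hypothetical; the DOM
methods mirror HtmlUnit's Java API as exposed through Inline::Java.

```perl
use strict;
use warnings;
use WWW::HtmlUnit;   # wraps the Java HtmlUnit headless browser

my $webclient = WWW::HtmlUnit->new;

# Hypothetical page; getPage returns an HtmlPage object.
my $page = $webclient->getPage('http://example.com/login');

# Suppose the submit button's onclick handler runs JavaScript that
# fills in hidden fields before submitting. HtmlUnit executes that
# JS for us; WWW::Mechanize would need hand-written emulation code.
my $form = $page->getFormByName('login');            # hypothetical name
$form->getInputByName('user')->setValueAttribute('me');
my $result = $form->getInputByName('go')->click;     # click triggers the JS

print $result->asText, "\n";
```

A Spidey spider would wrap this kind of sequence in its higher-level
DSL; the point is that the JavaScript runs inside the headless
browser, not in extra code inside the spider.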
Comparison with WWW::HtmlUnit::Sweet:
Both Sweet and Spidey support JS through HtmlUnit, but while the
former is targeted at web testing, the latter is specific to web
harvesting. In fact Spidey is not only a headless browser with JS
support, but also offers some facilities for data extraction,
conversion, logging and debugging. All these features are needed to
write robust batch-mode web scrapers that harvest data from the
currently unstructured WWW.
enteredby: NINUZZO (Antonio Bonifati)
enteredon: Sat Mar 12 18:54:39 2011 GMT
The resulting entry would be:
WWW::HtmlUnit::
::Spidey adpho Web scraping library, scalable, JS support NINUZZO
Thanks for registering,
--
The PAUSE
PS: The following links are only valid for module list maintainers:
Registration form with editing capabilities:
https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=d6500000_b3ea9e859868b6e2&SUBMIT_pause99_add_mod_preview=1
Immediate (one click) registration:
https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=d6500000_b3ea9e859868b6e2&SUBMIT_pause99_add_mod_insertit=1
Peek at the current permissions:
https://pause.perl.org/pause/authenquery?pause99_peek_perms_by=me&pause99_peek_perms_query=WWW%3A%3AHtmlUnit%3A%3ASpidey