The following module was proposed for inclusion in the Module List:
modid: WWW::HtmlUnit::Spidey
DSLIP: adpho
description: Web scraping library, scalable, JS support
userid: NINUZZO (Antonio Bonifati)
chapterid: 15 (World_Wide_Web_HTML_HTTP_CGI)
communities:
similar:
WWW::HtmlUnit::Sweet
rationale:
This module builds upon WWW::HtmlUnit to provide an easy-to-use
interface to the Java web scraping library HtmlUnit. Thus it is
appropriate to put it under the WWW::HtmlUnit namespace.
My approach was to use multiple programming paradigms (functional,
declarative and object-based) to devise a Domain-Specific Language
for writing scalable web crawlers with good JavaScript support,
which at the time of writing is lacking in every other Perl web
scraping toolkit except WWW::HtmlUnit::Sweet.
I have asked Brock Wilcox <[email protected]> for
permission to use his namespace prefix WWW::HtmlUnit and he agreed.
He reckons Spidey is different enough from WWW::HtmlUnit::Sweet to
be a welcome alternative.
In fact I departed from any Mechanize-like syntax for good reasons:
* a multi-paradigm DSL produces spiders that are easier to develop,
  maintain and debug
* mimicking the Mechanize interface would be restrictive unless one
  extended it to provide additional features, but then it would
  become incompatible
* HtmlUnit is quite different from Mechanize; fitting the interface
  of the former into the latter would be a contortion with no
  advantages
* interchangeability with Mechanize is not possible anyway, because
  spiders written with Spidey will usually rely on JavaScript
  support, something that Mechanize does not have and will not have
  in the near future. E.g. if JavaScript is needed to submit a form,
  Mechanize cannot handle it directly, while Spidey will, without
  requiring you to write additional code in your spider to emulate
  the JS behaviour.
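To illustrate the last point, here is a minimal sketch of driving a
JS-dependent form through WWW::HtmlUnit, the module Spidey builds on
(Spidey's own DSL is not shown, since its interface is not documented
here). The URL, form name and field names are hypothetical; the DOM
methods mirror HtmlUnit's Java API as exposed through Inline::Java.

```perl
use strict;
use warnings;
use WWW::HtmlUnit;   # wraps the Java HtmlUnit headless browser

my $webclient = WWW::HtmlUnit->new;

# Hypothetical page; getPage returns an HtmlPage object.
my $page = $webclient->getPage('http://example.com/login');

# Suppose the submit button's onclick handler runs JavaScript that
# fills in hidden fields before submitting. HtmlUnit executes that
# JS for us; WWW::Mechanize would need hand-written emulation code.
my $form = $page->getFormByName('login');            # hypothetical name
$form->getInputByName('user')->setValueAttribute('me');
my $result = $form->getInputByName('go')->click;     # click triggers the JS

print $result->asText, "\n";
```

A Spidey spider would wrap this kind of sequence in its higher-level
DSL; the point is that the JavaScript runs inside the headless
browser, not in extra code inside the spider.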
Comparison with WWW::HtmlUnit::Sweet:
Both Sweet and Spidey support JS through HtmlUnit, but while the
former is targeted at web testing, the latter is specific to web
harvesting. In fact Spidey is not only a headless browser with JS
support, but also offers some facilities for data extraction,
conversion, logging and debugging. All these features are needed to
write robust batch-mode web scrapers that harvest data from the
currently unstructured WWW.
enteredby: NINUZZO (Antonio Bonifati)
enteredon: Sat Mar 12 18:54:39 2011 GMT
The resulting entry would be:
WWW::HtmlUnit::
::Spidey adpho Web scraping library, scalable, JS support NINUZZO
Thanks for registering,
--
The PAUSE
PS: The following links are only valid for module list maintainers:
Registration form with editing capabilities:
https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=d6500000_b3ea9e859868b6e2&SUBMIT_pause99_add_mod_preview=1
Immediate (one click) registration:
https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=d6500000_b3ea9e859868b6e2&SUBMIT_pause99_add_mod_insertit=1
Peek at the current permissions:
https://pause.perl.org/pause/authenquery?pause99_peek_perms_by=me&pause99_peek_perms_query=WWW%3A%3AHtmlUnit%3A%3ASpidey