Ever since working on Solvent and Piggy Bank, we have been toying with the idea of using the same JavaScript scrapers to power a server-side headless crawling agent that could perform data extraction and scraping in a more automated way.
The goal is far from simple: scrapers depend not only on a JavaScript interpreter but also on the object model that modern browsers expose to the JavaScript environment, and on things like HTML parsing and DOM creation. Replicating such an environment on a server practically meant either:

1) writing and/or combining a software stack that performs the same functions as a browser in terms of HTML parsing, DOM creation, JavaScript execution, etc., or

2) finding a way to use Firefox's own code on the server.

Knowing that #1 would probably require years of polishing to reach the level that Firefox/Mozilla has reached over 8 years of development, I turned my attention to #2 and started working with JavaXPCOM (a Java->XPCOM bridge used in recent Eclipse plugins for Ajax support). Unfortunately, JavaXPCOM is, at the moment, a complete disaster in terms of stability, documentation, community support and traction inside the Mozilla development community. Basically, you're on your own, and even when you're not, nobody is really considering using Firefox's XPCOM libraries just for things like HTML parsing and JavaScript execution.

So I was waiting for things on the JavaXPCOM front to solidify until yesterday, when I had an idea: decouple the crawling logic from the actual page fetcher and implement a very minimal HTTP server in JavaScript that turns the web browser into a headless browsing web service. And so I did. At http://simile.mit.edu/repository/crowbar/trunk/ you will find Crowbar: a XUL application (basically a hyper-stripped-down Firefox) that you can execute with XULRunner (basically the XUL equivalent of a Java virtual machine) [see the README.txt for more info] and that you can use from a remote machine as a fetching and DOM-serializing web service.

Right now it does not scrape, but it fetches the URL that you pass it through a RESTful web service, executes the JavaScript, builds the DOM, waits 3 seconds, and returns you the serialization of the page DOM.
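To make the idea concrete, here is a small client-side sketch of what consuming Crowbar's output might look like. The request details (endpoint, port, parameter names) are my own assumptions and not part of anything documented above, so the snippet works offline against a stand-in for the well-formed XHTML serialization the service returns, and extracts links from it the way a crawling agent might:

```python
import xml.etree.ElementTree as ET

# Hypothetical request flow (endpoint and parameters are assumptions,
# not Crowbar's documented interface): a client would POST a target URL
# to the local Crowbar instance and get back a serialized DOM, i.e. the
# page *after* client-side JavaScript has run.

# Offline stand-in for the kind of well-formed XHTML Crowbar returns:
serialized_dom = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>Rendered by client-side script.</p>
    <a href="http://simile.mit.edu/">SIMILE</a>
    <a href="http://www.mozilla.org/">Mozilla</a>
  </body>
</html>"""

def extract_links(xhtml):
    """Roughly the '//A' XPath query: collect every anchor's href."""
    ns = {"h": "http://www.w3.org/1999/xhtml"}
    root = ET.fromstring(xhtml)  # guaranteed parseable: output is well-formed
    return [a.get("href") for a in root.findall(".//h:a", ns)]

print(extract_links(serialized_dom))
# -> ['http://simile.mit.edu/', 'http://www.mozilla.org/']
```

The point of the sketch is that because the serialization is guaranteed well-formed, the crawling agent can use a strict XML parser rather than a tag-soup one.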
This might seem rather pointless, but it is a major milestone, and here is why:

1) I'm able to obtain a serialized and guaranteed well-formed representation of any HTML page, no matter how complicated and no matter how much client-side manipulation is present. This is not just a way to use the browser's own internals instead of, say, wget, but a radically different approach to crawling. For example, the result of "wget http://maps.google.com/" is drastically different from the Crowbar equivalent, due to all the JavaScript action that happens only on the client side! Here, since I'm in fact using a real browser to do the fetching, the result is precisely the same as if you were looking at the page.

2) Executing a Piggy Bank scraper is now just a matter of writing glue code (most of which can be copied directly from Piggy Bank), as the execution environment is precisely the same (XULRunner and Firefox share most of the same code, environmentally speaking).

3) Crowbar's web service will also perform query operations on the resulting DOM directly; for example, to obtain links it is sufficient to ask for the "//A" XPath. This will radically simplify the architecture of the crawling agents that will drive the fetching frontends.

4) Crowbar now automatically uses all the caching mechanisms that the browser uses.

There is still a lot of work to be done before I can see people using this for real, but I wanted to advertise the fact that it is now starting to function, and that we have a clear design direction that is much easier and more solid to work with, so that other interested parties might come in and help out.

Enjoy!

--
Stefano Mazzocchi
Research Scientist, Digital Libraries Research Group
Massachusetts Institute of Technology
E25-131, 77 Massachusetts Ave
Cambridge, MA 02139-4307, USA
skype: stefanomazzocchi
email: stefanom at mit .
edu