Ever since working on Solvent and Piggy Bank, we have been toying with the idea of using the same JavaScript scrapers to power a server-side headless crawling agent that could perform data extraction and scraping in a more automated way.
The goal is far from simple: scrapers depend not only on a JavaScript interpreter but also on the object model that modern browsers expose to the JavaScript environment, and on things like HTML parsing and DOM creation. Replicating such an environment on a server practically meant either:

1) writing and/or combining a software stack that performs the same functions as a browser in terms of HTML parsing, DOM creation, JavaScript execution, etc., or

2) finding a way to use Firefox's own code on the server.

Knowing that #1 would probably require years of polishing to reach the level that Firefox/Mozilla has reached over 8 years of development, I turned my attention to #2 and started working with JavaXPCOM (a Java->XPCOM bridge used in recent Eclipse plugins for Ajax support). Unfortunately, JavaXPCOM is, at the moment, a complete disaster in terms of stability, documentation, community support and traction inside the Mozilla development community. Basically, you're on your own, and even when you're not, nobody is really considering using Firefox's XPCOM libraries just for things like HTML parsing and JavaScript execution.

So I was waiting for things on the JavaXPCOM front to solidify until yesterday, when I had an idea: decouple the crawling logic from the actual page fetcher and implement a very minimal HTTP server in JavaScript that turns the web browser into a headless browsing web service. And so I did. At http://simile.mit.edu/repository/crowbar/trunk/ you will find Crowbar: a XUL application (basically a hyper-stripped-down Firefox) that you can execute with XULRunner (basically the XUL equivalent of a Java virtual machine) [see the README.txt for more info] and that you can use from a remote machine as a fetching and DOM-serializing web service.

Right now it does not scrape, but it fetches the URL that you pass it through a RESTful web service, executes the JavaScript, builds the DOM, waits 3 seconds, and returns you the serialization of the page DOM.
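To make the idea concrete, here is a small client-side sketch of what consuming Crowbar's output might look like. The request details (endpoint, port, parameter names) are my own assumptions and not part of anything documented above, so the snippet works offline against a stand-in for the well-formed XHTML serialization the service returns, and extracts links from it the way a crawling agent might:

```python
import xml.etree.ElementTree as ET

# Hypothetical request flow (endpoint and parameters are assumptions,
# not Crowbar's documented interface): a client would POST a target URL
# to the local Crowbar instance and get back a serialized DOM, i.e. the
# page *after* client-side JavaScript has run.

# Offline stand-in for the kind of well-formed XHTML Crowbar returns:
serialized_dom = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>Rendered by client-side script.</p>
    <a href="http://simile.mit.edu/">SIMILE</a>
    <a href="http://www.mozilla.org/">Mozilla</a>
  </body>
</html>"""

def extract_links(xhtml):
    """Roughly the '//A' XPath query: collect every anchor's href."""
    ns = {"h": "http://www.w3.org/1999/xhtml"}
    root = ET.fromstring(xhtml)  # guaranteed parseable: output is well-formed
    return [a.get("href") for a in root.findall(".//h:a", ns)]

print(extract_links(serialized_dom))
# -> ['http://simile.mit.edu/', 'http://www.mozilla.org/']
```

The point of the sketch is that because the serialization is guaranteed well-formed, the crawling agent can use a strict XML parser rather than a tag-soup one.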
This might seem rather pointless, but it is a major milestone, and here is why:

1) I'm able to obtain a serialized and guaranteed well-formed representation of any HTML page, no matter how complicated and no matter how much client-side manipulation is present. This is not just a way to use the browser's own internals instead of, say, wget, but a radically different approach to crawling. For example, the result of "wget http://maps.google.com/" is drastically different from the Crowbar equivalent, due to all the JavaScript action that happens only on the client side! Here, since I'm in fact using a real browser to do the fetching, the result is precisely the same as if you were looking at the page.

2) Executing a Piggy Bank scraper is now just a matter of writing glue code (most of which can be copied directly from Piggy Bank), as the execution environment is precisely the same (XULRunner and Firefox share most of the same code, environmentally speaking).

3) Crowbar's web service will also perform query operations on the resulting DOM directly; for example, to obtain links it is sufficient to ask for the "//A" XPath. This will radically simplify the architecture of the crawling agents that will drive the fetching frontends.

4) Crowbar now automatically uses all the caching mechanisms that the browser uses.

There is still a lot of work to be done before I can see people using this for real, but I wanted to advertise the fact that it is now starting to function, and that we have a clear design direction that is much easier and more solid to work with, so that other interested parties might come in and help out.

Enjoy!

--
Stefano Mazzocchi
Research Scientist, Digital Libraries Research Group
Massachusetts Institute of Technology
E25-131, 77 Massachusetts Ave
Cambridge, MA 02139-4307, USA
skype: stefanomazzocchi
email: stefanom at mit .
edu