Here is an interesting article about how to process Common Crawl data using Amazon EMR: http://www.commoncrawl.org/mapreduce-for-the-masses/
I think we should be able to do something similar with Whirr quite easily. I will give it a try soon. -- Andrei Savu
