Hi,

Here's another Q, this time about resource requirements for a wide, large-scale crawl on EC2 - primarily storage and bandwidth needs. Please correct any mistakes you see. I'll use 500M pages as the crawl target and assume 10 KB/page on average.
500M pages * 10 KB/page = 5000 GB, which is 5 TB.

That 5 TB is the size of just the raw fetched pages.

Q:
- What about overhead besides the obvious replication factor, such as the sizes of linkdb and crawldb, any temporary data, any other non-raw data in HDFS, and so on?
- If parsed data is stored in addition to raw data, can we assume the parsed content will be up to 50% of the raw fetched data?

Here are some calculations:
- 50 small EC2 instances at $0.085/hour give us 160 GB * 50 = 8 TB for $714/week
- 50 large EC2 instances at $0.34/hour give us 850 GB * 50 = 42.5 TB for $2856/week

(We could lower the cost by using Spot instances, but I'm trying to keep this simple for now.)

It sounds like one either needs more small instances (which should also make fetching faster) or needs to use large instances to be able to store 500M pages plus any overhead. I'm assuming 42.5 TB is enough for that... is it? (See the rough sanity check below.)

Bandwidth is relatively cheap: at $0.10/GB for inbound data, 5000 GB * $0.10 = $500.
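Here's the quick back-of-the-envelope script I used to sanity-check the fit. The 3x factor is just HDFS's default replication, the 50% parsed fraction is my assumption from above, and the 25% crawldb/linkdb/temp overhead is purely a placeholder guess - so take the totals as a rough sketch, not a real sizing.

```python
# Back-of-the-envelope sizing for the 500M-page crawl discussed above.
# Figures marked "assumption" are mine, not measured numbers.

PAGES = 500_000_000
RAW_KB_PER_PAGE = 10                 # average raw page size used above
HDFS_REPLICATION = 3                 # Hadoop's default dfs.replication
PARSED_FRACTION = 0.5                # assumption: parsed data ~50% of raw
DB_AND_TEMP_FRACTION = 0.25          # assumption: crawldb/linkdb/temp overhead

raw_gb = PAGES * RAW_KB_PER_PAGE / 1_000_000            # 5,000 GB = 5 TB
logical_gb = raw_gb * (1 + PARSED_FRACTION + DB_AND_TEMP_FRACTION)
hdfs_gb = logical_gb * HDFS_REPLICATION                  # what the cluster must hold

def cluster(name, instances, disk_gb, usd_per_hour):
    capacity_gb = instances * disk_gb
    weekly_cost = instances * usd_per_hour * 24 * 7
    fits = "fits" if capacity_gb >= hdfs_gb else "does NOT fit"
    print(f"{name}: {capacity_gb / 1000:.1f} TB capacity, "
          f"${weekly_cost:.0f}/week -> {fits} {hdfs_gb / 1000:.1f} TB")

print(f"raw: {raw_gb / 1000:.1f} TB; with parse + db/temp and 3x replication: "
      f"{hdfs_gb / 1000:.1f} TB")
cluster("50 x small", 50, 160, 0.085)
cluster("50 x large", 50, 850, 0.34)

# Inbound transfer at $0.10/GB applies only to the raw fetched bytes.
print(f"inbound transfer: ${raw_gb * 0.10:.0f}")
```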
What mistakes did I make above? Did I miss anything important?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
