Hi,

Here's another Q, this time about resource requirements for a wide, large-scale crawl on EC2 - primarily storage and bandwidth needs. Please correct any mistakes you see. I'll use 500M pages as the crawl target and assume 10 KB/page on average.
500M pages * 10 KB/page = 5000 GB, which is 5 TB.

That 5 TB is the size of just the raw fetched pages.

Q:
- What about overhead besides the obvious replication factor, such as the sizes of linkdb and crawldb, any temporary data, any other non-raw data in HDFS, and so on?
- If parsed data is stored in addition to raw data, can we assume the parsed content will be up to 50% of the raw fetched data?

Here are some calculations:
- 50 small EC2 instances at $0.085/hour give us 160 GB * 50 = 8 TB for $714/week
- 50 large EC2 instances at $0.34/hour give us 850 GB * 50 = 42.5 TB for $2856/week

(We could lower the cost by using Spot instances, but I'm trying to keep this simple for now.)

It sounds like one either needs more small instances (which should also make fetching faster) or needs to use large instances to be able to store 500M pages plus any overhead. I'm assuming 42.5 TB is enough for that... is it? (See the rough sanity check below.)

Bandwidth is relatively cheap: at $0.10/GB for inbound data, 5000 GB * $0.10 = $500.
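Here's the quick back-of-the-envelope script I used to sanity-check the fit. The 3x factor is just HDFS's default replication, the 50% parsed fraction is my assumption from above, and the 25% crawldb/linkdb/temp overhead is purely a placeholder guess - so take the totals as a rough sketch, not a real sizing.

```python
# Back-of-the-envelope sizing for the 500M-page crawl discussed above.
# Figures marked "assumption" are mine, not measured numbers.

PAGES = 500_000_000
RAW_KB_PER_PAGE = 10                 # average raw page size used above
HDFS_REPLICATION = 3                 # Hadoop's default dfs.replication
PARSED_FRACTION = 0.5                # assumption: parsed data ~50% of raw
DB_AND_TEMP_FRACTION = 0.25          # assumption: crawldb/linkdb/temp overhead

raw_gb = PAGES * RAW_KB_PER_PAGE / 1_000_000            # 5,000 GB = 5 TB
logical_gb = raw_gb * (1 + PARSED_FRACTION + DB_AND_TEMP_FRACTION)
hdfs_gb = logical_gb * HDFS_REPLICATION                  # what the cluster must hold

def cluster(name, instances, disk_gb, usd_per_hour):
    capacity_gb = instances * disk_gb
    weekly_cost = instances * usd_per_hour * 24 * 7
    fits = "fits" if capacity_gb >= hdfs_gb else "does NOT fit"
    print(f"{name}: {capacity_gb / 1000:.1f} TB capacity, "
          f"${weekly_cost:.0f}/week -> {fits} {hdfs_gb / 1000:.1f} TB")

print(f"raw: {raw_gb / 1000:.1f} TB; with parse + db/temp and 3x replication: "
      f"{hdfs_gb / 1000:.1f} TB")
cluster("50 x small", 50, 160, 0.085)
cluster("50 x large", 50, 850, 0.34)

# Inbound transfer at $0.10/GB applies only to the raw fetched bytes.
print(f"inbound transfer: ${raw_gb * 0.10:.0f}")
```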
What mistakes did I make above? Did I miss anything important?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
