Based on my research into Nutch (I am by no means an expert), the problem is less storage than raw server count. I have read that for a high-performance index it is best to limit each "search" server to about 20 million pages. That means a 2 billion page index would need 100 search servers, probably each with 4 GB of RAM and a "normal" sized hard disk. I have also read that the average page size is somewhere around 12 KB. Even if that is not quite accurate for Nutch, we are still talking about a monstrous amount of storage per searcher machine (20 million pages at 12 KB each is roughly 240 GB). Additionally, I believe a big server with massive storage is required to crawl, then create and distribute the index. At 12 KB per page, the storage for 2 billion pages would be in the 23 TB range, if my quick calculations are correct.
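
For anyone who wants to check or adjust the numbers, here is the same back-of-envelope arithmetic as a small Python sketch. The function name and constants are mine, not anything from Nutch, and the inputs are just the figures I quoted above (20 million pages per searcher, 12 KB per page, taken here as 12 * 1024 bytes). It only counts raw page content; how large the actual Nutch index and segments turn out relative to that is not something this captures.

```python
def estimate_cluster(total_pages, pages_per_searcher, avg_page_bytes):
    """Return (searcher_count, total_bytes, bytes_per_searcher).

    Hypothetical helper for rough sizing only; it knows nothing about
    how Nutch actually stores or compresses its data.
    """
    # Ceiling division: a partly filled searcher still needs a machine.
    searchers = -(-total_pages // pages_per_searcher)
    total_bytes = total_pages * avg_page_bytes
    bytes_per_searcher = pages_per_searcher * avg_page_bytes
    return searchers, total_bytes, bytes_per_searcher


# Assumed inputs, taken from the estimates in this message.
TOTAL_PAGES = 2_000_000_000        # 2 billion pages
PAGES_PER_SEARCHER = 20_000_000    # 20 million pages per search server
AVG_PAGE_BYTES = 12 * 1024         # "around 12k" per page

searchers, total_bytes, per_searcher = estimate_cluster(
    TOTAL_PAGES, PAGES_PER_SEARCHER, AVG_PAGE_BYTES
)

print(f"search servers:       {searchers}")
print(f"total raw content:    {total_bytes / 1e12:.1f} TB "
      f"({total_bytes / 2**40:.1f} TiB)")
print(f"content per searcher: {per_searcher / 1e9:.0f} GB")
```

With these inputs it comes out to 100 search servers, a bit over 22 TiB (about 24.6 decimal TB) of raw content in total, and roughly 246 GB per searcher, which is in line with the 23 TB ballpark above. Changing any of the three constants shows how sensitive the estimate is to the page-size and pages-per-searcher guesses.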

Now I guess the real questions are:
1. Is 20 million pages per searcher accurate for today's hardware with 4 GB of RAM?
2. Can Nutch scale well to a 100 machine cluster?
3. Is a cluster of this size manageable?


Can someone else back up or correct my estimates?

-Pete

On Nov 16, 2004, at 3:02 PM, Joshua Oliver wrote:

Hello all

I am looking into the possibility of creating a complete commercial web search engine (like Yahoo or Google) using Nutch.
If we were to index 2 billion pages to start with, how much server storage would we need for the index?
And is Nutch currently scalable enough to do this?
Please reply.


Regards,
Joshua Oliver




------------------------------------------------------- This SF.Net email is sponsored by: InterSystems CACHE FREE OODBMS DOWNLOAD - A multidimensional database that combines robust object and relational technologies, making it a perfect match for Java, C++,COM, XML, ODBC and JDBC. www.intersystems.com/match8 _______________________________________________ Nutch-developers mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/nutch-developers


