Peter A. Daly wrote:
> Based on my research of Nutch (I am by no means an expert), the problem is less storage than raw server count. I have read that for a high-performance index it is best to limit the index to 20m pages per
Actually, in order to get sub-second responses for several queries per second you would need to limit this to roughly 10-15mln pages per server (assuming 4GB Intel x86 servers) - you won't be able to use all the RAM just for the Java heap, since various OS tasks need some space too...
"search" server. That means a 2b page index would need 100 search servers, probably each with 4gig RAM and a "normal" sized hard disk. I
... which would give you ~200 servers. For a whole web search engine this is not a very high number, if you compare it with 20,000+ servers at Google... ;-)
> have read that an average page size is somewhere around 12k. Even if that is not quite accurate in Nutch, we are still talking about a monstrous amount of storage per searcher machine. Additionally, I believe a big
My crawling experiments (for 10mln+ pages) put this somewhere between 15-20kB, depending on how high you set the content size limit.
The storage per search server is then easy to calculate: 10mln pages x 20kB comes to about 200GB, which is peanuts - it fits on a single modern IDE or SATA drive, or perhaps on 2x100GB drives in a hardware striping config.
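For a quick sanity check, the per-server arithmetic can be written out as a back-of-envelope sketch in Python. The constants are the estimates from this thread (20kB/page, 10mln pages/server), not measured values:

```python
# Back-of-envelope sizing for a 2-billion-page index,
# using the estimates quoted in this thread.

TOTAL_PAGES = 2_000_000_000      # 2bln pages
PAGES_PER_SERVER = 10_000_000    # ~10mln pages/server for sub-second queries
PAGE_SIZE_KB = 20                # ~15-20kB/page from crawl experiments

servers = TOTAL_PAGES // PAGES_PER_SERVER
storage_per_server_gb = PAGES_PER_SERVER * PAGE_SIZE_KB / 1_000_000  # kB -> GB

print(servers)                # 200 search servers
print(storage_per_server_gb)  # 200.0 GB per server
```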
> server with massive storage is required to crawl then create and distribute the index. At 12k per page, the storage would be in the 23TB range, if my quick calculations are correct.
Yes, this is mostly correct. Let's assume we have 40TB of main storage (20kB/page x 2bln pages). You can distribute it on a NutchFileSystem cluster (with redundancy, let's say 10%), which you build out of commodity machines. If you put 4x250GB disks per machine (1TB), then assuming 40TB of storage + 4TB for redundancy you end up with 44 machines. Well, not a small number, but hey - we're talking 2 billion pages here, and you don't build an index of this size without serious thinking about planning/investment/management/operations/whatnot anyway...
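The cluster sizing above can be checked the same way. This sketch uses integer GB arithmetic to avoid rounding surprises; the 10% redundancy figure and the 1TB-per-machine disk layout are the assumptions stated in this thread:

```python
# NutchFileSystem cluster sizing from the figures above.

TOTAL_PAGES = 2_000_000_000            # 2bln pages
PAGE_SIZE_KB = 20                      # ~20kB/page
DISK_PER_MACHINE_GB = 1000             # 4 x 250GB disks per machine

main_storage_gb = TOTAL_PAGES * PAGE_SIZE_KB // 1_000_000  # kB -> GB
redundancy_gb = main_storage_gb // 10                      # 10% redundancy
machines = (main_storage_gb + redundancy_gb) // DISK_PER_MACHINE_GB

print(main_storage_gb)  # 40000 GB = 40TB main storage
print(redundancy_gb)    # 4000 GB  = 4TB for redundancy
print(machines)         # 44 machines
```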
> Now I guess the real questions are:
> 1. Is 20mil pages per searcher accurate for today's hardware with 4gb ram?
> 2. Can Nutch scale well to a 100 machine cluster?
We won't know until someone tries this... For the moment I can't see an inherent limit on the number of search servers, except for the traffic overhead, and a possible CPU/memory bottleneck in the search front-end - which could be solved by introducing intermediate search nodes that merge partial results.
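The intermediate-node idea boils down to a k-way merge of per-server hit lists. This is only an illustrative sketch with made-up data structures (plain `(doc_id, score)` tuples), not Nutch code: an intermediate node would run such a merge over its subset of search servers, and the front-end would run it once more over the intermediate nodes' outputs.

```python
import heapq

def merge_partial_results(partial_results, k):
    """Merge several hit lists, each already sorted by descending score,
    into a single global top-k list. Negating the score gives heapq.merge
    the ascending order it expects."""
    merged = heapq.merge(*partial_results, key=lambda hit: -hit[1])
    return list(merged)[:k]

# Toy example: (doc_id, score) hits from three search servers.
server_a = [("a1", 0.9), ("a2", 0.4)]
server_b = [("b1", 0.8), ("b2", 0.7)]
server_c = [("c1", 0.5)]

top3 = merge_partial_results([server_a, server_b, server_c], k=3)
print(top3)  # [('a1', 0.9), ('b1', 0.8), ('b2', 0.7)]
```

The same function can be applied at each level of the tree, so adding a layer of merge nodes keeps the front-end's fan-in (and its CPU/memory load) bounded.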
> 3. Is a cluster of this size manageable?
Well, that's where I think the problem lies - we definitely need better tools for deploying, monitoring and load-balancing search servers. I'm trying to fill this gap in the meantime, until we grow a JMX or somesuch subsystem, but a more systematic and thorough refactoring is still needed. Also, database management (generating, updating, analyzing) requires an extremely fast I/O subsystem - whether NutchFileSystem helps here I'm not sure, I will have to check it myself... ;-)
> Can someone else back up or correct my estimates?
Yes, they are mostly correct, assuming the base calculations that you can see on the Wiki are correct. In the near future I'm going to do some performance testing on a 15-20mln page index; I'll report the numbers to the list then.
-- Best regards, Andrzej Bialecki
------------------------------------------------- Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator ------------------------------------------------- FreeBSD developer (http://www.freebsd.org)
_______________________________________________ Nutch-developers mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/nutch-developers
