With this configuration:

1. Is this just for the indexing, or does it provide the search interface as well?
2. How many users would a system like this support?
3. Does this allow for a page cache like Google's?
Thanks
--------------------
Joshua Oliver - [EMAIL PROTECTED]
Help out on the best open source project - Sojourner
http://sourceforge.net/projects/sojourner/



-----Original Message-----
   >From: "Andrzej Bialecki"<[EMAIL PROTECTED]>
   >Sent: 17/11/04 9:02:35 AM
   >To: "[EMAIL PROTECTED]"<[EMAIL PROTECTED]>
   >Subject: Re: [Nutch-dev] Commercial server
     >Peter A. Daly wrote:
   >
   >> Based on my research of Nutch (I am by no means an expert), the problem 
   >> is less storage than raw server count.  I have read for a high 
   >> performance index it is best to limit the index to 20m pages per 
   >
   >Actually, in order to get a sub-second response for several queries per 
   >second you would need to limit this to roughly 10-15mln pages per server 
   >(assuming 4GB Intel x86 servers) - you won't be able to use all RAM just 
   >for Java heap, various OS tasks need some space too...
   >
   >> "search" server.  That means a 2b page index would need 100 search 
   >> servers, probably each with 4gig RAM and a "normal" sized hard disk.  I 
   >
   >... which would give you ~200 servers. For a whole web search engine 
   >this is not a very high number, if you compare it with 20,000+ servers 
   >at Google... ;-)
   >
   >> have read that an average page size is somewhere around 12k.  Even if 
   >> that is not quite accurate in Nutch, we are still talking a monstrous 
   >> amount of storage per searcher machine.  Additionally, I believe a big 
   >
   >My crawling experiments (for 10mln+ pages) put this somewhere between 
   >15-20kB, depending on how high you set the content size limit.
   >
   >The storage per server machine is then easy to calculate: it comes to 
   >200GB, which is peanuts - it fits on a single modern IDE or SATA drive, 
   >or perhaps 2x100GB in a hardware striping config.
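The server-count and per-server storage figures quoted above can be cross-checked with a quick back-of-the-envelope calculation. This is just a sketch in Python using the thread's own numbers (10 mln pages per 4GB search server, ~20kB per page, 2 bln pages total); none of the constants are benchmarked:

```python
# Rough capacity arithmetic, using the figures discussed in this thread.
TOTAL_PAGES = 2_000_000_000    # target index size: 2 billion pages
PAGES_PER_SERVER = 10_000_000  # conservative end of the 10-15 mln estimate
AVG_PAGE_KB = 20               # upper end of the 15-20kB crawl measurements

servers = TOTAL_PAGES // PAGES_PER_SERVER           # -> 200 search servers
total_storage_tb = TOTAL_PAGES * AVG_PAGE_KB / 1e9  # -> 40 TB overall
per_server_gb = total_storage_tb * 1_000 / servers  # -> 200 GB per server
```

At 20kB per page, the earlier 23TB estimate (which assumed 12k per page) grows to roughly 40TB, which is the figure the rest of the thread works with.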
   >
   >> server with massive storage is required to crawl then create and 
   >> distribute the index.  At 12k per page, the storage would be in the 23TB 
   >> range, if my quick calculations are correct.
   >
   >Yes, this is mostly correct. Let's assume we have a 40TB main storage 
   >(20kB page x 2bln pages). You can distribute it on NutchFileSystem 
   >cluster (with redundancy, let's say 10%), which you build out of 
   >commodity machines. If you put 4x250GB disks per machine (1TB), then 
   >assuming 40TB of storage + 4TB for redundancy you end up with 44 
   >machines. Well, not a small number, but hey - we're talking 2 billion 
   >pages here, and you don't build an index of this size anyway without 
   >serious thinking about planning/ investment/ management/ operations/ 
   >whatnot ...
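The cluster sizing in the previous paragraph follows the same pattern. A minimal sketch, again using only the figures stated above (40TB main storage, ~10% redundancy, 4x250GB disks per machine):

```python
import math

# NutchFileSystem cluster sizing, per the figures in this thread.
MAIN_STORAGE_TB = 40   # 20kB/page x 2 bln pages
REDUNDANCY_PCT = 10    # ~10% extra copies for redundancy
DISKS_PER_MACHINE = 4
DISK_SIZE_GB = 250     # 4 x 250GB = 1 TB per machine

redundancy_tb = MAIN_STORAGE_TB * REDUNDANCY_PCT // 100    # -> 4 TB
total_tb = MAIN_STORAGE_TB + redundancy_tb                 # -> 44 TB
per_machine_tb = DISKS_PER_MACHINE * DISK_SIZE_GB / 1_000  # -> 1.0 TB
machines = math.ceil(total_tb / per_machine_tb)            # -> 44 machines
```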
   >
   >> 
   >> Now I guess the real questions are:
   >> 1.  Is 20mil pages per searcher accurate for today's hardware with 4gb ram?
   >> 2.  Can Nutch scale well to a 100 machine cluster?
   >
   >We won't know until someone tries this... I can't see for the moment an 
   >inherent limit on the number of search servers, except for the traffic 
   >overhead, and a possible CPU/memory bottleneck for the search front-end 
   >- which could be solved by introducing intermediate search nodes for 
   >merging partial results.
   >
   >> 3.  Is a cluster of this size manageable?
   >
   >Well, that's where I think the problem lies - we definitely need better 
   >tools for deployment, monitoring and load-balancing search servers. I'm 
   >trying to fill this gap in the meantime until we grow a JMX or somesuch 
   >subsystem, but a more systematic and thorough refactoring is still 
   >needed. Also, the database management (generating, updating, analyzing) 
   >requires an extremely high-throughput I/O subsystem - whether 
   >NutchFileSystem helps here I'm not sure, I will have to check it myself... ;-)
   >
   >> Can someone else back up or correct my estimates?
   >
   >Yes, they are mostly correct, assuming the base calculations that you 
   >can see on Wiki are correct. In the near future I'm going to do some 
   >performance testing on a 15-20mln pages index, I'll report the numbers 
   >to the list then.
   >
   >-- 
   >Best regards,
   >Andrzej Bialecki
   >
   >-------------------------------------------------
   >Software Architect, System Integration Specialist
   >CEN/ISSS EC Workshop, ECIMF project chair
   >EU FP6 E-Commerce Expert/Evaluator
   >-------------------------------------------------
   >FreeBSD developer (http://www.freebsd.org)
   >
   >
   >
   >-------------------------------------------------------
   >This SF.Net email is sponsored by: InterSystems CACHE
   >FREE OODBMS DOWNLOAD - A multidimensional database that combines
   >robust object and relational technologies, making it a perfect match
   >for Java, C++,COM, XML, ODBC and JDBC. www.intersystems.com/match8
   >_______________________________________________
   >Nutch-developers mailing list
   >[EMAIL PROTECTED]
   >https://lists.sourceforge.net/lists/listinfo/nutch-developers
   >
   >
   >



