With this configuration:
1. Is this just for the indexing, or does it provide the interface as well?
2. How many users would a system like this support?
3. Does this allow for a cache like Google's?
Thanks
--------------------
Joshua Oliver - [EMAIL PROTECTED]
Help out on the best open source project - Sojourner
http://sourceforge.net/projects/sojourner/
-----Original Message-----
>From: "Andrzej Bialecki"<[EMAIL PROTECTED]>
>Sent: 17/11/04 9:02:35 AM
>To: "[EMAIL PROTECTED]"<[EMAIL PROTECTED]>
>Subject: Re: [Nutch-dev] Commercial server
>Peter A. Daly wrote:
>
>> Based on my research of Nutch (I am by no means an expert), the problem
>> is less storage than raw server count. I have read for a high
>> performance index it is best to limit the index to 20m pages per
>
>Actually, in order to get a sub-second response for several queries per
>second you would need to limit this to roughly 10-15mln pages per server
>(assuming 4GB Intel x86 servers) - you won't be able to use all RAM just
>for Java heap, various OS tasks need some space too...
>
>> "search" server. That means a 2b page index would need 100 search
>> servers, probably each with 4gig RAM and a "normal" sized hard disk. I
>
>... which would give you ~200 servers. For a whole web search engine
>this is not a very high number, if you compare it with 20,000+ servers
>at Google... ;-)
>
>> have read that an average page size is somewhere around 12k. Even if
>> that is not quite accurate in Nutch, we are still talking a monstrous
>> amount of storage per searcher machine. Additionally, I believe a big
>
>My crawling experiments (for 10mln+ pages) put this somewhere between
>15-20kB, depending on how high you set the content size limit.
>
>The storage per server machine is then easy to calculate: it comes to
>about 200GB (10mln pages x 20kB), which is peanuts - it fits on a single
>modern IDE or SATA drive, or perhaps 2x100GB in a hardware striping config.
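The per-server numbers above are a quick back-of-envelope calculation; a minimal sketch of the same arithmetic (the page size and pages-per-server figures are the estimates from this thread, not measured constants):

```python
# Back-of-envelope sizing for the search tier, using the estimates
# discussed in this thread (hypothetical figures, not benchmarks).
TOTAL_PAGES = 2_000_000_000    # target index size: 2 billion pages
PAGES_PER_SERVER = 10_000_000  # ~10-15mln pages per 4GB server
AVG_PAGE_KB = 20               # ~15-20kB average fetched page

servers = TOTAL_PAGES // PAGES_PER_SERVER
storage_per_server_gb = PAGES_PER_SERVER * AVG_PAGE_KB / 1_000_000

print(servers)                 # 200 search servers
print(storage_per_server_gb)  # 200.0 GB per server
```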
>
>> server with massive storage is required to crawl then create and
>> distribute the index. At 12k per page, the storage would be in the 23TB
>> range, if my quick calculations are correct.
>
>Yes, this is mostly correct. Let's assume we have a 40TB main storage
>(20kB page x 2bln pages). You can distribute it on NutchFileSystem
>cluster (with redundancy, let's say 10%), which you build out of
>commodity machines. If you put 4x250GB disks per machine (1TB), then
>assuming 40TB of storage + 4TB for redundancy you end up with 44
>machines. Well, not a small number, but hey - we're talking 2 billion
>pages here, and you don't build an index of this size anyway without
>serious thinking about planning/ investment/ management/ operations/
>whatnot ...
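The 44-machine figure follows from the same kind of arithmetic; a sketch under the assumptions stated above (disk sizes and the 10% redundancy factor are the thread's estimates, not a tested deployment plan):

```python
# NutchFileSystem cluster sizing from the figures above (estimates
# from this thread; integer GB arithmetic to keep the math exact).
TOTAL_PAGES = 2_000_000_000
AVG_PAGE_KB = 20
DISKS_PER_MACHINE = 4
DISK_GB = 250

raw_gb = TOTAL_PAGES * AVG_PAGE_KB // 1_000_000          # 40,000 GB (40 TB)
total_gb = raw_gb + raw_gb // 10                         # +10% redundancy -> 44,000 GB
gb_per_machine = DISKS_PER_MACHINE * DISK_GB             # 4 x 250 GB = 1,000 GB
machines = (total_gb + gb_per_machine - 1) // gb_per_machine  # ceiling division

print(machines)  # 44 machines
```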
>
>>
>> Now I guess the real questions are:
>> 1. Is 20mil pages per searcher accurate for today's hardware with 4GB RAM?
>> 2. Can Nutch scale well to a 100 machine cluster?
>
>We won't know until someone tries this... For the moment I can't see an
>inherent limit on the number of search servers, except for the traffic
>overhead and a possible CPU/memory bottleneck at the search front-end -
>which could be solved by introducing intermediate search nodes that
>merge partial results.
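The merge step described here is essentially a distributed top-k merge: each search server returns its own hits sorted by score, and an intermediate node combines them. A toy sketch of that idea (the function and data shapes are illustrative only, not Nutch's actual distributed search API):

```python
import heapq

def merge_partial_results(partials, k):
    """Merge per-server hit lists, each sorted by descending score,
    into a global top-k. Illustrative sketch, not Nutch's real API."""
    # heapq.merge expects ascending order, so merge on the negated score.
    merged = heapq.merge(*partials, key=lambda hit: -hit[0])
    return list(merged)[:k]

# Hypothetical results from two search servers: (score, url) pairs.
server_a = [(0.92, "http://example.org/a"), (0.40, "http://example.org/b")]
server_b = [(0.87, "http://example.net/c"), (0.55, "http://example.net/d")]

top3 = merge_partial_results([server_a, server_b], k=3)
print(top3)  # [(0.92, '...a'), (0.87, '...c'), (0.55, '...d')]
```

Because each partial list is already sorted, the merge node never needs to re-sort the full result set, which keeps the front-end's CPU cost proportional to k rather than to the total number of hits.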
>
>> 3. Is a cluster of this size manageable?
>
>Well, that's where I think the problem lies - we definitely need better
>tools for deploying, monitoring and load-balancing search servers. I'm
>trying to fill this gap in the meantime until we grow a JMX or somesuch
>subsystem, but a more systematic and thorough refactoring is still
>needed. Also, the database management (generating, updating, analyzing)
>requires an extremely fast I/O subsystem - whether NutchFileSystem helps
>here I'm not sure, I will have to check it myself... ;-)
>
>> Can someone else back up or correct my estimates?
>
>Yes, they are mostly correct, assuming the base calculations you can
>see on the Wiki are correct. In the near future I'm going to do some
>performance testing on a 15-20mln page index, and I'll report the
>numbers to the list then.
>
>--
>Best regards,
>Andrzej Bialecki
>
>-------------------------------------------------
>Software Architect, System Integration Specialist
>CEN/ISSS EC Workshop, ECIMF project chair
>EU FP6 E-Commerce Expert/Evaluator
>-------------------------------------------------
>FreeBSD developer (http://www.freebsd.org)
>
>
>
>-------------------------------------------------------
>This SF.Net email is sponsored by: InterSystems CACHE
>FREE OODBMS DOWNLOAD - A multidimensional database that combines
>robust object and relational technologies, making it a perfect match
>for Java, C++,COM, XML, ODBC and JDBC. www.intersystems.com/match8
>_______________________________________________
>Nutch-developers mailing list
>[EMAIL PROTECTED]
>https://lists.sourceforge.net/lists/listinfo/nutch-developers