Dennis Kubes wrote:
> In the beginning it is approximately 10 to 1.  So for every page I crawl 
> I will get 10 more pages to crawl that are not currently in the index. 
> As you move towards 50 million pages it becomes more like 6 to 1.  If 
> you seed the entire dmoz, your first crawl will be around 5.5 million 
> pages.  Your second crawl will be around 54 million pages.  And a depth 
> of 3 will give you over 300 million pages.  These are the numbers that 
> we are currently seeing.

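For what it's worth, those numbers are internally consistent: 5.5 million 
seeds times 10 gives roughly 55 million at depth 2, and 54 million times 
6 gives roughly 324 million at depth 3 - the "over 300 million" figure 
above.
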
Be advised, though, that any crawl run that collects more than 1 million 
pages is bound to pick up a LOT of utter junk and spam - unless you 
tightly control the quality of URLs, using URLFilters, ScoringFilters and 
other means.
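
For anything beyond what the stock regex-urlfilter rules can express, you 
can implement the URLFilter extension point yourself. A minimal sketch, 
assuming the Nutch 0.9-era API - the class name and the rejection rule 
below are just examples, and you still need the usual plugin.xml 
descriptor plus a plugin.includes entry to activate it:

  package org.example.nutch;  // hypothetical package

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class JunkUrlFilter implements URLFilter {

    private Configuration conf;

    // Return null to reject a URL, or the (possibly rewritten)
    // URL to keep it in the fetch list.
    public String filter(String urlString) {
      // Example rule: drop session-id URLs, a common source of
      // near-duplicate junk in very large crawls.
      if (urlString.matches(".*[;?&](jsessionid|sid|PHPSESSID)=.*")) {
        return null;
      }
      return urlString;
    }

    public void setConf(Configuration conf) { this.conf = conf; }

    public Configuration getConf() { return conf; }
  }

For a simple pattern like the one above you don't even need custom code - 
a single "-" exclusion line in regex-urlfilter.txt does the same job. A 
custom filter only pays off when the decision needs logic a regex can't 
express.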


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

