Dennis Kubes wrote:
> In the beginning it is approximately 10 to 1. So for every page I crawl
> I will get 10 more pages to crawl that are not currently in the index.
> As you move towards 50 million pages it becomes more like 6 to 1. If
> you seed the entire dmoz, your first crawl will be around 5.5 million
> pages. Your second crawl will be around 54 million pages. And a depth
> of 3 will give you over 300 million pages. These are the numbers that
> we are currently seeing.
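[Editor's note: those figures compose multiplicatively - each round grows the
frontier by the quoted new-to-crawled ratio. A back-of-envelope sketch in
plain Java, using only the numbers from the post above; the two-ratio
schedule is a simplification, and real ratios depend on the seed list and
on filtering:]

    public class CrawlGrowthEstimate {
      public static void main(String[] args) {
        double pages = 5500000;            // depth 1: crawl of the full dmoz seed list
        double[] ratios = { 10.0, 6.0 };   // new-to-crawled ratio per round, per the post
        for (int depth = 2; depth <= 3; depth++) {
          pages *= ratios[depth - 2];
          System.out.printf("depth %d: ~%.0f million pages%n",
              depth, pages / 1000000);
        }
        // prints roughly: depth 2: ~55 million, depth 3: ~330 million,
        // in the same ballpark as the 54M and 300M+ figures quoted above
      }
    }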
Be advised, though, that any crawl run that collects more than 1 million pages is bound to collect a LOT of utter junk and spam - unless you tightly control the quality of URLs using URLFilters, ScoringFilters, and other means.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
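[Editor's note: to make "tightly control the quality of URLs" concrete,
here is a minimal sketch of a custom URLFilter plugin, assuming the
Nutch URLFilter extension point of that era (org.apache.nutch.net.URLFilter,
where filter() returns the URL string to keep it or null to reject it).
The class name and whitelist are hypothetical:]

    import java.net.MalformedURLException;
    import java.net.URL;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    // Hypothetical whitelist filter: keeps only URLs whose host ends with
    // an allowed suffix; returning null drops the URL so it never reaches
    // the fetch list.
    public class DomainWhitelistURLFilter implements URLFilter {

      private Configuration conf;
      private final String[] allowedSuffixes = { ".edu", ".gov" };

      public String filter(String urlString) {
        try {
          String host = new URL(urlString).getHost().toLowerCase();
          for (String suffix : allowedSuffixes) {
            if (host.endsWith(suffix)) {
              return urlString;   // accept: URL stays in the crawl
            }
          }
        } catch (MalformedURLException e) {
          // unparsable URLs count as junk
        }
        return null;              // reject everything else
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

[For many crawls the stock regex-urlfilter plugin, configured through
conf/regex-urlfilter.txt with +/- regex rules, is enough, and no custom
code is needed.]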
