In the beginning the ratio is approximately 10 to 1: for every page I crawl, I get about 10 more pages to crawl that are not currently in the index. As you move towards 50 million pages it becomes more like 6 to 1. If you seed the entire dmoz, your first crawl will be around 5.5 million pages. Your second crawl will be around 54 million pages. And a depth of 3 will give you over 300 million pages. These are the numbers that we are currently seeing.
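To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of Nutch; the function name `project_crawl`, its parameters, and the hard switch from 10:1 to 6:1 at 50 million pages are my own simplifying assumptions. In practice the ratio probably tapers off gradually, so treat the output as an order-of-magnitude estimate only.

```python
def project_crawl(seed_pages, depth, early_ratio=10.0, late_ratio=6.0,
                  threshold=50_000_000):
    """Estimate the number of pages fetched in each crawl round.

    Assumption (not from Nutch itself): each round yields roughly
    `early_ratio` new pages per page fetched while the previous round is
    below `threshold` pages, and `late_ratio` once it reaches the threshold.
    """
    sizes = [seed_pages]  # round 1 fetches the seed list itself
    for _ in range(depth - 1):
        current = sizes[-1]
        ratio = early_ratio if current < threshold else late_ratio
        sizes.append(int(current * ratio))
    return sizes


if __name__ == "__main__":
    # Seeding from dmoz: roughly 5.5 million pages in the first round.
    for depth, pages in enumerate(project_crawl(5_500_000, 3), start=1):
        print(f"depth {depth}: ~{pages:,} pages")
```

With a 5.5 million page seed this gives about 55 million pages for the second round and 330 million for the third, which is in line with the figures above. For a 4000 page seed, the same rule of thumb suggests on the order of 40,000 pages in the second round.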
Dennis Kubes

bbrown wrote:
> This is kind of a generic question. Are there any stats on how many pages
> will get crawled based on some initial seed. For example, if you seed the
> list from dmoz, how many pages will get indexed? Lets say there are 4
> million, will 4 million only get indexed?
>
> Or lets say I have 4000, will I get 30,000 crawled/indexed pages?
>
> --
> Berlin Brown
> [berlin dot brown at gmail dot com]
> http://botspiritcompany.com/botlist/?

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
