I've been playing around with Nutch 0.9 lately and have noticed something odd that I can't explain. I'm doing whole-web crawls using inject, generate, and fetch with -noParsing and 200 threads. Every run uses the same commands on the same machine, and I'm the only user on it. I have a shell script that reads the log file once a minute and estimates how many URLs are being fetched per hour.
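For anyone who wants to reproduce the measurement, here is a minimal sketch of that kind of rate estimate. It is written in Python rather than shell and is not the original script. It assumes the fetcher writes one log line containing "fetching " per URL; the function names, the marker string, and the log path in the usage note are illustrative assumptions.

```python
import time


def count_fetches(log_text, marker="fetching "):
    """Count log lines containing the marker (assumed: one such line per fetched URL)."""
    return sum(1 for line in log_text.splitlines() if marker in line)


def urls_per_hour(count_before, count_after, elapsed_seconds):
    """Extrapolate the URLs fetched during one interval to an hourly rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (count_after - count_before) * 3600.0 / elapsed_seconds


def monitor(log_path, interval_seconds=60):
    """Re-read the log every interval and print the estimated rate until interrupted."""
    previous = None
    while True:
        with open(log_path, errors="replace") as f:
            current = count_fetches(f.read())
        if previous is not None:
            rate = urls_per_hour(previous, current, interval_seconds)
            print("%.0f URLs/hour" % rate)
        previous = current
        time.sleep(interval_seconds)
```

Calling something like monitor("logs/hadoop.log") (the path is an assumption) would print a new estimate each minute. Each figure only reflects the most recent interval, so an average over the whole fetch needs the per-minute values accumulated over the full run.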
If I fetch a list of 50k URLs I average ~120k per hour. With a fetch list of 1M URLs I average ~40k/hour, and with a fetch list of 500k URLs I get ~170k/hour. I can't think of any reason I would be seeing a speed variance like this based solely on the size of the fetch list. Any ideas? For now I'm sticking with a fetch list size of 500k (obviously), but I'd like to know why I'm seeing this variance.

Thanks,
Tim
