I've been playing around with Nutch 0.9 lately and have noticed something odd
that I can't explain.  I'm doing whole-web crawls using inject, generate,
and fetch with -noParsing and 200 threads.  Every run uses the same commands
on the same machine, and I'm the only user on it.  I have a shell script that
reads the log file once a minute and estimates how many URLs are being
fetched per hour.
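
For anyone who wants to reproduce the measurement, here is a rough sketch of that kind of estimate. It is written in Python rather than being my actual shell script, and it assumes that each fetched URL produces one log line containing "fetching <url>"; the log path and the function names are placeholders, so adjust them to your setup.

```python
import re
import time

# Assumption: one log line containing "fetching <url>" per fetched URL.
FETCH_LINE = re.compile(r"\bfetching\s+\S+")


def count_fetches(log_lines):
    """Count the log lines that look like a fetch event."""
    return sum(1 for line in log_lines if FETCH_LINE.search(line))


def urls_per_hour(prev_count, curr_count, interval_seconds=60):
    """Extrapolate an hourly rate from the change in count over one interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (curr_count - prev_count) * 3600 / interval_seconds


def monitor(log_path, interval_seconds=60):
    """Re-read the log every interval and print the estimated rate."""
    prev = None
    while True:
        with open(log_path, errors="replace") as handle:
            curr = count_fetches(handle)
        if prev is not None:
            rate = urls_per_hour(prev, curr, interval_seconds)
            print(f"{rate:.0f} urls/hour")
        prev = curr
        time.sleep(interval_seconds)


if __name__ == "__main__":
    # Hypothetical path; point this at wherever your fetcher log lives.
    monitor("logs/hadoop.log")
```

Note that a one-minute sample extrapolated to an hour is noisy, so the averages below are taken over the whole fetch.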

If I fetch a list of 50k URLs I average ~120k URLs per hour.  With a fetch
list of 500k URLs I get 170k/hour, but with a fetch list of 1M URLs I
average only ~40k/hour.

I can't think of any reason I would be seeing a speed variance like this
based solely on the size of the fetch list.  Any ideas?  For now I'm
sticking with a fetch list size of 500k (obviously), but I'd like to know
why I'm seeing this variance.

Thanks,
Tim

