Hi,

We're finding that one or two domains are generating excessive retries, and that's drastically slowing our fetch process down by hours. Is there any general guidance on how to fix the problem? We've raised our max retries variable from 1 to 3, I believe, and we're still getting the problem. Here are some example URLs:

http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-D426F221/ama/web/travel_Group-Travel.htm
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-CEF90BB0/ama/web/everything_auto_driver_ed.htm
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-DEA4DDC2/ama/web/everything_auto_Vehicle-Safety.htm
http://www.plentyoffish.com/personals/3147onlinedating.htm
http://www.plentyoffish.com/personals/1032onlinedating27.htm
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-CBE7B5E0/ama/web/insurance_Insurance-News.htm

Also, it seems like we're trying to access hundreds of thousands of pages from 
some of these domains. Shouldn't it be limiting the number of pages fetched 
from a single domain?  (I guess that's two questions.)
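
To be concrete about the kind of limit I have in mind, here is a rough Python 
sketch. It is not Nutch code; the function name cap_per_host and the limit are 
made up purely for illustration. It keeps at most a fixed number of URLs per 
host from a candidate list:

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_per_host(urls, max_per_host):
    """Keep at most max_per_host URLs for each host, preserving input order.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    kept_count = defaultdict(int)  # host -> number of URLs kept so far
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if kept_count[host] < max_per_host:
            kept_count[host] += 1
            kept.append(url)
    return kept
```

With a cap like that, a single host could never contribute more than the 
chosen number of pages to one fetch, no matter how many URLs we have for it.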

Offhand, it looks like the ama.ab.ca URLs have a session variable in them (the 
SID-... path segment).  My first guess is that this may be part of the problem.  
These two domains make up something like 80-90% of our retries. Clearly we need 
to stop the excessive retries, and at the same time be a bit more polite with 
those domains.
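
To show what I suspect is happening with the session variable, here is a small 
Python sketch. It assumes the session token is always a path segment of the 
form SID-<hex>-<hex>, as in the ama.ab.ca examples above; strip_session_id is a 
made-up name, and I have not checked whether the site serves the page without 
that segment. It only shows that the URLs would compare equal once the segment 
is removed:

```python
import re

# Assumption: the session token is a path segment shaped like
# "SID-<hex>-<hex>", based only on the example URLs in this message.
SESSION_SEGMENT = re.compile(r"/SID-[0-9A-Fa-f]+-[0-9A-Fa-f]+(?=/)")


def strip_session_id(url):
    """Drop the SID-... path segment so session variants compare equal.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    return SESSION_SEGMENT.sub("", url)


a = strip_session_id(
    "http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-D426F221/ama/web/travel_Group-Travel.htm"
)
b = strip_session_id(
    "http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-CEF90BB0/ama/web/travel_Group-Travel.htm"
)
# Both now read http://www.ama.ab.ca/cps/rde/xchg/ama/web/travel_Group-Travel.htm
```

If that guess is right, every new session ID makes the same page look like a 
brand new URL, which might explain the huge page counts for that domain.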

Thanks,
Glenn




_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
