Hi,
We're finding that one or two domains are causing excessive retries,
and that is slowing our fetch process down by hours.
Any general guidance on how to fix the problem? We've raised our max
retries setting (from 1 to 3, I believe) and are still seeing the problem.
Here are some example URLs:
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-D426F221/ama/web/travel_Group-Travel.htm
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-CEF90BB0/ama/web/everything_auto_driver_ed.htm
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-DEA4DDC2/ama/web/everything_auto_Vehicle-Safety.htm
http://www.plentyoffish.com/personals/3147onlinedating.htm
http://www.plentyoffish.com/personals/1032onlinedating27.htm
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-CBE7B5E0/ama/web/insurance_Insurance-News.htm
Also, it seems like we're trying to fetch hundreds of thousands of pages from
some of these domains. Shouldn't Nutch be limiting the number of pages fetched
from a single domain? (I guess that's two questions.)
Offhand, it looks like the ama.ab.ca URLs have a session ID embedded in the
path (the SID-... segment). My first guess is that this is part of the problem.
These two domains make up something like 80-90% of our retries. Clearly we need
to stop the excessive retries and, at the same time, be a bit more polite to
those domains.
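One way to test the session ID guess would be to strip the SID-... segment from
each URL and count how many fetch-list entries collapse onto the same page. The
following is only a rough sketch in Python, run outside Nutch against a dumped
URL list. The regex and the helper names (strip_session, count_duplicates) are
assumptions for illustration, and whether the site still serves the page
without the session segment is untested.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: a path segment shaped like "SID-53ED365B-D426F221" is a session ID.
SESSION_SEGMENT = re.compile(r"/SID-[0-9A-Fa-f]+-[0-9A-Fa-f]+(?=/)")

def strip_session(url):
    """Return the URL with the assumed session segment removed from its path."""
    parts = urlsplit(url)
    path = SESSION_SEGMENT.sub("", parts.path)
    return parts._replace(path=path).geturl()

def count_duplicates(urls):
    """Map each session-stripped URL to the number of raw URLs that reduce to it."""
    return Counter(strip_session(u) for u in urls)

if __name__ == "__main__":
    sample = [
        "http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-D426F221/ama/web/travel_Group-Travel.htm",
        # Hypothetical second entry: same page, different session ID.
        "http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-CEF90BB0/ama/web/travel_Group-Travel.htm",
        "http://www.plentyoffish.com/personals/3147onlinedating.htm",
    ]
    for url, n in count_duplicates(sample).most_common():
        print(n, url)
```

If most ama.ab.ca entries collapse onto a small number of pages, the session ID
is probably inflating the fetch list for that domain.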
Thanks,
Glenn
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general