[ https://issues.apache.org/jira/browse/NUTCH-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Otis Gospodnetic updated NUTCH-629:
-----------------------------------

    Attachment: NUTCH-629.patch

> Detect slow and timeout servers and drop their URLs
> ---------------------------------------------------
>
>                 Key: NUTCH-629
>                 URL: https://issues.apache.org/jira/browse/NUTCH-629
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Otis Gospodnetic
>         Attachments: NUTCH-629.patch
>
>
> Fetch jobs will finish faster if we find a way to prevent servers that are
> either slow or time out from slowing down the whole process.
> I'll attach a patch that counts per-server exceptions and timeouts and tracks
> download speed per server.
> Queues/servers that exceed timeout or download thresholds are marked as
> "tooManyErrors" or "tooSlow". Once they get marked as such, all of their
> subsequent URLs get dropped (i.e. they are not fetched) and are marked GONE.
> At the end of the fetch task, stats for each server processed are printed.
> Also, I believe the per-host/domain/TLD/etc. DB from NUTCH-628 would be the
> right place to add server data collected by this patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
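
The per-server bookkeeping described in the issue could look roughly like the sketch below. It is not taken from NUTCH-629.patch and does not use any Nutch API: the class name `ServerHealthTracker`, its methods, and the three thresholds (`maxErrors`, `minBytesPerSecond`, `minSamples`) are all hypothetical names chosen for illustration. It only shows the idea of counting errors, tracking average download speed, and refusing further URLs once a server is marked.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of per-server tracking; not code from NUTCH-629.patch.
final class ServerHealthTracker {
    enum Status { OK, TOO_MANY_ERRORS, TOO_SLOW }

    private static final class Stats {
        int errors;      // exceptions and timeouts seen for this server
        int fetched;     // successful fetches
        int dropped;     // URLs skipped after the server was marked
        long bytes;      // total bytes downloaded
        long millis;     // total download time in milliseconds
        Status status = Status.OK;

        double bytesPerSecond() {
            return millis == 0 ? Double.POSITIVE_INFINITY : bytes * 1000.0 / millis;
        }
    }

    private final int maxErrors;            // assumed threshold
    private final double minBytesPerSecond; // assumed threshold
    private final int minSamples;           // fetches required before judging speed
    private final Map<String, Stats> byHost = new TreeMap<>();

    ServerHealthTracker(int maxErrors, double minBytesPerSecond, int minSamples) {
        this.maxErrors = maxErrors;
        this.minBytesPerSecond = minBytesPerSecond;
        this.minSamples = minSamples;
    }

    private Stats stats(String host) {
        return byHost.computeIfAbsent(host, h -> new Stats());
    }

    // Count an exception or timeout; mark the server once the threshold is exceeded.
    void recordError(String host) {
        Stats s = stats(host);
        s.errors++;
        if (s.status == Status.OK && s.errors > maxErrors) {
            s.status = Status.TOO_MANY_ERRORS;
        }
    }

    // Record a successful download; mark the server if its average speed is too low.
    void recordSuccess(String host, long bytes, long millis) {
        Stats s = stats(host);
        s.fetched++;
        s.bytes += bytes;
        s.millis += millis;
        if (s.status == Status.OK && s.fetched >= minSamples
                && s.bytesPerSecond() < minBytesPerSecond) {
            s.status = Status.TOO_SLOW;
        }
    }

    // Returns false (and counts a dropped URL) once the server has been marked.
    boolean shouldFetch(String host) {
        Stats s = stats(host);
        if (s.status == Status.OK) {
            return true;
        }
        s.dropped++;
        return false;
    }

    Status statusOf(String host) {
        return stats(host).status;
    }

    // One line of stats per server, for printing at the end of a fetch task.
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Stats> e : byHost.entrySet()) {
            Stats s = e.getValue();
            sb.append(String.format("%s status=%s errors=%d fetched=%d dropped=%d%n",
                    e.getKey(), s.status, s.errors, s.fetched, s.dropped));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ServerHealthTracker tracker = new ServerHealthTracker(1, 1000.0, 2);
        tracker.recordError("slow.example");
        tracker.recordError("slow.example");
        tracker.shouldFetch("slow.example");
        System.out.print(tracker.report());
    }
}
```

In this sketch a fetcher thread would call `shouldFetch(host)` before each request and mark the URL as gone when it returns false. A marked status is sticky, so later successes do not un-mark a server; whether the actual patch behaves that way is not stated in the issue.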