[ 
http://issues.apache.org/jira/browse/NUTCH-205?page=comments#action_12365434 ] 

Andrzej Bialecki  commented on NUTCH-205:
-----------------------------------------

This is a design choice, not a bug. The errors you see are due to improper 
configuration: some threads cannot access the host for a long time because 
of the limit on concurrent requests to a single host. Please see the 
"fetcher.threads.per.host" and "http.max.delays" config properties.
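The two properties mentioned above can be overridden in conf/nutch-site.xml. A sketch of such an override follows; the values shown are illustrative, not recommendations:

```xml
<!-- Illustrative overrides for conf/nutch-site.xml (values are examples only). -->
<!-- Allow more concurrent requests to a single host, so threads block less. -->
<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
</property>
<!-- Allow more waits before a fetch gives up with "Exceeded http.max.delays". -->
<property>
  <name>http.max.delays</name>
  <value>100</value>
</property>
```

Raising http.max.delays trades longer fetch runs for fewer RetryLater failures.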

> Wrong 'fetch date' for non available pages
> ------------------------------------------
>
>          Key: NUTCH-205
>          URL: http://issues.apache.org/jira/browse/NUTCH-205
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7, 0.7.1
>  Environment: JDK 1.4.2_09 / Windows 2000 / Using standard Nutch-API
>     Reporter: M.Oliver Scheele

>
> Web pages that couldn't be fetched because of a time-out are never 
> refetched.
> The next fetch time in the web DB is set to Long.MAX_VALUE.
> Example:
> -------------
> While fetching our URLs, we got some errors like this:
> 60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html  
> failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
> Exceeded http.max.delays: retry later.
> That seems to be ok and indicates some network problems.
> The problem is that the entry in the Webdb shows the following:
> Page 4: Version: 4
> URL: http://www.test-domain.de/crawl_html/page_2.html
> ID: b360ec931855b0420776909bd96557c0
> Next fetch: Sun Aug 17 07:12:55 CET 292278994
> Retries since fetch: 0
> Retry interval: 0 days
> The 'Next fetch' date is set to the year 292278994.
> I probably won't live to see that refetch. ;)
> A page that couldn't be crawled because of network problems
> should be refetched with the next crawl (i.e., set the next fetch date to 
> the current time + 1h).
> Possible bug fix:
> ----------------------------
> When updating the web DB, the method updateForSegment() in 
> UpdateDatabaseTool
> always sets the fetch date to Long.MAX_VALUE for any (unknown) exception 
> during fetching.
> The RETRY status is not always set correctly.
> Change the following lines:
> } else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
>            page.getRetriesSinceFetch() < MAX_RETRIES) {
>   pageRetry(fo);                      // retry later
> } else {
>   pageGone(fo);                       // give up: page is gone
> }
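To make the proposed behavior concrete, here is a minimal, hypothetical sketch in plain Java (not actual Nutch 0.7 source; MAX_RETRIES and the one-hour interval are assumptions matching the suggestion above): a transient RETRY failure keeps being rescheduled an hour out while its retry budget lasts, instead of having its next-fetch date pushed to Long.MAX_VALUE.

```java
// Hypothetical sketch of the suggested fix; names and constants are
// assumptions, not the actual Nutch 0.7 code.
public class RetrySketch {
    static final int MAX_RETRIES = 3;                   // assumed retry budget
    static final long ONE_HOUR_MS = 60L * 60L * 1000L;  // suggested interval

    // A transient failure (e.g. RetryLater) should be retried while the
    // retry budget lasts; only then is the page treated as gone.
    static boolean shouldRetry(boolean transientFailure, int retriesSinceFetch) {
        return transientFailure && retriesSinceFetch < MAX_RETRIES;
    }

    // Proposed next-fetch time for a retried page: one hour from now,
    // rather than Long.MAX_VALUE (which yields the year 292278994).
    static long nextFetch(long nowMs) {
        return nowMs + ONE_HOUR_MS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        if (shouldRetry(true, 0)) {
            System.out.println("reschedule at " + nextFetch(now));
        }
    }
}
```

With this policy, a page hitting "Exceeded http.max.delays" stays schedulable and is picked up by the next crawl cycle.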



