RE: Nutch fetching times out at 3 hours, not sure why.

Chip Calhoun Thu, 19 Apr 2018 05:58:35 -0700

Hi Markus,

I don't see an indication of the web server blocking me, though that sounds 
reasonable. Could there be a per-server limit in Nutch itself that we're 
overlooking, since this is all on the same server?


Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Tuesday, April 17, 2018 3:58 PM
To: user@nutch.apache.org
Subject: RE: Nutch fetching times out at 3 hours, not sure why.

Hello Chip,

I have no clue where the three hour limit could come from. Please take a 
closer look at the last few minutes of the logs.

The only thing I can think of is that a web server would block you after some 
number of requests or some time window; that would be visible in the logs. It 
is clear that Nutch itself terminates the fetcher (the "dropping" line). That 
is only possible with an imposed time limit, or if you reached some number of 
exceptions (or one other variable I am forgetting).

Regards,
Markus
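
A minimal sketch of how one might pull the last few minutes out of a log for 
inspection, as suggested above, and measure how long the run lasted. It assumes 
every entry starts with a timestamp in the same format as the "dropping" line 
quoted in this thread, and that entries are in chronological order. The helper 
names (parse_ts, tail_window, elapsed) and the log path passed on the command 
line are hypothetical, not part of Nutch.

```python
import re
import sys
from datetime import datetime, timedelta

# Matches a leading "YYYY-MM-DD HH:MM:SS,mmm " timestamp, the format of the
# log line quoted in this thread. Other log layouts would need a new pattern.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) ")


def parse_ts(line):
    """Return the line's timestamp as a datetime, or None if it has none."""
    m = TIMESTAMP.match(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base + timedelta(milliseconds=int(m.group(2)))


def stamped(lines):
    """Pair each timestamped line with its datetime; skip all other lines."""
    out = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            out.append((ts, line))
    return out


def tail_window(lines, minutes=5):
    """Return the timestamped lines from the final `minutes` of the log."""
    entries = stamped(lines)
    if not entries:
        return []
    cutoff = entries[-1][0] - timedelta(minutes=minutes)
    return [line for ts, line in entries if ts >= cutoff]


def elapsed(lines):
    """Return the time between the first and last timestamped lines."""
    entries = stamped(lines)
    if not entries:
        return None
    return entries[-1][0] - entries[0][0]


if __name__ == "__main__":
    # Hypothetical usage: python tail_log.py path/to/logfile
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        log = fh.read().splitlines()
    print("elapsed:", elapsed(log))
    for line in tail_window(log, minutes=5):
        print(line)
```

Lines without a timestamp (stack trace continuations, for example) are skipped 
in this sketch, so an exception reported near the end would show only its first 
line.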
 
-----Original message-----
> From:Chip Calhoun <ccalh...@aip.org>
> Sent: Tuesday 17th April 2018 21:27
> To: user@nutch.apache.org
> Subject: RE: Nutch fetching times out at 3 hours, not sure why.
> 
> I'm on 1.12, and mine also defaults to -1. It does not fail at the same URL, 
> or even at the same point in a URL's fetcher loop; it really seems to be time 
> based. 
> 
> -----Original Message-----
> From: Sadiki Latty [mailto:sla...@uottawa.ca] 
> Sent: Tuesday, April 17, 2018 1:43 PM
> To: user@nutch.apache.org
> Subject: RE: Nutch fetching times out at 3 hours, not sure why.
> 
> Which version are you running? That value defaults to -1 in my current 
> version (1.14), so it shouldn't be something you needed to change. 
> My crawls, by default, run for as long as 12 hours with little to no 
> tweaking of nutch-default. Something else is causing it. Is 
> it always the same URL that it fails at?
> 
> -----Original Message-----
> From: Chip Calhoun [mailto:ccalh...@aip.org] 
> Sent: April-17-18 10:45 AM
> To: user@nutch.apache.org
> Subject: Nutch fetching times out at 3 hours, not sure why.
> 
> I crawl a list of roughly 2600 URLs, all on my local server, but only around 
> 1000 of them actually get crawled. The fetcher quits after exactly 3 hours 
> (give or take a few milliseconds) with this message in the log:
> 
> 2018-04-13 15:50:48,885 INFO  fetcher.FetchItemQueues - * queue: 
> https://history.aip.org >> dropping!
> 
> I've seen that 3 hours is the default in some Nutch installations, but I've 
> got my fetcher.timelimit.mins set to -1. I'm sure I'm missing something 
> obvious. Any thoughts would be greatly appreciated. Thank you.
> 
> Chip Calhoun
> Digital Archivist
> Niels Bohr Library & Archives
> American Institute of Physics
> One Physics Ellipse
> College Park, MD  20740-3840  USA
> Tel: +1 301-209-3180
> Email: ccalh...@aip.org
> https://www.aip.org/history-programs/niels-bohr-library
> 
>
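
Since the thread turns on which value of fetcher.timelimit.mins is actually in 
effect, here is a minimal sketch of checking that by layering configuration 
sources, with later layers overriding earlier ones. It assumes the usual 
Hadoop-style <configuration>/<property>/<name>/<value> XML layout. The sample 
XML strings, the function names, and the "launch override" layer are 
illustrative assumptions; in particular, the 180 below is just three hours in 
minutes, chosen to match the symptom, and is not a confirmed cause.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-ins for the defaults file and the site overrides file.
SAMPLE_DEFAULT = """
<configuration>
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>-1</value>
  </property>
</configuration>
""".strip()

SAMPLE_SITE = """
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>example-crawler</value>
  </property>
</configuration>
""".strip()


def read_properties(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def effective_value(name, *layers):
    """Return the value from the last layer that defines `name`, else None."""
    result = None
    for layer in layers:
        if name in layer:
            result = layer[name]
    return result


if __name__ == "__main__":
    defaults = read_properties(SAMPLE_DEFAULT)
    site = read_properties(SAMPLE_SITE)
    key = "fetcher.timelimit.mins"

    # Files only: the site file does not set the key, so the default wins.
    print(effective_value(key, defaults, site))  # prints -1

    # Hypothetical launch-time override (e.g. a -D option added by a wrapper
    # script); an assumption for illustration, not something this thread shows.
    launch = {key: "180"}
    print(effective_value(key, defaults, site, launch))  # prints 180
```

The point of the layering is that a value of -1 in the XML files says nothing 
about what a later layer may have set, so it is worth checking how the fetch 
job is actually launched.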
