Re: Nutch fetching times out at 3 hours, not sure why.

Sebastian Nagel Mon, 30 Apr 2018 09:21:46 -0700

Hi,

if you still see the log message


   fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!

then the cause can only be one of these two properties:
 - fetcher.timelimit.mins
 - fetcher.max.exceptions.per.queue
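
As a rough sketch of where to check these two, assuming the usual override
file conf/nutch-site.xml: as far as I know both properties default to -1
(no limit), and the values below only restate that default; they are not a
recommendation. A value passed on the command line with -D, for example by a
wrapper script, would normally take precedence over this file, so it is also
worth checking how the fetcher is actually launched.

```xml
<!-- conf/nutch-site.xml (sketch; the file location is an assumption) -->
<configuration>
  <!-- Fetcher stops and drops remaining queue items after this many minutes; -1 disables the limit -->
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>-1</value>
  </property>
  <!-- Queue is dropped after this many protocol-level exceptions; -1 disables the limit -->
  <property>
    <name>fetcher.max.exceptions.per.queue</name>
    <value>-1</value>
  </property>
</configuration>
```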

> I crawl a list of roughly 2600 URLs all on my local server

If this is the case you can crawl more aggressively, see
  fetcher.server.delay
or even fetch in parallel from your host, see
  fetcher.threads.per.queue
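
A minimal sketch of such a setup in conf/nutch-site.xml, assuming you control
the server and it can take the load. The values 1.0 and 2 are purely
illustrative, not tested recommendations; the defaults are, to my knowledge,
5.0 seconds and 1 thread.

```xml
<!-- conf/nutch-site.xml (sketch; values are illustrative assumptions) -->
<configuration>
  <!-- Seconds to wait between successive requests to the same server -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>
  <!-- Number of threads allowed to fetch from the same queue (host) at once -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>
</configuration>
```

With a single host, all URLs land in one queue, so these two settings largely
determine how many pages can be fetched within any given time limit.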

Best,
Sebastian

On 04/30/2018 04:44 PM, Chip Calhoun wrote:
> I'm still experimenting with this. I had been crawling with a depth of 1 
> because I don't need anything outside my URLs list, but I tried with a depth 
> of 10. It went through a crawl loop that ended after 3 hours, then a second 3 
> hour crawl loop, then a third shorter loop. It still stopped 5 URLs short of 
> crawling every URL in my list, though it crawled a few I hadn't included. 
> 
> Are these 3 hour loops standard for large crawls?
> 
> -----Original Message-----
> From: Chip Calhoun [mailto:ccalh...@aip.org] 
> Sent: Tuesday, April 17, 2018 3:27 PM
> To: user@nutch.apache.org
> Subject: RE: Nutch fetching times out at 3 hours, not sure why.
> 
> I'm on 1.12, and mine also defaults to -1. It does not fail at the same URL, 
> or even at the same point in a URL's fetcher loop; it really seems to be time 
> based. 
> 
> -----Original Message-----
> From: Sadiki Latty [mailto:sla...@uottawa.ca] 
> Sent: Tuesday, April 17, 2018 1:43 PM
> To: user@nutch.apache.org
> Subject: RE: Nutch fetching times out at 3 hours, not sure why.
> 
> Which version are you running? That value defaults to -1 in my current 
> version (1.14), so it shouldn't be something you needed to change. 
> My crawls, by default, go for as much as even 12 hours with little to no 
> tweaking necessary from the nutch-default. Something else is causing it. Is 
> it always the same URL that it fails at?
> 
> -----Original Message-----
> From: Chip Calhoun [mailto:ccalh...@aip.org] 
> Sent: April-17-18 10:45 AM
> To: user@nutch.apache.org
> Subject: Nutch fetching times out at 3 hours, not sure why.
> 
> I crawl a list of roughly 2600 URLs all on my local server, and I'm only 
> crawling around 1000 of them. The fetcher quits after exactly 3 hours (give 
> or take a few milliseconds) with this message in the log:
> 
> 2018-04-13 15:50:48,885 INFO  fetcher.FetchItemQueues - * queue: 
> https://history.aip.org >> dropping!
> 
> I've seen that 3 hours is the default in some Nutch installations, but I've 
> got my fetcher.timelimit.mins set to -1. I'm sure I'm missing something 
> obvious. Any thoughts would be greatly appreciated. Thank you.
> 
> Chip Calhoun
> Digital Archivist
> Niels Bohr Library & Archives
> American Institute of Physics
> One Physics Ellipse
> College Park, MD  20740-3840  USA
> Tel: +1 301-209-3180
> Email: ccalh...@aip.org
> https://www.aip.org/history-programs/niels-bohr-library
>
