Hi Markus,

I don't see an indication of the web server blocking me, though that sounds reasonable. Could there be a per-server limit in Nutch itself that we're overlooking, since this is all on the same server?
Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, April 17, 2018 3:58 PM
To: user@nutch.apache.org
Subject: RE: Nutch fetching times out at 3 hours, not sure why.

Hello Chip,

I have no clue where the three hour limit could come from. Please take a further look in the last few minutes of the logs.

The only thing I can think of is that a webserver would block you after some amount of requests/time window; that would be visible in the logs. It is clear Nutch itself terminates the fetcher (the dropping line). That is only possible with an imposed time limit, or if you reached some number of exceptions (or one other variable I am forgetting).

Regards,
Markus

-----Original message-----
> From:Chip Calhoun <ccalh...@aip.org>
> Sent: Tuesday 17th April 2018 21:27
> To: user@nutch.apache.org
> Subject: RE: Nutch fetching times out at 3 hours, not sure why.
>
> I'm on 1.12, and mine also defaulted at -1. It does not fail at the same URL,
> or even at the same point in a URL's fetcher loop; it really seems to be time
> based.
>
> -----Original Message-----
> From: Sadiki Latty [mailto:sla...@uottawa.ca]
> Sent: Tuesday, April 17, 2018 1:43 PM
> To: user@nutch.apache.org
> Subject: RE: Nutch fetching times out at 3 hours, not sure why.
>
> Which version are you running? That value is defaulted to -1 in my current
> version (1.14) so shouldn't be something you should have needed to change.
> My crawls, by default, go for as much as even 12 hours with little to no
> tweaking necessary from the nutch-default. Something else is causing it. Is
> it always the same URL that it fails at?
>
> -----Original Message-----
> From: Chip Calhoun [mailto:ccalh...@aip.org]
> Sent: April-17-18 10:45 AM
> To: user@nutch.apache.org
> Subject: Nutch fetching times out at 3 hours, not sure why.
>
> I crawl a list of roughly 2600 URLs all on my local server, and I'm only
> crawling around 1000 of them. The fetcher quits after exactly 3 hours (give
> or take a few milliseconds) with this message in the log:
>
> 2018-04-13 15:50:48,885 INFO fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!
>
> I've seen that 3 hours is the default in some Nutch installations, but I've
> got my fetcher.timelimit.mins set to -1. I'm sure I'm missing something
> obvious. Any thoughts would be greatly appreciated. Thank you.
>
> Chip Calhoun
> Digital Archivist
> Niels Bohr Library & Archives
> American Institute of Physics
> One Physics Ellipse
> College Park, MD 20740-3840 USA
> Tel: +1 301-209-3180
> Email: ccalh...@aip.org
> https://www.aip.org/history-programs/niels-bohr-library
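
Since the thread turns on whether the cutoff really is time based, a rough way to check is to measure how long the fetcher ran before the first "dropping!" line. The following is only a sketch under stated assumptions: it assumes every relevant log line starts with a timestamp in the same layout as the line quoted above, and the names `parse_ts` and `minutes_until_drop` are hypothetical helpers, not part of Nutch. The first sample line is made up for illustration; only its timestamp matters.

```python
from datetime import datetime

# Timestamp layout of the log line quoted in the thread: "2018-04-13 15:50:48,885".
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = len("2018-04-13 15:50:48,885")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def minutes_until_drop(lines):
    """Minutes from the first timestamped line to the first 'dropping!' line.

    Returns None if no 'dropping!' line with a timestamp is found.
    """
    start = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip continuation lines such as stack traces
        if start is None:
            start = ts
        if "dropping!" in line:
            return (ts - start).total_seconds() / 60
    return None


sample = [
    # Hypothetical first line of the fetch job (assumption, not from the thread).
    "2018-04-13 12:50:48,880 INFO (hypothetical first line of the fetch job)",
    # Line quoted in the original message.
    "2018-04-13 15:50:48,885 INFO fetcher.FetchItemQueues - * queue: "
    "https://history.aip.org >> dropping!",
]

print(round(minutes_until_drop(sample), 2))
```

If the result on a real log comes out very close to 180 on every run, that points to an imposed time limit rather than a particular URL or an exception count. One thing that may be worth checking in that case (an assumption, not something confirmed in this thread) is whether whatever launches the fetch job passes its own value for fetcher.timelimit.mins, which would override the -1 in the configuration file.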