Hi Sheham,

the nutch-site.xml configures

  <property>
    <name>mapreduce.task.timeout</name>
    <value>1800</value>
  </property>

1.8 seconds (1800 milliseconds) is very short. The default is 600 seconds or 10 minutes, see [1]. Since Nutch needs to finish fetching before the task timeout applies, fetcher threads that are not fast enough and are still running when the timeout strikes are killed.

I would suggest keeping the property "mapreduce.task.timeout" at its default 
value.
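
For reference, restoring the default would look like the following (the value is in milliseconds, so 600000 ms = 10 minutes, per mapred-default.xml [1]); alternatively, just remove the override from nutch-site.xml entirely so Hadoop's built-in default applies:

  <property>
    <name>mapreduce.task.timeout</name>
    <value>600000</value>
  </property>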

Best,
Sebastian

[1] https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml#mapreduce.task.timeout

On 4/24/24 16:38, Lewis John McGibbney wrote:
Hi Sheham,

On 2024/04/20 08:47:41 Sheham Izat wrote:

The Fetcher job was aborted, does that still mean that it went through the
entire list of seed urls?

Yes it processed the entire generated segment but the fetcher…

* hung on https://disneyland.disney.go.com/, https://api.onlyoffice.com/,  
https://www.adu.com/ and https://www.lowes.com/
* was denied by robots.txt for https://sourceforge.net/, 
https://onsclothing.com/, https://kinto-usa.com/, https://twitter.com/, 
https://www.linkedin.com/, etc.
* encountered problems processing some robots.txt files for 
https://twitter.com/, https://www.trustradius.com/
There may be some other issues encountered by the fetcher.

This is not at all uncommon. The fetcher completed successfully after 7 
seconds. You can proceed with your crawl.


I will go through the mailing list questions.

If you need more assistance please let us know. You will find plenty of 
pointers on this mailing list archive though.

lewismc
