Hi Sheham,

the nutch-site.xml configures

  <property>
    <name>mapreduce.task.timeout</name>
    <value>1800</value>
  </property>

1.8 seconds (1800 milliseconds) is very short. The default is 600 seconds or 10 minutes, see [1]. Since Nutch needs to finish fetching before the task timeout applies, fetcher threads that are not fast enough and are still running when the timeout strikes are killed.

I would suggest keeping the property "mapreduce.task.timeout" at its default 
value.
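
For reference, restoring the default would look like the following (the value is in milliseconds, so 600000 ms = 10 minutes, per mapred-default.xml [1]); alternatively, just remove the override from nutch-site.xml entirely so Hadoop's built-in default applies:

  <property>
    <name>mapreduce.task.timeout</name>
    <value>600000</value>
  </property>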

Best,
Sebastian

[1] https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml#mapreduce.task.timeout

On 4/24/24 16:38, Lewis John McGibbney wrote:
Hi Sheham,

On 2024/04/20 08:47:41 Sheham Izat wrote:

The Fetcher job was aborted, does that still mean that it went through the
entire list of seed urls?

Yes it processed the entire generated segment but the fetcher…

* hung on https://disneyland.disney.go.com/, https://api.onlyoffice.com/,  
https://www.adu.com/ and https://www.lowes.com/
* was denied by robots.txt for https://sourceforge.net/, 
https://onsclothing.com/, https://kinto-usa.com/, https://twitter.com/, 
https://www.linkedin.com/, etc.
* encountered problems processing some robots.txt files for 
https://twitter.com/, https://www.trustradius.com/
There may be some other issues encountered by the fetcher.

This is not at all uncommon. The fetcher completed successfully after 7 
seconds. You can proceed with your crawl.


I will go through the mailing list questions.

If you need more assistance please let us know. You will find plenty of 
pointers on this mailing list archive though.

lewismc
