Hi Sheham,
the nutch-site.xml configures
<property>
<name>mapreduce.task.timeout</name>
<value>1800</value>
</property>
The value is in milliseconds, so 1800 means 1.8 seconds, which is very short.
The default is 600000 (600 seconds or 10 minutes), see [1]. Since Nutch needs to
finish fetching before the task timeout applies, fetcher threads that are not
fast enough and are still running at that point are killed.
I would suggest keeping the property "mapreduce.task.timeout" at its default
value.
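If the property has to stay in nutch-site.xml for some reason, a sketch with
the default taken from [1] would look like this (the value is in milliseconds,
600000 = 10 minutes; I have not tested it against your setup):
<property>
<name>mapreduce.task.timeout</name>
<value>600000</value>
</property>
Removing the property from nutch-site.xml altogether should have the same
effect, assuming it is not overridden anywhere else in your Hadoop
configuration.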
Best,
Sebastian
[1]
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml#mapreduce.task.timeout
On 4/24/24 16:38, Lewis John McGibbney wrote:
Hi Sheham,
On 2024/04/20 08:47:41 Sheham Izat wrote:
The Fetcher job was aborted, does that still mean that it went through the
entire list of seed urls?
Yes it processed the entire generated segment but the fetcher…
* hung on https://disneyland.disney.go.com/, https://api.onlyoffice.com/,
https://www.adu.com/ and https://www.lowes.com/
* was denied by robots.txt for https://sourceforge.net/,
https://onsclothing.com/, https://kinto-usa.com/, https://twitter.com/,
https://www.linkedin.com/, etc.
* encountered problems processing some robots.txt files for
https://twitter.com/, https://www.trustradius.com/
There may be some other issues encountered by the fetcher.
This is not at all uncommon. The fetcher completed successfully after 7
seconds. You could progress with your crawl.
I will go through the mailing list questions.
If you need more assistance please let us know. You will find plenty of
pointers on this mailing list archive though.
lewismc