
I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
backend/filesystem; however I get an error ‘URLNormalizer class not found’.
I have edited nutch-site.xml so this plugin should be included:





 and then built on both nodes (I only have 2 machines).  I’ve successfully
run Nutch locally and in distributed mode using HDFS, and I’ve run a
mapreduce job with S3 as hadoop’s file system.

I suspected that Nutch is not reading nutch-site.xml, because I can
resolve one error by setting the property on the command line, even though
the same value is already set in nutch-site.xml.

The command:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
crawl/crawldb crawl/segments`


fails with:

`java.lang.IllegalArgumentException: Fetcher: No agents listed in
'http.agent.name' property`

If I instead pass a value for http.agent.name on the command line
(making the command `hadoop jar
-Dhttp.agent.name=clark crawl/crawldb crawl/segments`), I get an error
about there being no input path, which makes sense as I haven’t been able
to generate any segments.
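
One way to check whether the deployed .job actually carries the edited configuration is to read the property straight out of the archive. This is only a minimal sketch: `prop_value` is a hypothetical helper (not part of Nutch or Hadoop), it assumes `unzip` is installed, that the .job is a zip-format archive, and that nutch-site.xml sits at the root of that archive, which may not match every build.

```shell
# Hypothetical helper: reads Hadoop-style configuration XML on stdin and
# prints the <value> that follows <name>$1</name>.
prop_value() {
  awk -v name="$1" '
    index($0, "<name>" name "</name>") { found = 1 }
    found && match($0, /<value>[^<]*<\/value>/) {
      # Strip the 7-character <value> and 8-character </value> wrappers.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  '
}

# Assumed location of the job file, taken from the commands in this post.
JOB="${NUTCH_HOME:-}/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job"
if [ -f "$JOB" ]; then
  # The path of nutch-site.xml inside the archive is an assumption.
  unzip -p "$JOB" nutch-site.xml | prop_value http.agent.name
fi
```

If this prints nothing for a property that is set in the local nutch-site.xml, the job file was probably built before the edit or from a different conf directory.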

However, this method of setting Nutch configuration properties doesn’t
work for injecting URLs; e.g.:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
-Dplugin.includes=".*" crawl/crawldb urls`

fails with the same “URLNormalizer not found” error.

I tried copying the plugin dir to S3 and setting
<name>plugin.folders</name> to a path on S3, without success. (I expect
the plugins to be bundled with the .job, so this step should be unnecessary.)
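
The expectation that the plugins are bundled can be tested by listing the archive. Again a sketch under stated assumptions: `has_plugin` is a hypothetical helper, `unzip` must be installed, the .job is assumed to be a zip-format archive, and plugin directories are assumed to appear under some `plugins/<plugin-id>/` path inside it. `urlnormalizer-basic` is used as an example plugin id.

```shell
# Hypothetical helper: reads an archive listing on stdin and succeeds if any
# entry contains the path component plugins/<plugin-id>/.
has_plugin() {
  grep -q -F "plugins/$1/"
}

# Assumed location of the job file, taken from the commands in this post.
JOB="${NUTCH_HOME:-}/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job"
if [ -f "$JOB" ]; then
  if unzip -l "$JOB" | has_plugin urlnormalizer-basic; then
    echo "urlnormalizer-basic is bundled in the job file"
  else
    echo "urlnormalizer-basic was not found in the job file"
  fi
fi
```

If the plugin directory is present in the archive, copying plugins to S3 is likely unnecessary and the problem lies in how the tasks resolve plugin.folders or plugin.includes.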

The full stack trace for `hadoop jar
crawl/crawldb urls`:

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in

SLF4J: Found binding in

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

#Took out multiple INFO messages

2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id :
attempt_1623740678244_0001_m_000001_0, Status : FAILED

Error: java.lang.RuntimeException: x point
org.apache.nutch.net.URLNormalizer not found.

at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)

at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)


at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

#This error repeats 6 times total, 3 times for each node

2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%

2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001
failed with state FAILED due to: Task failed

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0

2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14

Job Counters

Failed map tasks=7

Killed map tasks=1

Killed reduce tasks=1

Launched map tasks=8

Other local map tasks=6

Rack-local map tasks=2

Total time spent by all maps in occupied slots (ms)=63196

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=31598

Total vcore-milliseconds taken by all map tasks=31598

Total megabyte-milliseconds taken by all map tasks=8089088

Map-Reduce Framework

CPU time spent (ms)=0

Physical memory (bytes) snapshot=0

Virtual memory (bytes) snapshot=0

2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not succeed,
job status: FAILED, reason: Task failed task_1623740678244_0001_m_000001

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0

2021-06-15 07:06:29,562 ERROR crawl.Injector: Injector:
java.lang.RuntimeException: Injector job did not succeed, job status:
FAILED, reason: Task failed task_1623740678244_0001_m_000001

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0

at org.apache.nutch.crawl.Injector.inject(Injector.java:444)

at org.apache.nutch.crawl.Injector.run(Injector.java:571)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)

at org.apache.nutch.crawl.Injector.main(Injector.java:535)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)



at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:323)

at org.apache.hadoop.util.RunJar.main(RunJar.java:236)


I am using a downloaded hadoop-3.2.1, and the only odd thing about my Nutch
build is that I had to replace all instances of “javac.version” with
“ant.java.version”, because the javac target version was 11 while my Java
is 1.8, which gave the error ‘javac invalid target release: 11’:

grep -rl "javac.version" . --include "*.xml" | xargs sed -i

grep -rl "ant.ant" . --include "*.xml" | xargs sed -i s^ant.ant.^ant.^g
