Hi,

I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
backend/filesystem; however, I get the error ‘URLNormalizer class not found’.
I have edited nutch-site.xml so that the urlnormalizer plugins should be included:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
</property>

and then built on both nodes (I only have 2 machines). I’ve successfully
run Nutch locally and in distributed mode using HDFS, and I’ve run a
MapReduce job with S3 as Hadoop’s file system.
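
To rule out a packaging problem, I plan to list the job file’s contents and grep for the plugin names (the .job should just be a jar/zip, though I’m not certain of the exact internal layout, so I only grep):

jar tf $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job | grep urlnormalizer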


I suspect Nutch may not be reading nutch-site.xml, because I can resolve
one error by passing the same setting on the command line, even though
that duplicates what is already in nutch-site.xml.
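
To test that, I can also dump the nutch-site.xml that is actually packed inside the .job and compare it with the one I edited (assuming the conf files end up at the root of the job file, which is how I understand the deploy build works):

unzip -p $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job nutch-site.xml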

The command:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.fetcher.Fetcher crawl/crawldb crawl/segments`

throws

`java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property`

while if I pass a value for http.agent.name with `-Dhttp.agent.name=myScrapper`
(making the command `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.fetcher.Fetcher -Dhttp.agent.name=myScrapper crawl/crawldb crawl/segments`),
I get an error about there being no input path, which makes sense as I haven’t
been able to generate any segments.
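
(For context, the sequence I’m ultimately trying to get through is the usual inject -> generate -> fetch, so the fetcher has nothing to read until the earlier steps succeed; roughly:

hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Injector crawl/crawldb urls
hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Generator crawl/crawldb crawl/segments

with the fetcher then run on whatever segment the Generator produces.)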


However, this method of setting Nutch config properties doesn’t work for
injecting URLs; e.g.:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Injector -Dplugin.includes=".*" crawl/crawldb urls`

fails with the same “URLNormalizer not found” error.
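
One thing I still want to try is going through the wrapper script shipped in the deploy directory, in case it sets up the config/classpath differently from calling hadoop jar directly (assuming it works the way I think it does):

cd $NUTCH_HOME/runtime/deploy
bin/nutch inject crawl/crawldb urls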


I tried copying the plugin dir to S3 and setting <name>plugin.folders</name>
to a path on S3, without success. (I expect the plugins to be bundled with
the .job, so this step should be unnecessary.)
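
If it matters, a quick way for me to double-check that the copied directory is even visible to Hadoop would be something like this (the bucket path is a placeholder for the one I actually used):

hadoop fs -ls s3a://<my-bucket>/nutch/plugins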


The full stack trace for `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Injector crawl/crawldb urls`:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

# Removed multiple INFO messages

2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id : attempt_1623740678244_0001_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
        at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)
        at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)


# This error repeats 6 times in total, 3 times for each node


2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%
2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001 failed with state FAILED due to: Task failed task_1623740678244_0001_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14
        Job Counters
                Failed map tasks=7
                Killed map tasks=1
                Killed reduce tasks=1
                Launched map tasks=8
                Other local map tasks=6
                Rack-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=63196
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=31598
                Total vcore-milliseconds taken by all map tasks=31598
                Total megabyte-milliseconds taken by all map tasks=8089088
        Map-Reduce Framework
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0

2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not succeed, job status: FAILED, reason: Task failed task_1623740678244_0001_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

2021-06-15 07:06:29,562 ERROR crawl.Injector: Injector: java.lang.RuntimeException: Injector job did not succeed, job status: FAILED, reason: Task failed task_1623740678244_0001_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0
        at org.apache.nutch.crawl.Injector.inject(Injector.java:444)
        at org.apache.nutch.crawl.Injector.run(Injector.java:571)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.nutch.crawl.Injector.main(Injector.java:535)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:236)




P.S.

I am using a downloaded hadoop-3.2.1, and the only odd thing about my Nutch
build is that I had to replace all instances of "javac.version" with
"ant.java.version", because the javac version was 11 while java's was 1.8,
giving the error 'javac invalid target release: 11':

grep -rl "javac.version" . --include "*.xml" | xargs sed -i s^javac.version^ant.java.version^g

grep -rl "ant.ant" . --include "*.xml" | xargs sed -i s^ant.ant.^ant.^g
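
(In hindsight, a less invasive fix might have been to point the build at a JDK 8 install instead, something like the following, where the JDK path is a placeholder for wherever Java 8 lives on my machines:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ant runtime)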
