Hi Sebastian,
NUTCH_HOME=~/nutch, i.e. it is on the local filesystem. I am using a plain,
pre-built Hadoop.
There's no "mapreduce.job.dir" property I can grep for in Hadoop 3.2.1 or
3.3.0, nor in Nutch 1.18 or 1.19, but mapreduce.job.hdfs-servers defaults
to ${fs.defaultFS}, so s3a://temp-crawler in our case.
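Sanity check, in case it helps anyone reading later: the effective default
can be read back from the loaded configuration with the stock Hadoop CLI,

`hdfs getconf -confKey fs.defaultFS`

which prints s3a://temp-crawler on my setup.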
The plugin loader doesn't appear to be able to read from S3 in Nutch 1.18
with Hadoop 3.2.1 [1].
Using Java & javac 11 with a downloaded and untarred Hadoop 3.3.0 and a
Nutch 1.19 I built myself: I can run a MapReduce job on S3 and a Nutch job
on HDFS, but running Nutch on S3 still gives "URLNormalizer not found",
whether the plugin dir is on the local filesystem or on S3A.
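Concretely, the failing call is the same Injector invocation as in my
earlier mail below; the relative paths just resolve against fs.defaultFS
now, i.e. against s3a://temp-crawler:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector crawl/crawldb urls`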
How would you recommend I go about getting the plugin loader to read from
other file systems?
[1] I still get 'x point org.apache.nutch.net.URLNormalizer not found'
(same stack trace as in my previous email) with

<property>
  <name>plugin.folders</name>
  <value>s3a://temp-crawler/user/hdoop/nutch-plugins</value>
</property>

set in my nutch-site.xml, even though `hadoop fs -ls
s3a://temp-crawler/user/hdoop/nutch-plugins` lists all the plugins as present.
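(If a read/permission quirk is a suspect, a quick way to rule it out would be
to cat one of the plugin descriptors directly over S3A; urlnormalizer-basic
below is just an example plugin, and the path assumes the plugin directories
were copied over unchanged, each with its plugin.xml:

`hadoop fs -cat
s3a://temp-crawler/user/hdoop/nutch-plugins/urlnormalizer-basic/plugin.xml`)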
For posterity:
I got Hadoop 3.3.0 working with an S3 backend by copying the AWS jars onto
the common classpath:

cd ~/hadoop-3.3.0
cp ./share/hadoop/tools/lib/hadoop-aws-3.3.0.jar ./share/hadoop/common/lib
cp ./share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.563.jar ./share/hadoop/common/lib

This solved "Class org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory not
found", which I kept getting even though the class exists in
~/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-aws-3.3.0.jar, that jar shows
up on the classpath when I check with `hadoop classpath | tr ":" "\n" | grep
share/hadoop/tools/lib/hadoop-aws-3.3.0.jar`, and I had also added it in
hadoop-env.sh.

See
https://stackoverflow.com/questions/58415928/spark-s3-error-java-lang-classnotfoundexception-class-org-apache-hadoop-f
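For anyone reproducing this setup, a minimal core-site.xml sketch of the S3A
default filesystem is below; the property names are the standard S3A ones,
the key values are placeholders, and my actual configuration may differ in
detail:

<property>
  <name>fs.defaultFS</name>
  <value>s3a://temp-crawler</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>PLACEHOLDER_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>PLACEHOLDER_SECRET_KEY</value>
</property>
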
On Tue, Jun 15, 2021 at 2:01 AM Sebastian Nagel
<[email protected]> wrote:
> > The local file system? Or hdfs:// or even s3:// resp. s3a://?
>
> Also important: the value of "mapreduce.job.dir" - it's usually
> on hdfs:// and I'm not sure whether the plugin loader is able to
> read from other filesystems. At least, I haven't tried.
>
>
> On 6/15/21 10:53 AM, Sebastian Nagel wrote:
> > Hi Clark,
> >
> > sorry, I should have read your mail to the end - you mentioned that
> > you downgraded Nutch to run with JDK 8.
> >
> > Could you share which filesystem NUTCH_HOME points to?
> > The local file system? Or hdfs:// or even s3:// resp. s3a://?
> >
> > Best,
> > Sebastian
> >
> >
> > On 6/15/21 10:24 AM, Clark Benham wrote:
> >> Hi,
> >>
> >>
> >> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
> >> backend/filesystem; however I get an error ‘URLNormalizer class not found’.
> >> I have edited nutch-site.xml so this plugin should be included:
> >>
> >> <property>
> >>   <name>plugin.includes</name>
> >>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
> >> </property>
> >>
> >> and then built on both nodes (I only have 2 machines). I’ve successfully
> >> run Nutch locally and in distributed mode using HDFS, and I’ve run a
> >> MapReduce job with S3 as Hadoop’s file system.
> >>
> >>
> >> I thought it was possible Nutch is not reading nutch-site.xml, because I
> >> can resolve one error by setting the config through the CLI, despite this
> >> duplicating nutch-site.xml.
> >>
> >> The command:
> >>
> >> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> org.apache.nutch.fetcher.Fetcher
> >> crawl/crawldb crawl/segments`
> >>
> >> throws
> >>
> >> `java.lang.IllegalArgumentException: Fetcher: No agents listed in
> >> 'http.agent.name' property`
> >>
> >> while if I pass a value in for http.agent.name with
> >> `-Dhttp.agent.name=myScrapper`,
> >> (making the command `hadoop jar
> >> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> org.apache.nutch.fetcher.Fetcher
> >> -Dhttp.agent.name=clark crawl/crawldb crawl/segments`), I get an error
> >> about there being no input path, which makes sense as I haven’t been able
> >> to generate any segments.
> >>
> >>
> >> However this method of setting Nutch configs doesn’t work for injecting
> >> URLs; e.g.:
> >>
> >> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> org.apache.nutch.crawl.Injector
> >> -Dplugin.includes=".*" crawl/crawldb urls`
> >>
> >> fails with the same “URLNormalizer” not found.
> >>
> >>
> >> I tried copying the plugin dir to S3 and setting
> >> <name>plugin.folders</name> to be a path on S3, without success. (I expect
> >> the plugins to be bundled with the .job file, so this step should be
> >> unnecessary.)
> >>
> >>
> >> The full stack trace for `hadoop jar
> >> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> org.apache.nutch.crawl.Injector
> >> crawl/crawldb urls`:
> >>
> >> SLF4J: Class path contains multiple SLF4J bindings.
> >>
> >> SLF4J: Found binding in
> >> [jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> >>
> >> SLF4J: Found binding in
> >> [jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> >>
> >> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> >> explanation.
> >>
> >> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> >>
> >> #Took out multiple INFO messages
> >>
> >> 2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id :
> >> attempt_1623740678244_0001_m_000001_0, Status : FAILED
> >>
> >> Error: java.lang.RuntimeException: x point
> >> org.apache.nutch.net.URLNormalizer not found.
> >>
> >> at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)
> >>
> >> at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
> >>
> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> >>
> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
> >>
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
> >>
> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
> >>
> >> at java.security.AccessController.doPrivileged(Native Method)
> >>
> >> at javax.security.auth.Subject.doAs(Subject.java:422)
> >>
> >> at
> >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> >>
> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> >>
> >>
> >> #This error repeats 6 times total, 3 times for each node
> >>
> >>
> >> 2021-06-15 07:06:26,035 INFO mapreduce.Job: map 100% reduce 100%
> >>
> >> 2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001
> >> failed with state FAILED due to: Task failed
> >> task_1623740678244_0001_m_000001
> >>
> >> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
> >> killedReduces: 0
> >>
> >>
> >> 2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14
> >>
> >> Job Counters
> >>
> >> Failed map tasks=7
> >>
> >> Killed map tasks=1
> >>
> >> Killed reduce tasks=1
> >>
> >> Launched map tasks=8
> >>
> >> Other local map tasks=6
> >>
> >> Rack-local map tasks=2
> >>
> >> Total time spent by all maps in occupied slots (ms)=63196
> >>
> >> Total time spent by all reduces in occupied slots (ms)=0
> >>
> >> Total time spent by all map tasks (ms)=31598
> >>
> >> Total vcore-milliseconds taken by all map tasks=31598
> >>
> >> Total megabyte-milliseconds taken by all map tasks=8089088
> >>
> >> Map-Reduce Framework
> >>
> >> CPU time spent (ms)=0
> >>
> >> Physical memory (bytes) snapshot=0
> >>
> >> Virtual memory (bytes) snapshot=0
> >>
> >> 2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not succeed,
> >> job status: FAILED, reason: Task failed task_1623740678244_0001_m_000001
> >>
> >> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
> >> killedReduces: 0
> >>
> >>
> >> 2021-06-15 07:06:29,562 ERROR crawl.Injector: Injector:
> >> java.lang.RuntimeException: Injector job did not succeed, job status:
> >> FAILED, reason: Task failed task_1623740678244_0001_m_000001
> >>
> >> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
> >> killedReduces: 0
> >>
> >>
> >> at org.apache.nutch.crawl.Injector.inject(Injector.java:444)
> >>
> >> at org.apache.nutch.crawl.Injector.run(Injector.java:571)
> >>
> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> >>
> >> at org.apache.nutch.crawl.Injector.main(Injector.java:535)
> >>
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>
> >> at
> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >>
> >> at
> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>
> >> at java.lang.reflect.Method.invoke(Method.java:498)
> >>
> >> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> >>
> >> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> >>
> >>
> >>
> >>
> >> P.S.
> >>
> >> I am using a downloaded hadoop-3.2.1; and the only odd thing about my Nutch
> >> build is that I had to replace all instances of “javac.version” with
> >> “ant.java.version”, as the javac version was 11 while Java’s was 1.8, giving
> >> the error ‘javac invalid target release: 11’:
> >>
> >> grep -rl "javac.version" . --include "*.xml" | xargs sed -i
> >> s^javac.version^ant.java.version^g
> >>
> >> grep -rl "ant.ant" . --include "*.xml" | xargs sed -i s^ant.ant.^ant.^g
> >>
> >
>
>