Hi Markus,

I guess that is part of my question: given the data in the top-level logs, how can I tell where to look? I have spent a couple of days trying to understand Hadoop 2 logging, and I'm still not really sure.

For example, I found this error here:

~/hadoop-2.4.0/logs/userlogs/application_1453403905213_0001/container_1453403905213_0001_01_000041/syslog
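
To find that, this is roughly what I've been doing (just a sketch, assuming log aggregation is not enabled, so the container logs stay under the NodeManager's local userlogs directory; the application id is the one from the job output below):

# list the containers for the failed application
ls ~/hadoop-2.4.0/logs/userlogs/application_1453403905213_0001/
# find which container syslogs contain a FATAL entry
grep -rl FATAL ~/hadoop-2.4.0/logs/userlogs/application_1453403905213_0001/
# with log aggregation enabled, this should work instead:
# yarn logs -applicationId application_1453403905213_0001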

At first I thought the NoSuchFieldError meant I was trying to send data to Solr for a field that doesn't exist in the schema, but the same data indexes fine in local mode, and there are no errors in the Solr logs.
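
If I read the stack trace correctly, NoSuchFieldError is a JVM linkage error (a class compiled against one version of a library ends up loading an older version at runtime), so another guess is a conflicting httpclient/httpcore jar between Hadoop and the Nutch job file rather than anything on the Solr side. This is only how I plan to check, assuming the standard binary layout under share/hadoop/common/lib and the job file path from the command below:

# http* jars Hadoop puts on the task classpath
ls ~/hadoop-2.4.0/share/hadoop/common/lib/ | grep -i http
# http* jars bundled inside the Nutch job file
unzip -l /root/src/apache-nutch-1.11/build/apache-nutch-1.11.job | grep -i http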

Thanks,

Jason

### syslog error ###

2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2016-01-21 14:21:14,637 INFO [main] org.apache.nutch.indexer.anchor.AnchorIndexingFilter: Anchor deduplication is: on
2016-01-21 14:21:14,668 INFO [main] org.apache.nutch.indexer.IndexWriters: Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-01-21 14:21:14,916 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: INSTANCE
        at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:52)
        at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:56)
        at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<clinit>(DefaultHttpRequestWriterFactory.java:46)
        at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:72)
        at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:84)
        at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<clinit>(ManagedHttpClientConnectionFactory.java:59)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.<init>(PoolingHttpClientConnectionManager.java:493)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114)
        at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:726)
        at org.apache.nutch.indexwriter.solr.SolrUtils.getSolrServer(SolrUtils.java:57)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.open(SolrIndexWriter.java:58)
        at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:75)
        at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

2016-01-21 14:21:14,927 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.



On Thu, Jan 21, 2016 at 9:47 PM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hi Jason - these are the top-level job logs but to really know what's
> going on, we need the actual reducer task logs.
> Markus
>
>
>
> -----Original message-----
> > From:Jason S <jason.stu...@gmail.com>
> > Sent: Thursday 21st January 2016 20:35
> > To: user@nutch.apache.org
> > Subject: Indexing Nutch 1.11 indexing Fails
> >
> > Hi,
> >
> > I am having a problem indexing segments in Nutch 1.11 on Hadoop.
> >
> > The cluster seems to be configured correctly and every part of the crawl
> > process is working flawlessly, however this is my first attempt at hadoop
> > 2, so perhaps my memory settings aren't perfect.  I'm also not sure where
> > to look in the log files for more information.
> >
> > The same data can be indexed with Nutch in local mode, so I don't think
> it
> > is a problem with the Solr configuration, and I have had Nutch 1.0.9 with
> > Hadoop 1.2.1 on this same cluster and everything worked ok.
> >
> > Please let me know if I can send more information, I have spent several
> > days working on this with no success or clue why it is happening.
> >
> > Thanks in advance,
> >
> > Jason
> >
> > ### Command ###
> >
> > /root/hadoop-2.4.0/bin/hadoop jar
> > /root/src/apache-nutch-1.11/build/apache-nutch-1.11.job
> > org.apache.nutch.indexer.IndexingJob crawl/crawldb -linkdb crawl/linkdb
> > crawl/segments/20160121113335
> >
> > ### Error ###
> >
> > 16/01/21 14:20:47 INFO mapreduce.Job:  map 100% reduce 19%
> > 16/01/21 14:20:48 INFO mapreduce.Job:  map 100% reduce 26%
> > 16/01/21 14:20:48 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000001_0, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:20:48 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000002_0, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:20:48 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000000_0, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:20:49 INFO mapreduce.Job:  map 100% reduce 0%
> > 16/01/21 14:20:54 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000004_0, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:20:55 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000002_1, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:20:56 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000001_1, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:21:00 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000000_1, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:21:01 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000004_1, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:21:02 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000002_2, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:21:07 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000003_0, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:21:08 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000004_2, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:21:08 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000001_2, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:21:11 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000000_2, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:21:15 INFO mapreduce.Job: Task Id :
> > attempt_1453403905213_0001_r_000003_1, Status : FAILED
> > Error: INSTANCE
> > 16/01/21 14:21:16 INFO mapreduce.Job:  map 100% reduce 100%
> > 16/01/21 14:21:16 INFO mapreduce.Job: Job job_1453403905213_0001 failed
> > with state FAILED due to: Task failed task_1453403905213_0001_r_000004
> > Job failed as tasks failed. failedMaps:0 failedReduces:1
> >
> > 16/01/21 14:21:16 INFO mapreduce.Job: Counters: 39
> > File System Counters
> > FILE: Number of bytes read=0
> > FILE: Number of bytes written=5578886
> > FILE: Number of read operations=0
> > FILE: Number of large read operations=0
> > FILE: Number of write operations=0
> > HDFS: Number of bytes read=2277523
> > HDFS: Number of bytes written=0
> > HDFS: Number of read operations=80
> > HDFS: Number of large read operations=0
> > HDFS: Number of write operations=0
> > Job Counters
> > Failed reduce tasks=15
> > Killed reduce tasks=2
> > Launched map tasks=20
> > Launched reduce tasks=17
> > Data-local map tasks=19
> > Rack-local map tasks=1
> > Total time spent by all maps in occupied slots (ms)=334664
> > Total time spent by all reduces in occupied slots (ms)=548199
> > Total time spent by all map tasks (ms)=167332
> > Total time spent by all reduce tasks (ms)=182733
> > Total vcore-seconds taken by all map tasks=167332
> > Total vcore-seconds taken by all reduce tasks=182733
> > Total megabyte-seconds taken by all map tasks=257021952
> > Total megabyte-seconds taken by all reduce tasks=561355776
> > Map-Reduce Framework
> > Map input records=18083
> > Map output records=18083
> > Map output bytes=3140643
> > Map output materialized bytes=3178436
> > Input split bytes=2812
> > Combine input records=0
> > Spilled Records=18083
> > Failed Shuffles=0
> > Merged Map outputs=0
> > GC time elapsed (ms)=1182
> > CPU time spent (ms)=56070
> > Physical memory (bytes) snapshot=6087245824
> > Virtual memory (bytes) snapshot=34655649792
> > Total committed heap usage (bytes)=5412749312
> > File Input Format Counters
> > Bytes Read=2274711
> > 16/01/21 14:21:16 ERROR indexer.IndexingJob: Indexer:
> java.io.IOException:
> > Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
> > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
> > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:606)
> > at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> >
>
