I'm not sure whether it is OK to attach files to a list email, so if anyone
wants to look at the log files, they are here:

https://s3.amazonaws.com/nutch-hadoop-error/hadoop-nutch-error.tgz

This crawl was done on Ubuntu 15.10 with OpenJDK 8; however, I have also hit
the same error on Ubuntu 14 with OpenJDK 7 and Oracle JDK 7, and with Hadoop
both in single-server mode and on a cluster with one master and 5 slaves.

This crawl had minimal changes to the config files: only http.agent.name and
solr.server.url were changed.  Nutch was built with ant ("ant clean runtime").

The entire log directory, covering a full
inject/generate/fetch/parse/updatedb/index cycle, is in there.  As indicated
in my previous messages, everything works fine until the indexer, and the
same data indexes fine in local mode.
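
For anyone reproducing this, the cycle in those logs corresponds to the
standard sequence below (just a sketch using the bin/nutch wrappers with
placeholder URL dir and segment names; the indexing step at the end is the
one that fails on Hadoop):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  bin/nutch fetch crawl/segments/<segment>
  bin/nutch parse crawl/segments/<segment>
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
  bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/<segment>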

Thanks in advance,

Jason


On Sat, Jan 23, 2016 at 11:43 AM, Jason S <jason.stu...@gmail.com> wrote:

> Bump.
>
> Is there anyone who can help me with this?
>
> I'm not familiar enough with the Nutch source code to label this as a bug,
> but it seems to be one, unless I have made some mistake as a newcomer to
> Hadoop 2.  I have been running Nutch on Hadoop 1.x for years and never had
> any problems like this.  Have I overlooked something in my setup?
>
> I believe the error I posted is the one causing the indexing job to fail.
> I can also confirm quite a few things that are not causing the problem:
>
> -- I have used Nutch with minimal changes to the default configs, and Solr
> with exactly the unmodified schema.xml and solrindex-mapping.xml files
> shipped in the Nutch conf directory.
>
> -- The same error occurs on Hadoop 2.4.0, 2.4.1, and 2.7.1.
>
> -- Using Solr 4.10.2 or Solr 4.10.4 makes no difference.
>
> -- Building Nutch and Solr with OpenJDK or Oracle JDK makes no difference.
>
> It seems like Nutch/Hadoop never connects to Solr before it fails: with
> Solr logging in verbose mode, not a single line of output appears when the
> indexer job runs on Hadoop.
>
> With the same data and settings, everything works fine in local mode.
>
> Short of copying the segments out of HDFS and indexing them in local mode,
> or trying another indexer, I'm baffled.
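>
> To be clear, the local-mode fallback I mean is roughly this (a sketch only;
> run from runtime/local after copying the data out of HDFS, and the local
> paths and segment name are just placeholders):
>
>   hadoop fs -copyToLocal crawl /tmp/crawl
>   bin/nutch index /tmp/crawl/crawldb -linkdb /tmp/crawl/linkdb /tmp/crawl/segments/<segment>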
>
> Many thanks to anyone who can help me out.
>
> Jason
>
>
> On Thu, Jan 21, 2016 at 10:29 PM, Jason S <jason.stu...@gmail.com> wrote:
>
>> Hi Markus,
>>
>> I guess that is part of my question: given the data in the top-level logs,
>> how can I tell where to look?  I have spent a couple of days trying to
>> understand Hadoop 2 logging, and I'm still not really sure.
>>
>> For example, I found this error here:
>>
>>
>> ~/hadoop-2.4.0/logs/userlogs/application_1453403905213_0001/container_1453403905213_0001_01_000041/syslog
>>
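>> In case it is useful, this is roughly how I have been digging out the
>> per-container logs (a sketch; "yarn logs" only works with log aggregation
>> enabled, and the application id is simply the one from this run):
>>
>>   less ~/hadoop-2.4.0/logs/userlogs/application_1453403905213_0001/container_1453403905213_0001_01_000041/syslog
>>   ~/hadoop-2.4.0/bin/yarn logs -applicationId application_1453403905213_0001
>>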
>> At first I thought the "no such field" error meant I was trying to send
>> data to Solr for a field that does not exist in the schema, but the same
>> data indexes fine in local mode.  Also, there are no errors in the Solr
>> logs.
>>
>> Thanks,
>>
>> Jason
>>
>> ### syslog error ###
>>
>> 2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
>> 2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>> 2016-01-21 14:21:14,637 INFO [main] org.apache.nutch.indexer.anchor.AnchorIndexingFilter: Anchor deduplication is: on
>> 2016-01-21 14:21:14,668 INFO [main] org.apache.nutch.indexer.IndexWriters: Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>> 2016-01-21 14:21:14,916 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: INSTANCE
>>     at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:52)
>>     at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:56)
>>     at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<clinit>(DefaultHttpRequestWriterFactory.java:46)
>>     at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:72)
>>     at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:84)
>>     at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<clinit>(ManagedHttpClientConnectionFactory.java:59)
>>     at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.<init>(PoolingHttpClientConnectionManager.java:493)
>>     at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
>>     at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138)
>>     at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114)
>>     at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:726)
>>     at org.apache.nutch.indexwriter.solr.SolrUtils.getSolrServer(SolrUtils.java:57)
>>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.open(SolrIndexWriter.java:58)
>>     at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:75)
>>     at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
>>     at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
>>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
>>
>> 2016-01-21 14:21:14,927 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
>> 2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
>> 2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.
>>
>>
>>
>> On Thu, Jan 21, 2016 at 9:47 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>
>>> Hi Jason - these are the top-level job logs, but to really know what's
>>> going on we need the actual reducer task logs.
>>> Markus
>>>
>>>
>>>
>>> -----Original message-----
>>> > From:Jason S <jason.stu...@gmail.com>
>>> > Sent: Thursday 21st January 2016 20:35
>>> > To: user@nutch.apache.org
>>> > Subject: Indexing Nutch 1.11 indexing Fails
>>> >
>>> > Hi,
>>> >
>>> > I am having a problem indexing segments in Nutch 1.11 on Hadoop.
>>> >
>>> > The cluster seems to be configured correctly and every part of the
>>> > crawl process is working flawlessly; however, this is my first attempt
>>> > at Hadoop 2, so perhaps my memory settings aren't perfect.  I'm also not
>>> > sure where to look in the log files for more information.
>>> >
>>> > The same data can be indexed with Nutch in local mode, so I don't think
>>> > it is a problem with the Solr configuration, and I have run Nutch 1.0.9
>>> > with Hadoop 1.2.1 on this same cluster with everything working fine.
>>> >
>>> > Please let me know if I can send more information; I have spent several
>>> > days working on this with no success and no clue why it is happening.
>>> >
>>> > Thanks in advance,
>>> >
>>> > Jason
>>> >
>>> > ### Command ###
>>> >
>>> > /root/hadoop-2.4.0/bin/hadoop jar /root/src/apache-nutch-1.11/build/apache-nutch-1.11.job org.apache.nutch.indexer.IndexingJob crawl/crawldb -linkdb crawl/linkdb crawl/segments/20160121113335
>>> >
>>> > ### Error ###
>>> >
>>> > 16/01/21 14:20:47 INFO mapreduce.Job:  map 100% reduce 19%
>>> > 16/01/21 14:20:48 INFO mapreduce.Job:  map 100% reduce 26%
>>> > 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_0, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_0, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_0, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:20:49 INFO mapreduce.Job:  map 100% reduce 0%
>>> > 16/01/21 14:20:54 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_0, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:20:55 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_1, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:20:56 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_1, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:21:00 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_1, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:21:01 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_1, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:21:02 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_2, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:21:07 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000003_0, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:21:08 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_2, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:21:08 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_2, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:21:11 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_2, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:21:15 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000003_1, Status : FAILED
>>> > Error: INSTANCE
>>> > 16/01/21 14:21:16 INFO mapreduce.Job:  map 100% reduce 100%
>>> > 16/01/21 14:21:16 INFO mapreduce.Job: Job job_1453403905213_0001 failed with state FAILED due to: Task failed task_1453403905213_0001_r_000004
>>> > Job failed as tasks failed. failedMaps:0 failedReduces:1
>>> >
>>> > 16/01/21 14:21:16 INFO mapreduce.Job: Counters: 39
>>> > File System Counters
>>> > FILE: Number of bytes read=0
>>> > FILE: Number of bytes written=5578886
>>> > FILE: Number of read operations=0
>>> > FILE: Number of large read operations=0
>>> > FILE: Number of write operations=0
>>> > HDFS: Number of bytes read=2277523
>>> > HDFS: Number of bytes written=0
>>> > HDFS: Number of read operations=80
>>> > HDFS: Number of large read operations=0
>>> > HDFS: Number of write operations=0
>>> > Job Counters
>>> > Failed reduce tasks=15
>>> > Killed reduce tasks=2
>>> > Launched map tasks=20
>>> > Launched reduce tasks=17
>>> > Data-local map tasks=19
>>> > Rack-local map tasks=1
>>> > Total time spent by all maps in occupied slots (ms)=334664
>>> > Total time spent by all reduces in occupied slots (ms)=548199
>>> > Total time spent by all map tasks (ms)=167332
>>> > Total time spent by all reduce tasks (ms)=182733
>>> > Total vcore-seconds taken by all map tasks=167332
>>> > Total vcore-seconds taken by all reduce tasks=182733
>>> > Total megabyte-seconds taken by all map tasks=257021952
>>> > Total megabyte-seconds taken by all reduce tasks=561355776
>>> > Map-Reduce Framework
>>> > Map input records=18083
>>> > Map output records=18083
>>> > Map output bytes=3140643
>>> > Map output materialized bytes=3178436
>>> > Input split bytes=2812
>>> > Combine input records=0
>>> > Spilled Records=18083
>>> > Failed Shuffles=0
>>> > Merged Map outputs=0
>>> > GC time elapsed (ms)=1182
>>> > CPU time spent (ms)=56070
>>> > Physical memory (bytes) snapshot=6087245824
>>> > Virtual memory (bytes) snapshot=34655649792
>>> > Total committed heap usage (bytes)=5412749312
>>> > File Input Format Counters
>>> > Bytes Read=2274711
>>> > 16/01/21 14:21:16 ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
>>> >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>>> >     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>>> >     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>>> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>> >     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>>> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> >     at java.lang.reflect.Method.invoke(Method.java:606)
>>> >     at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>>> >
>>>
>>
>>
>
