Bump.

Is there anyone who can help me with this?

I'm not familiar enough with the Nutch source code to label this as a bug,
but it seems to be one, unless I have made some mistake as a newcomer to
Hadoop 2.  I have been running Nutch on Hadoop 1.X for years and never had
any problems like this.  Have I overlooked something in my setup?

I believe the error I posted is the one causing the indexing job to fail,
and I can rule out quite a few things that are not causing the problem:

-- I have used Nutch with minimal changes to the default configs, and Solr
with the unmodified schema and solrindex-mapping files provided in the
config.

-- The same error occurs on Hadoop 2.4.0, 2.4.1, and 2.7.1.

-- Solr 4.10.2 and Solr 4.10.4 make no difference.

-- Building Nutch and Solr with OpenJDK or Oracle JDK makes no difference.

It seems like Nutch/Hadoop never connects to Solr before it fails: with
Solr logging in verbose mode, zero lines of output appear when the indexer
job runs on Hadoop.

With identical data and settings, everything works fine in local mode.
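One pattern that would explain local-vs-distributed behavior like this (an assumption on my part, not something I have verified): each YARN child JVM builds its classpath with Hadoop's bundled jars ahead of the job jar's contents, so an older httpcore there could shadow the newer one that Nutch's httpclient expects, and `NoSuchFieldError: INSTANCE` is the classic symptom of that clash. A toy sketch of the first-jar-wins behavior, using a synthetic classpath (the paths and versions below are made up for illustration):

```shell
# Synthetic classpath; on a real cluster, Hadoop's lib jars typically
# precede the job jar's unless mapreduce.job.user.classpath.first is set.
cp="hadoop/lib/httpcore-4.2.5.jar:nutch/job/lib/httpcore-4.4.1.jar"

# The JVM resolves a class from the FIRST matching jar on the classpath.
first=$(echo "$cp" | tr ':' '\n' | grep 'httpcore-' | head -n1)
echo "$first"   # -> hadoop/lib/httpcore-4.2.5.jar
```

If that precedence is the culprit, the task JVMs would load the older httpcore no matter which versions ship inside the job file.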

Short of dumping the segments and indexing them in local mode, or trying
another indexer, I'm baffled.
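Before resorting to that, one check I can think of is an inventory of which httpcore/httpclient jars are visible to the task JVMs. The directory layout below is a synthetic stand-in (built under mktemp) just to show the pipeline; the real locations would be Hadoop's share directories and the lib/ folder inside the Nutch job file, and the jar versions are examples, not my actual ones:

```shell
# Synthetic stand-ins for Hadoop's lib dirs and the job jar's lib/ folder.
tmp=$(mktemp -d)
mkdir -p "$tmp/hadoop/lib" "$tmp/nutch/lib"
touch "$tmp/hadoop/lib/httpcore-4.2.5.jar"   # example version
touch "$tmp/nutch/lib/httpcore-4.4.1.jar"    # example version

# List every httpcore/httpclient jar; two different versions of the same
# artifact showing up would indicate a potential conflict.
found=$(find "$tmp" \( -name 'httpcore-*.jar' -o -name 'httpclient-*.jar' \) \
          -exec basename {} \; | sort)
echo "$found"
rm -rf "$tmp"
```

Run against the real directories, an output listing two versions of the same artifact would at least narrow down where to look next.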

Many thanks to anyone who can help me out.

Jason


On Thu, Jan 21, 2016 at 10:29 PM, Jason S <jason.stu...@gmail.com> wrote:

> Hi Markus,
>
> I guess that is part of my question: from the data in the top-level logs,
> how can I tell where to look?  I have spent a couple of days trying to
> understand Hadoop 2 logging, and I'm still not really sure.
>
> For example, I found this error here:
>
>
> ~/hadoop-2.4.0/logs/userlogs/application_1453403905213_0001/container_1453403905213_0001_01_000041/syslog
>
> At first I thought the "no such field" error meant I was trying to put
> data into a Solr field that didn't exist in the schema, but the same data
> indexes fine in local mode.  Also, there are no errors in the Solr logs.
>
> Thanks,
>
> Jason
>
> ### syslog error ###
>
> 2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2016-01-21 14:21:14,637 INFO [main] org.apache.nutch.indexer.anchor.AnchorIndexingFilter: Anchor deduplication is: on
> 2016-01-21 14:21:14,668 INFO [main] org.apache.nutch.indexer.IndexWriters: Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2016-01-21 14:21:14,916 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: INSTANCE
> at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:52)
> at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:56)
> at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<clinit>(DefaultHttpRequestWriterFactory.java:46)
> at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:72)
> at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:84)
> at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<clinit>(ManagedHttpClientConnectionFactory.java:59)
> at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.<init>(PoolingHttpClientConnectionManager.java:493)
> at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
> at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138)
> at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114)
> at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:726)
> at org.apache.nutch.indexwriter.solr.SolrUtils.getSolrServer(SolrUtils.java:57)
> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.open(SolrIndexWriter.java:58)
> at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:75)
> at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
> at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
>
> 2016-01-21 14:21:14,927 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
> 2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
> 2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.
>
>
>
> On Thu, Jan 21, 2016 at 9:47 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
>> Hi Jason - these are the top-level job logs, but to really know what's
>> going on, we need the actual reducer task logs.
>> Markus
>>
>>
>>
>> -----Original message-----
>> > From:Jason S <jason.stu...@gmail.com>
>> > Sent: Thursday 21st January 2016 20:35
>> > To: user@nutch.apache.org
>> > Subject: Nutch 1.11 indexing fails
>> >
>> > Hi,
>> >
>> > I am having a problem indexing segments in Nutch 1.11 on Hadoop.
>> >
>> > The cluster seems to be configured correctly and every part of the crawl
>> > process is working flawlessly; however, this is my first attempt at
>> > Hadoop 2, so perhaps my memory settings aren't perfect.  I'm also not
>> > sure where to look in the log files for more information.
>> >
>> > The same data can be indexed with Nutch in local mode, so I don't think
>> > it is a problem with the Solr configuration.  I also had Nutch 1.0.9
>> > with Hadoop 1.2.1 on this same cluster and everything worked OK.
>> >
>> > Please let me know if I can send more information; I have spent several
>> > days working on this with no success and no clue why it is happening.
>> >
>> > Thanks in advance,
>> >
>> > Jason
>> >
>> > ### Command ###
>> >
>> > /root/hadoop-2.4.0/bin/hadoop jar
>> > /root/src/apache-nutch-1.11/build/apache-nutch-1.11.job
>> > org.apache.nutch.indexer.IndexingJob crawl/crawldb -linkdb crawl/linkdb
>> > crawl/segments/20160121113335
>> >
>> > ### Error ###
>> >
>> > 16/01/21 14:20:47 INFO mapreduce.Job:  map 100% reduce 19%
>> > 16/01/21 14:20:48 INFO mapreduce.Job:  map 100% reduce 26%
>> > 16/01/21 14:20:48 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000001_0, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:20:48 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000002_0, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:20:48 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000000_0, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:20:49 INFO mapreduce.Job:  map 100% reduce 0%
>> > 16/01/21 14:20:54 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000004_0, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:20:55 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000002_1, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:20:56 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000001_1, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:21:00 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000000_1, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:21:01 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000004_1, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:21:02 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000002_2, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:21:07 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000003_0, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:21:08 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000004_2, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:21:08 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000001_2, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:21:11 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000000_2, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:21:15 INFO mapreduce.Job: Task Id :
>> > attempt_1453403905213_0001_r_000003_1, Status : FAILED
>> > Error: INSTANCE
>> > 16/01/21 14:21:16 INFO mapreduce.Job:  map 100% reduce 100%
>> > 16/01/21 14:21:16 INFO mapreduce.Job: Job job_1453403905213_0001 failed
>> > with state FAILED due to: Task failed task_1453403905213_0001_r_000004
>> > Job failed as tasks failed. failedMaps:0 failedReduces:1
>> >
>> > 16/01/21 14:21:16 INFO mapreduce.Job: Counters: 39
>> > File System Counters
>> > FILE: Number of bytes read=0
>> > FILE: Number of bytes written=5578886
>> > FILE: Number of read operations=0
>> > FILE: Number of large read operations=0
>> > FILE: Number of write operations=0
>> > HDFS: Number of bytes read=2277523
>> > HDFS: Number of bytes written=0
>> > HDFS: Number of read operations=80
>> > HDFS: Number of large read operations=0
>> > HDFS: Number of write operations=0
>> > Job Counters
>> > Failed reduce tasks=15
>> > Killed reduce tasks=2
>> > Launched map tasks=20
>> > Launched reduce tasks=17
>> > Data-local map tasks=19
>> > Rack-local map tasks=1
>> > Total time spent by all maps in occupied slots (ms)=334664
>> > Total time spent by all reduces in occupied slots (ms)=548199
>> > Total time spent by all map tasks (ms)=167332
>> > Total time spent by all reduce tasks (ms)=182733
>> > Total vcore-seconds taken by all map tasks=167332
>> > Total vcore-seconds taken by all reduce tasks=182733
>> > Total megabyte-seconds taken by all map tasks=257021952
>> > Total megabyte-seconds taken by all reduce tasks=561355776
>> > Map-Reduce Framework
>> > Map input records=18083
>> > Map output records=18083
>> > Map output bytes=3140643
>> > Map output materialized bytes=3178436
>> > Input split bytes=2812
>> > Combine input records=0
>> > Spilled Records=18083
>> > Failed Shuffles=0
>> > Merged Map outputs=0
>> > GC time elapsed (ms)=1182
>> > CPU time spent (ms)=56070
>> > Physical memory (bytes) snapshot=6087245824
>> > Virtual memory (bytes) snapshot=34655649792
>> > Total committed heap usage (bytes)=5412749312
>> > File Input Format Counters
>> > Bytes Read=2274711
>> > 16/01/21 14:21:16 ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
>> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>> > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>> > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>> > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> > at java.lang.reflect.Method.invoke(Method.java:606)
>> > at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>> >
>>
>
>
