Hi Jason,

sorry, that was a misunderstanding: the patch of NUTCH-2191 will not fix
your problem. But Markus mentioned in the discussion that he had to
remove the http* jars to fix dependency problems. What I want to say is
that our plugin system does not provide complete isolation, although
every plugin has its own class loader.

However, your problem seems really weird. I had a look into the code of
httpcore where the exception is raised:

https://github.com/apache/httpcore/blob/4.3.x/httpcore/src/main/java/org/apache/http/impl/io/DefaultHttpRequestWriterFactory.java#L52

The field INSTANCE is referenced and should be defined here:

https://github.com/apache/httpcore/blob/4.3.x/httpcore/src/main/java/org/apache/http/message/BasicLineFormatter.java#L65

Older versions (4.2.x) are missing this field:

https://github.com/apache/httpcore/blob/4.2.x/httpcore/src/main/java/org/apache/http/message/BasicLineFormatter.java

It's the same library (httpcore)! It seems hardly possible that two
class files are taken from different versions of the same library.
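If you want to double-check, here is a little sketch (untested, just an
illustration; compile it against httpcore 4.3.x and run it with the same
class path as the failing task) that prints which jar each of the two
classes is actually loaded from and whether the INSTANCE field is
visible at runtime:

  import java.lang.reflect.Field;

  public class CheckHttpcore {
    public static void main(String[] args) throws Exception {
      Class<?> blf = org.apache.http.message.BasicLineFormatter.class;
      Class<?> fac = org.apache.http.impl.io.DefaultHttpRequestWriterFactory.class;
      // Which jar was each class loaded from?
      // (getCodeSource() can be null for bootstrap classes, not here)
      System.out.println(blf.getProtectionDomain().getCodeSource().getLocation());
      System.out.println(fac.getProtectionDomain().getCodeSource().getLocation());
      // Succeeds on httpcore >= 4.3.x, throws NoSuchFieldException on 4.2.x:
      Field f = blf.getField("INSTANCE");
      System.out.println("INSTANCE found: " + f);
    }
  }

If the two printed locations differ, or getField() throws, you really do
have mixed httpcore versions on the class path.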
> I have added -verbose:class to mapred.child.java.opts, but I don't see
> any difference in the output, I am uploading another zip of the log

Ok. Sorry, I still have to find out myself how to set -verbose:class in
(pseudo-)distributed mode. Does anyone know how to do this?
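A guess, based on the Hadoop documentation (I haven't verified this in
distributed mode): on Hadoop 2.x the child JVM options are split per
task type, so something like the following in mapred-site.xml might
work (keep whatever heap settings you already have in the value):

  <!-- untested sketch: pass -verbose:class to the map and reduce child JVMs -->
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024m -verbose:class</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx1024m -verbose:class</value>
  </property>

Note that -verbose:class writes to the JVM's stdout, so the output
should end up in the task's stdout container log rather than in the
syslog file - maybe that's why you don't see any difference there.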
> In the past, I just copied nutch-1.9/lib to hadoop-1.2.1/lib, and if
> there was a conflict, I kept the version of the file distributed with
> Nutch. Now the Nutch and Hadoop file structures are vastly different,
> so I don't understand, is this a problem with my configuration or with
> Nutch?

That's not necessary. Everything needed to run the Nutch jobs is
contained in apache-nutch-1.11.job. Besides, since you are using Hadoop
2.7.1, hadoop-1.2.1/lib or jars from there shouldn't be on the class
path anyway. But it may be a good idea to make sure the class path
isn't tainted.
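One way to check for a tainted class path (again, just a sketch): ask
the class loader for every copy of the class it can see - if more than
one URL is printed, two versions of httpcore are competing:

  import java.net.URL;
  import java.util.Enumeration;

  public class FindDuplicates {
    public static void main(String[] args) throws Exception {
      // List every copy of BasicLineFormatter visible to the class loader.
      Enumeration<URL> urls = FindDuplicates.class.getClassLoader()
          .getResources("org/apache/http/message/BasicLineFormatter.class");
      while (urls.hasMoreElements()) {
        System.out.println(urls.nextElement());
      }
    }
  }

You could run this on each node with the same class path as the Hadoop
tasks to spot stray httpcore jars.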
Cheers,
Sebastian

On 01/24/2016 01:29 AM, Jason S wrote:
> Hi Sebastian,
>
> I had a look at NUTCH-2191 and the suggestions in there didn't help
> with this issue.
>
> When I apply the patch, I get a build error in 1.11 and trunk:
>
> BUILD FAILED
> /root/src/nutch-trunk/build.xml:116: The following error occurred while
> executing this line:
> /root/src/nutch-trunk/src/plugin/build.xml:54: The following error
> occurred while executing this line:
> /root/src/nutch-trunk/src/plugin/protocol-htmlunit/build.xml:39:
> /root/src/nutch-trunk/src/plugin/protocol-htmlunit/src/test does not
> exist.
>
> I'm not sure where to find the protocol-htmlunit plugin.
>
> Also, removing the http*.jar, jersey*.jar and jetty*.jar as suggested
> doesn't work. I just keep getting the same error as above.
>
> I have added -verbose:class to mapred.child.java.opts, but I don't see
> any difference in the output, I am uploading another zip of the log
> directories. The logs are here:
> https://s3.amazonaws.com/nutch-hadoop-error/hadoop-nutch-error2.tgz
>
> I have searched my system, and I don't find any of the http*.jar files
> in hadoop, although one of them is in /usr/share/java, but deleting it
> doesn't seem to make a difference.
>
> In the past, I just copied nutch-1.9/lib to hadoop-1.2.1/lib, and if
> there was a conflict, I kept the version of the file distributed with
> Nutch. Now the Nutch and Hadoop file structures are vastly different,
> so I don't understand, is this a problem with my configuration or with
> Nutch?
>
> Thanks,
>
> Jason
>
> On Sat, Jan 23, 2016 at 10:05 PM, Sebastian Nagel
> <wastl.na...@googlemail.com> wrote:
>
>> Hi Jason,
>>
>> this looks like a library dependency version conflict, probably
>> between httpcore and httpclient. The classes on top of the stack
>> belong to these libs:
>> org.apache.http.impl.io.DefaultHttpRequestWriterFactory -> httpcore
>> org.apache.http.impl.conn.ManagedHttpClientConnectionFactory ->
>> httpclient
>>
>> You mentioned that indexing to Solr works in local mode.
>> Is it possible that the mapreduce tasks get a wrong httpcore (or
>> httpclient) lib? They should use those from the apache-nutch-1.11.job,
>> from classes/plugins/indexer-solr/ strictly speaking.
>>
>> We know that there are problems because the plugin class loader
>> asks its parent first, see [1] for the most recent discussion.
>>
>> Can you try to add -verbose:class so that you can see in the logs
>> from which jar the classes are loaded? Sorry, I didn't try this in
>> (pseudo-)distributed mode yet. According to the documentation
>> it should be possible to set this option in "mapred.child.java.opts"
>> in your mapred-site.xml (check also the other *.java.opts properties).
>>
>> Cheers,
>> Sebastian
>>
>> [1] https://issues.apache.org/jira/browse/NUTCH-2191
>>
>> On 01/23/2016 04:09 PM, Jason S wrote:
>>> I'm not sure if it is ok to attach files to a list email, if anyone
>>> wants to look at some log files, they're here:
>>>
>>> https://s3.amazonaws.com/nutch-hadoop-error/hadoop-nutch-error.tgz
>>>
>>> This crawl was done on Ubuntu 15.10 and OpenJDK 8, however, I have
>>> also had the error with Ubuntu 14, OpenJDK 7 and Oracle JDK 7, and
>>> with Hadoop in single server mode and on a cluster with a master and
>>> 5 slaves.
>>>
>>> This crawl had minimal changes made to the config files, only
>>> http.agent.name and solr.server.url were changed. Nutch was built
>>> with ant, "ant clean runtime".
>>>
>>> The entire log directory with a full
>>> inject/generate/fetch/parse/updatedb/index cycle is in there. As
>>> indicated in my previous messages, everything works fine until the
>>> indexer, and the same data indexes fine in local mode.
>>>
>>> Thanks in advance,
>>>
>>> Jason
>>>
>>> On Sat, Jan 23, 2016 at 11:43 AM, Jason S <jason.stu...@gmail.com>
>>> wrote:
>>>
>>>> Bump.
>>>>
>>>> Is there anyone who can help me with this?
>>>>
>>>> I'm not familiar enough with the Nutch source code to label this as
>>>> a bug, but it seems to be the case, unless I have made some mistake
>>>> being new to Hadoop 2. I have been running Nutch on Hadoop 1.x for
>>>> years and never had any problems like this. Have I overlooked
>>>> something in my setup?
>>>>
>>>> I believe the error I posted is the one causing the indexing job to
>>>> fail, and I can confirm quite a few things that are not causing the
>>>> problem:
>>>>
>>>> -- I have used Nutch with minimal changes to the default configs,
>>>> and Solr with exactly the unmodified schema and solrindex-mapping
>>>> files provided in the config.
>>>>
>>>> -- The same error occurs on Hadoop 2.4.0, 2.4.1 and 2.7.1.
>>>>
>>>> -- Solr 4.10.2 and Solr 4.10.4 make no difference.
>>>>
>>>> -- Building Nutch and Solr with OpenJDK or Oracle JDK makes no
>>>> difference.
>>>>
>>>> It seems like Nutch/Hadoop never connects to Solr before it fails;
>>>> Solr logging in verbose mode creates 0 lines of output when the
>>>> indexer job runs on Hadoop.
>>>>
>>>> All data/settings/everything the same works fine in local mode.
>>>>
>>>> Short of dumping segments to local mode and indexing them that way,
>>>> or trying another indexer, I'm baffled.
>>>>
>>>> Many thanks if someone could help me out.
>>>>
>>>> Jason
>>>>
>>>> On Thu, Jan 21, 2016 at 10:29 PM, Jason S <jason.stu...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Markus,
>>>>>
>>>>> I guess that is part of my question: from the data in the
>>>>> top-level logs, how can I tell where to look? I have spent a
>>>>> couple of days trying to understand Hadoop 2 logging, and I'm
>>>>> still not really very sure.
>>>>>
>>>>> For example, I found this error here:
>>>>>
>>>>> ~/hadoop-2.4.0/logs/userlogs/application_1453403905213_0001/container_1453403905213_0001_01_000041/syslog
>>>>>
>>>>> At first I thought the "no such field" error meant I was trying to
>>>>> put data in Solr where the field didn't exist in the schema, but
>>>>> the same data indexes fine in local mode. Also, there are no
>>>>> errors in the Solr logs.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jason
>>>>>
>>>>> ### syslog error ###
>>>>>
>>>>> 2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
>>>>> 2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2016-01-21 14:21:14,637 INFO [main] org.apache.nutch.indexer.anchor.AnchorIndexingFilter: Anchor deduplication is: on
>>>>> 2016-01-21 14:21:14,668 INFO [main] org.apache.nutch.indexer.IndexWriters: Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>> 2016-01-21 14:21:14,916 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: INSTANCE
>>>>> at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:52)
>>>>> at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:56)
>>>>> at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<clinit>(DefaultHttpRequestWriterFactory.java:46)
>>>>> at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:72)
>>>>> at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:84)
>>>>> at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<clinit>(ManagedHttpClientConnectionFactory.java:59)
>>>>> at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.<init>(PoolingHttpClientConnectionManager.java:493)
>>>>> at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
>>>>> at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138)
>>>>> at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114)
>>>>> at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:726)
>>>>> at org.apache.nutch.indexwriter.solr.SolrUtils.getSolrServer(SolrUtils.java:57)
>>>>> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.open(SolrIndexWriter.java:58)
>>>>> at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:75)
>>>>> at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
>>>>> at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
>>>>> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
>>>>>
>>>>> 2016-01-21 14:21:14,927 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
>>>>> 2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
>>>>> 2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.
>>>>>
>>>>> On Thu, Jan 21, 2016 at 9:47 PM, Markus Jelsma
>>>>> <markus.jel...@openindex.io> wrote:
>>>>>
>>>>>> Hi Jason - these are the top-level job logs, but to really know
>>>>>> what's going on, we need the actual reducer task logs.
>>>>>> Markus
>>>>>>
>>>>>> -----Original message-----
>>>>>>> From: Jason S <jason.stu...@gmail.com>
>>>>>>> Sent: Thursday 21st January 2016 20:35
>>>>>>> To: user@nutch.apache.org
>>>>>>> Subject: Nutch 1.11 indexing fails
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am having a problem indexing segments in Nutch 1.11 on Hadoop.
>>>>>>>
>>>>>>> The cluster seems to be configured correctly and every part of
>>>>>>> the crawl process is working flawlessly; however, this is my
>>>>>>> first attempt at Hadoop 2, so perhaps my memory settings aren't
>>>>>>> perfect. I'm also not sure where to look in the log files for
>>>>>>> more information.
>>>>>>>
>>>>>>> The same data can be indexed with Nutch in local mode, so I
>>>>>>> don't think it is a problem with the Solr configuration, and I
>>>>>>> have had Nutch 1.9 with Hadoop 1.2.1 on this same cluster and
>>>>>>> everything worked ok.
>>>>>>>
>>>>>>> Please let me know if I can send more information, I have spent
>>>>>>> several days working on this with no success or clue why it is
>>>>>>> happening.
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>>
>>>>>>> Jason
>>>>>>>
>>>>>>> ### Command ###
>>>>>>>
>>>>>>> /root/hadoop-2.4.0/bin/hadoop jar
>>>>>>> /root/src/apache-nutch-1.11/build/apache-nutch-1.11.job
>>>>>>> org.apache.nutch.indexer.IndexingJob crawl/crawldb -linkdb
>>>>>>> crawl/linkdb crawl/segments/20160121113335
>>>>>>>
>>>>>>> ### Error ###
>>>>>>>
>>>>>>> 16/01/21 14:20:47 INFO mapreduce.Job: map 100% reduce 19%
>>>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: map 100% reduce 26%
>>>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_0, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_0, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_0, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:20:49 INFO mapreduce.Job: map 100% reduce 0%
>>>>>>> 16/01/21 14:20:54 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_0, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:20:55 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_1, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:20:56 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_1, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:00 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_1, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:01 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_1, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:02 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_2, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:07 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000003_0, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:08 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_2, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:08 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_2, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:11 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_2, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:15 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000003_1, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:16 INFO mapreduce.Job: map 100% reduce 100%
>>>>>>> 16/01/21 14:21:16 INFO mapreduce.Job: Job job_1453403905213_0001 failed with state FAILED due to: Task failed task_1453403905213_0001_r_000004
>>>>>>> Job failed as tasks failed. failedMaps:0 failedReduces:1
>>>>>>>
>>>>>>> 16/01/21 14:21:16 INFO mapreduce.Job: Counters: 39
>>>>>>>   File System Counters
>>>>>>>     FILE: Number of bytes read=0
>>>>>>>     FILE: Number of bytes written=5578886
>>>>>>>     FILE: Number of read operations=0
>>>>>>>     FILE: Number of large read operations=0
>>>>>>>     FILE: Number of write operations=0
>>>>>>>     HDFS: Number of bytes read=2277523
>>>>>>>     HDFS: Number of bytes written=0
>>>>>>>     HDFS: Number of read operations=80
>>>>>>>     HDFS: Number of large read operations=0
>>>>>>>     HDFS: Number of write operations=0
>>>>>>>   Job Counters
>>>>>>>     Failed reduce tasks=15
>>>>>>>     Killed reduce tasks=2
>>>>>>>     Launched map tasks=20
>>>>>>>     Launched reduce tasks=17
>>>>>>>     Data-local map tasks=19
>>>>>>>     Rack-local map tasks=1
>>>>>>>     Total time spent by all maps in occupied slots (ms)=334664
>>>>>>>     Total time spent by all reduces in occupied slots (ms)=548199
>>>>>>>     Total time spent by all map tasks (ms)=167332
>>>>>>>     Total time spent by all reduce tasks (ms)=182733
>>>>>>>     Total vcore-seconds taken by all map tasks=167332
>>>>>>>     Total vcore-seconds taken by all reduce tasks=182733
>>>>>>>     Total megabyte-seconds taken by all map tasks=257021952
>>>>>>>     Total megabyte-seconds taken by all reduce tasks=561355776
>>>>>>>   Map-Reduce Framework
>>>>>>>     Map input records=18083
>>>>>>>     Map output records=18083
>>>>>>>     Map output bytes=3140643
>>>>>>>     Map output materialized bytes=3178436
>>>>>>>     Input split bytes=2812
>>>>>>>     Combine input records=0
>>>>>>>     Spilled Records=18083
>>>>>>>     Failed Shuffles=0
>>>>>>>     Merged Map outputs=0
>>>>>>>     GC time elapsed (ms)=1182
>>>>>>>     CPU time spent (ms)=56070
>>>>>>>     Physical memory (bytes) snapshot=6087245824
>>>>>>>     Virtual memory (bytes) snapshot=34655649792
>>>>>>>     Total committed heap usage (bytes)=5412749312
>>>>>>>   File Input Format Counters
>>>>>>>     Bytes Read=2274711
>>>>>>> 16/01/21 14:21:16 ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
>>>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>>>>>>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>>>>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>>>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)