Hi Jason,

sorry, that was a misunderstanding: the patch of NUTCH-2191 will not fix
your problem. But Markus mentioned in the discussion that he had to
remove the http* jars to fix dependency problems. What I want to say is
that our plugin system does not provide complete isolation, although
every plugin has its own class loader.

However, your problem seems really weird. I had a look into the code of
httpcore where the exception is raised:

https://github.com/apache/httpcore/blob/4.3.x/httpcore/src/main/java/org/apache/http/impl/io/DefaultHttpRequestWriterFactory.java#L52

The field INSTANCE is referenced and should be defined here:

https://github.com/apache/httpcore/blob/4.3.x/httpcore/src/main/java/org/apache/http/message/BasicLineFormatter.java#L65

Older versions (4.2.x) are missing this field:

https://github.com/apache/httpcore/blob/4.2.x/httpcore/src/main/java/org/apache/http/message/BasicLineFormatter.java

It's the same library (httpcore)! It seems hardly possible that two
class files are taken from different versions of the same library.
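If you want to double-check, here is a little sketch (untested, just an
illustration; compile it against httpcore 4.3.x and run it with the same
class path as the failing task) that prints which jar each of the two
classes is actually loaded from and whether the INSTANCE field is
visible at runtime:

  import java.lang.reflect.Field;

  public class CheckHttpcore {
    public static void main(String[] args) throws Exception {
      Class<?> blf = org.apache.http.message.BasicLineFormatter.class;
      Class<?> fac = org.apache.http.impl.io.DefaultHttpRequestWriterFactory.class;
      // Which jar was each class loaded from?
      // (getCodeSource() can be null for bootstrap classes, not here)
      System.out.println(blf.getProtectionDomain().getCodeSource().getLocation());
      System.out.println(fac.getProtectionDomain().getCodeSource().getLocation());
      // Succeeds on httpcore >= 4.3.x, throws NoSuchFieldException on 4.2.x:
      Field f = blf.getField("INSTANCE");
      System.out.println("INSTANCE found: " + f);
    }
  }

If the two printed locations differ, or getField() throws, you really do
have mixed httpcore versions on the class path.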
> I have added -verbose:class to mapred.child.java.opts, but I don't see
> any difference in the output, I am uploading another zip of the log

Ok. Sorry, I still have to find out myself how to set -verbose:class in
(pseudo-)distributed mode. Does anyone know how to do this?
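A guess, based on the Hadoop documentation (I haven't verified this in
distributed mode): on Hadoop 2.x the child JVM options are split per
task type, so something like the following in mapred-site.xml might
work (keep whatever heap settings you already have in the value):

  <!-- untested sketch: pass -verbose:class to the map and reduce child JVMs -->
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024m -verbose:class</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx1024m -verbose:class</value>
  </property>

Note that -verbose:class writes to the JVM's stdout, so the output
should end up in the task's stdout container log rather than in the
syslog file - maybe that's why you don't see any difference there.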
> In the past, I just copied nutch-1.9/lib to hadoop-1.2.1/lib, and if
> there was a conflict, I kept the version of the file distributed with
> Nutch. Now the Nutch and Hadoop file structures are vastly different,
> so I don't understand, is this a problem with my configuration or with
> Nutch?

That's not necessary. Everything needed to run the Nutch jobs is
contained in apache-nutch-1.11.job. Besides, since you are using Hadoop
2.7.1, hadoop-1.2.1/lib or jars from there shouldn't be on the class
path anyway. But it may be a good idea to make sure the class path
isn't tainted.
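One way to check for a tainted class path (again, just a sketch): ask
the class loader for every copy of the class it can see - if more than
one URL is printed, two versions of httpcore are competing:

  import java.net.URL;
  import java.util.Enumeration;

  public class FindDuplicates {
    public static void main(String[] args) throws Exception {
      // List every copy of BasicLineFormatter visible to the class loader.
      Enumeration<URL> urls = FindDuplicates.class.getClassLoader()
          .getResources("org/apache/http/message/BasicLineFormatter.class");
      while (urls.hasMoreElements()) {
        System.out.println(urls.nextElement());
      }
    }
  }

You could run this on each node with the same class path as the Hadoop
tasks to spot stray httpcore jars.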
Cheers,
Sebastian

On 01/24/2016 01:29 AM, Jason S wrote:
> Hi Sebastian,
>
> I had a look at NUTCH-2191 and the suggestions in there didn't help
> with this issue.
>
> When I apply the patch, I get a build error in 1.11 and trunk:
>
> BUILD FAILED
> /root/src/nutch-trunk/build.xml:116: The following error occurred while
> executing this line:
> /root/src/nutch-trunk/src/plugin/build.xml:54: The following error
> occurred while executing this line:
> /root/src/nutch-trunk/src/plugin/protocol-htmlunit/build.xml:39:
> /root/src/nutch-trunk/src/plugin/protocol-htmlunit/src/test does not
> exist.
>
> I'm not sure where to find the protocol-htmlunit plugin.
>
> Also, removing the http*.jar, jersey*.jar and jetty*.jar as suggested
> doesn't work. I just keep getting the same error as above.
>
> I have added -verbose:class to mapred.child.java.opts, but I don't see
> any difference in the output, I am uploading another zip of the log
> directories. The logs are here:
> https://s3.amazonaws.com/nutch-hadoop-error/hadoop-nutch-error2.tgz
>
> I have searched my system, and I don't find any of the http*.jar files
> in hadoop, although one of them is in /usr/share/java, but deleting it
> doesn't seem to make a difference.
>
> In the past, I just copied nutch-1.9/lib to hadoop-1.2.1/lib, and if
> there was a conflict, I kept the version of the file distributed with
> Nutch. Now the Nutch and Hadoop file structures are vastly different,
> so I don't understand, is this a problem with my configuration or with
> Nutch?
>
> Thanks,
>
> Jason
>
> On Sat, Jan 23, 2016 at 10:05 PM, Sebastian Nagel
> <wastl.na...@googlemail.com> wrote:
>
>> Hi Jason,
>>
>> this looks like a library dependency version conflict, probably
>> between httpcore and httpclient. The classes on top of the stack
>> belong to these libs:
>> org.apache.http.impl.io.DefaultHttpRequestWriterFactory -> httpcore
>> org.apache.http.impl.conn.ManagedHttpClientConnectionFactory ->
>> httpclient
>>
>> You mentioned that indexing to Solr works in local mode.
>> Is it possible that the mapreduce tasks get a wrong httpcore (or
>> httpclient) lib? They should use those from the apache-nutch-1.11.job,
>> from classes/plugins/indexer-solr/ strictly speaking.
>>
>> We know that there are problems because the plugin class loader
>> asks its parent first, see [1] for the most recent discussion.
>>
>> Can you try to add -verbose:class so that you can see in the logs
>> from which jar the classes are loaded? Sorry, I didn't try this in
>> (pseudo-)distributed mode yet. According to the documentation
>> it should be possible to set this option in "mapred.child.java.opts"
>> in your mapred-site.xml (check also the other *.java.opts properties).
>>
>> Cheers,
>> Sebastian
>>
>> [1] https://issues.apache.org/jira/browse/NUTCH-2191
>>
>> On 01/23/2016 04:09 PM, Jason S wrote:
>>> I'm not sure if it is ok to attach files to a list email, if anyone
>>> wants to look at some log files, they're here:
>>>
>>> https://s3.amazonaws.com/nutch-hadoop-error/hadoop-nutch-error.tgz
>>>
>>> This crawl was done on Ubuntu 15.10 and OpenJDK 8, however, I have
>>> also had the error with Ubuntu 14, OpenJDK 7 and Oracle JDK 7, and
>>> with Hadoop in single server mode and on a cluster with a master and
>>> 5 slaves.
>>>
>>> This crawl had minimal changes made to the config files, only
>>> http.agent.name and solr.server.url were changed. Nutch was built
>>> with ant, "ant clean runtime".
>>>
>>> The entire log directory with a full
>>> inject/generate/fetch/parse/updatedb/index cycle is in there. As
>>> indicated in my previous messages, everything works fine until the
>>> indexer, and the same data indexes fine in local mode.
>>>
>>> Thanks in advance,
>>>
>>> Jason
>>>
>>> On Sat, Jan 23, 2016 at 11:43 AM, Jason S <jason.stu...@gmail.com>
>>> wrote:
>>>
>>>> Bump.
>>>>
>>>> Is there anyone who can help me with this?
>>>>
>>>> I'm not familiar enough with the Nutch source code to label this as
>>>> a bug, but it seems to be the case, unless I have made some mistake
>>>> being new to Hadoop 2. I have been running Nutch on Hadoop 1.x for
>>>> years and never had any problems like this. Have I overlooked
>>>> something in my setup?
>>>>
>>>> I believe the error I posted is the one causing the indexing job to
>>>> fail, and I can confirm quite a few things that are not causing the
>>>> problem:
>>>>
>>>> -- I have used Nutch with minimal changes to the default configs,
>>>> and Solr with exactly the unmodified schema and solrindex-mapping
>>>> files provided in the config.
>>>>
>>>> -- The same error occurs on Hadoop 2.4.0, 2.4.1 and 2.7.1.
>>>>
>>>> -- Solr 4.10.2 and Solr 4.10.4 make no difference.
>>>>
>>>> -- Building Nutch and Solr with OpenJDK or Oracle JDK makes no
>>>> difference.
>>>>
>>>> It seems like Nutch/Hadoop never connects to Solr before it fails;
>>>> Solr logging in verbose mode creates 0 lines of output when the
>>>> indexer job runs on Hadoop.
>>>>
>>>> All data/settings/everything the same works fine in local mode.
>>>>
>>>> Short of dumping segments to local mode and indexing them that way,
>>>> or trying another indexer, I'm baffled.
>>>>
>>>> Many thanks if someone could help me out.
>>>>
>>>> Jason
>>>>
>>>> On Thu, Jan 21, 2016 at 10:29 PM, Jason S <jason.stu...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Markus,
>>>>>
>>>>> I guess that is part of my question: from the data in the
>>>>> top-level logs, how can I tell where to look? I have spent a
>>>>> couple of days trying to understand Hadoop 2 logging, and I'm
>>>>> still not really very sure.
>>>>>
>>>>> For example, I found this error here:
>>>>>
>>>>> ~/hadoop-2.4.0/logs/userlogs/application_1453403905213_0001/container_1453403905213_0001_01_000041/syslog
>>>>>
>>>>> At first I thought the "no such field" error meant I was trying to
>>>>> put data in Solr where the field didn't exist in the schema, but
>>>>> the same data indexes fine in local mode. Also, there are no
>>>>> errors in the Solr logs.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jason
>>>>>
>>>>> ### syslog error ###
>>>>>
>>>>> 2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
>>>>> 2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2016-01-21 14:21:14,637 INFO [main] org.apache.nutch.indexer.anchor.AnchorIndexingFilter: Anchor deduplication is: on
>>>>> 2016-01-21 14:21:14,668 INFO [main] org.apache.nutch.indexer.IndexWriters: Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>> 2016-01-21 14:21:14,916 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: INSTANCE
>>>>> at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:52)
>>>>> at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:56)
>>>>> at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<clinit>(DefaultHttpRequestWriterFactory.java:46)
>>>>> at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:72)
>>>>> at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:84)
>>>>> at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<clinit>(ManagedHttpClientConnectionFactory.java:59)
>>>>> at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.<init>(PoolingHttpClientConnectionManager.java:493)
>>>>> at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
>>>>> at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138)
>>>>> at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114)
>>>>> at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:726)
>>>>> at org.apache.nutch.indexwriter.solr.SolrUtils.getSolrServer(SolrUtils.java:57)
>>>>> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.open(SolrIndexWriter.java:58)
>>>>> at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:75)
>>>>> at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
>>>>> at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
>>>>> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
>>>>>
>>>>> 2016-01-21 14:21:14,927 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
>>>>> 2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
>>>>> 2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.
>>>>>
>>>>> On Thu, Jan 21, 2016 at 9:47 PM, Markus Jelsma
>>>>> <markus.jel...@openindex.io> wrote:
>>>>>
>>>>>> Hi Jason - these are the top-level job logs, but to really know
>>>>>> what's going on, we need the actual reducer task logs.
>>>>>> Markus
>>>>>>
>>>>>> -----Original message-----
>>>>>>> From: Jason S <jason.stu...@gmail.com>
>>>>>>> Sent: Thursday 21st January 2016 20:35
>>>>>>> To: user@nutch.apache.org
>>>>>>> Subject: Nutch 1.11 indexing fails
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am having a problem indexing segments in Nutch 1.11 on Hadoop.
>>>>>>>
>>>>>>> The cluster seems to be configured correctly and every part of
>>>>>>> the crawl process is working flawlessly; however, this is my
>>>>>>> first attempt at Hadoop 2, so perhaps my memory settings aren't
>>>>>>> perfect. I'm also not sure where to look in the log files for
>>>>>>> more information.
>>>>>>>
>>>>>>> The same data can be indexed with Nutch in local mode, so I
>>>>>>> don't think it is a problem with the Solr configuration, and I
>>>>>>> have had Nutch 1.9 with Hadoop 1.2.1 on this same cluster and
>>>>>>> everything worked ok.
>>>>>>>
>>>>>>> Please let me know if I can send more information, I have spent
>>>>>>> several days working on this with no success or clue why it is
>>>>>>> happening.
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>>
>>>>>>> Jason
>>>>>>>
>>>>>>> ### Command ###
>>>>>>>
>>>>>>> /root/hadoop-2.4.0/bin/hadoop jar
>>>>>>> /root/src/apache-nutch-1.11/build/apache-nutch-1.11.job
>>>>>>> org.apache.nutch.indexer.IndexingJob crawl/crawldb -linkdb
>>>>>>> crawl/linkdb crawl/segments/20160121113335
>>>>>>>
>>>>>>> ### Error ###
>>>>>>>
>>>>>>> 16/01/21 14:20:47 INFO mapreduce.Job: map 100% reduce 19%
>>>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: map 100% reduce 26%
>>>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_0, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_0, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_0, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:20:49 INFO mapreduce.Job: map 100% reduce 0%
>>>>>>> 16/01/21 14:20:54 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_0, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:20:55 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_1, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:20:56 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_1, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:00 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_1, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:01 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_1, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:02 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_2, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:07 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000003_0, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:08 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_2, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:08 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_2, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:11 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_2, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:15 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000003_1, Status : FAILED
>>>>>>> Error: INSTANCE
>>>>>>> 16/01/21 14:21:16 INFO mapreduce.Job: map 100% reduce 100%
>>>>>>> 16/01/21 14:21:16 INFO mapreduce.Job: Job job_1453403905213_0001 failed with state FAILED due to: Task failed task_1453403905213_0001_r_000004
>>>>>>> Job failed as tasks failed. failedMaps:0 failedReduces:1
>>>>>>>
>>>>>>> 16/01/21 14:21:16 INFO mapreduce.Job: Counters: 39
>>>>>>>   File System Counters
>>>>>>>     FILE: Number of bytes read=0
>>>>>>>     FILE: Number of bytes written=5578886
>>>>>>>     FILE: Number of read operations=0
>>>>>>>     FILE: Number of large read operations=0
>>>>>>>     FILE: Number of write operations=0
>>>>>>>     HDFS: Number of bytes read=2277523
>>>>>>>     HDFS: Number of bytes written=0
>>>>>>>     HDFS: Number of read operations=80
>>>>>>>     HDFS: Number of large read operations=0
>>>>>>>     HDFS: Number of write operations=0
>>>>>>>   Job Counters
>>>>>>>     Failed reduce tasks=15
>>>>>>>     Killed reduce tasks=2
>>>>>>>     Launched map tasks=20
>>>>>>>     Launched reduce tasks=17
>>>>>>>     Data-local map tasks=19
>>>>>>>     Rack-local map tasks=1
>>>>>>>     Total time spent by all maps in occupied slots (ms)=334664
>>>>>>>     Total time spent by all reduces in occupied slots (ms)=548199
>>>>>>>     Total time spent by all map tasks (ms)=167332
>>>>>>>     Total time spent by all reduce tasks (ms)=182733
>>>>>>>     Total vcore-seconds taken by all map tasks=167332
>>>>>>>     Total vcore-seconds taken by all reduce tasks=182733
>>>>>>>     Total megabyte-seconds taken by all map tasks=257021952
>>>>>>>     Total megabyte-seconds taken by all reduce tasks=561355776
>>>>>>>   Map-Reduce Framework
>>>>>>>     Map input records=18083
>>>>>>>     Map output records=18083
>>>>>>>     Map output bytes=3140643
>>>>>>>     Map output materialized bytes=3178436
>>>>>>>     Input split bytes=2812
>>>>>>>     Combine input records=0
>>>>>>>     Spilled Records=18083
>>>>>>>     Failed Shuffles=0
>>>>>>>     Merged Map outputs=0
>>>>>>>     GC time elapsed (ms)=1182
>>>>>>>     CPU time spent (ms)=56070
>>>>>>>     Physical memory (bytes) snapshot=6087245824
>>>>>>>     Virtual memory (bytes) snapshot=34655649792
>>>>>>>     Total committed heap usage (bytes)=5412749312
>>>>>>>   File Input Format Counters
>>>>>>>     Bytes Read=2274711
>>>>>>> 16/01/21 14:21:16 ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
>>>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>>>>>>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>>>>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>>>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)