Hi Alex,

> But info about another experiences with Nutch2+hadoop2 will also good..

I set up Nutch 2.3 + CDH 4.7 (HBase 0.94, Hadoop 2.0 etc) a few months ago,
and it's working fine. I used the latest code from svn with no modifications,
and followed the tutorial below:

http://wiki.apache.org/nutch/Nutch2Tutorial

HTH,
Kaz

2014-10-03 22:03 GMT+09:00 Alex Median <[email protected]>:
>
> Hi,
>
> Within a month I'm in the process of installing Nutch 2.3 in this
> configuration (subj).
> Nutch 2 initially with Hadoop 1 was chosen a few months ago, some of the
> coding is already done.
> We chose Amazon AWS Elastic MapReduce (EMR) as a platform.
> Unfortunately EMR Hadoop 1 version on an old Debian does not suit us.
> Therefore, we need to establish exactly Nutch 2 in the above configuration:
> Hadoop 2.4.0 + HBase 0.94.18 (Amazon Linux: AMI version:3.2.1, Hadoop
> distribution:Amazon 2.4.0, Applications:HBase 0.94.18)
>
> But info about another experiences with Nutch2+hadoop2 will also good..
>
> What has been done for the last iteration of the installation on local
> computer:
>
> 1. Nutch 2.x
> 1.1 svn current 2.x version
> 1.2. prepared scripts:
> 1.2.1 ivy:
> <dependency org="org.apache.hadoop" name="hadoop-common" rev="2.4.0">..
> <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core"
> rev="2.4.0">..
> <dependency org="org.apache.gora" name="gora" rev="0.5" conf="*->default" />
> <dependency org="org.apache.gora" name="gora-hbase" rev="0.5"
> conf="*->default" />
> etc.
> 1.2.2 default.properties:
> hadoop.version=2.4.0
> version=2.3-SNAPSHOT
> etc.
> 1.3. added public int getFieldsCount() { return Field.values().length; } to
> ProtocolStatus.java, ParseStatus.java, Host.java, WebPage.java.
>
> 2. HBase
> 2.1 svn HBase 0.94.18
> 2.2 prepared for Protobuf 2.5.0 [1], also thanks to Dobromyslov [5]
> 2.3 also generated hbase-0.94.18-hadoop-2.4.0.jar
>
> 3. Gora 0.5 (also was tested for versions 0.4, 0.6-SNAPSHOT, and 0.5.3 from
> com.argonio.gora)
>
> 4. Avro 1.7.6 (also played with versions 1.7.4, 1.7.7)
> 4.1 svn
> 4.2 patched for AVRO-813[2]
> 4.3 patched for AVRO-882[3] and rollbacked
> 4.4 patched as mentioned in [4] - commented throwing EOFException against
> org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473), etc.
>
> After investigating numerous exceptions in many weeks, a number of changes
> have been made in the code Nutch 2.x and Avro 1.7.6 to suppress
> exceptions and walk a little further. We got some success, Nutch looks like
> a bit of running, but is unstable and incorrect. All necessary (for us)
> stages pass in cycle (inject, generate, fetch, parse, updatedb). But some
> functionalities are broken and ignored.
> It seems that because of the poor Nutch/Hadoop/HBase experience, we broke
> the normal data exchange between Nutch and HBase (also with gora and avro).
> Perhaps some of the fields (and/or some of the data formats) read and write
> incorrectly. For example, many markers are lost and temporary emulated in
> code to pass through the steps; data in batchId field are lost; scoring is
> broken also.
>
> Please help us! Perhaps there are somewhere the necessary working
> assemblies and/or scripts and patches. Maybe someone has a positive
> experience in this. I'm ready to publish all my diffs and exception traces.
> Also, I would be very grateful if someone would tell me when we can get a
> new of Nutch 2.3 release; it seems that it will be Hadoop2-compatible.
>
> [1] http://hbase.apache.org/book/configuration.html
> [2] https://issues.apache.org/jira/browse/AVRO-813
> [3] https://issues.apache.org/jira/browse/AVRO-882
> http://mail-archives.apache.org/mod_mbox/avro-user/201108.mbox/%3ccaanh3_9_cqqbmt4vqyzg8-ikfo4nnlpcuzbbwd4kqoavpek...@mail.gmail.com%3E
> [4]
> http://mail-archives.apache.org/mod_mbox/nutch-user/201409.mbox/%3cCAEmTxX9HrRM00SxerFAdRdZy=wVAd9xCchDTuLaxPQ=wi0q...@mail.gmail.com%3e
> [5]
> http://stackoverflow.com/questions/13946725/configuring-hbase-standalone-mode-with-apache-nutch-java-lang-illegalargumente
> https://github.com/dobromyslov
>
> BR,
> Alex Median
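
For readers following step 1.3 of the quoted message, here is a minimal sketch of what that one-line addition does. It is only an illustration: the class name FieldsCountSketch and the four Field entries are hypothetical stand-ins, not the real Nutch code. The quoted snippet implies that ProtocolStatus, ParseStatus, Host and WebPage each have a Field enum in scope, and the real entries will differ.

```java
// Hypothetical stand-in for a generated bean such as WebPage; NOT the real Nutch class.
public class FieldsCountSketch {

    // Hypothetical field list; the real classes declare their own Field enum.
    enum Field {
        BASE_URL, STATUS, FETCH_TIME, BATCH_ID
    }

    // The method quoted in step 1.3: the number of fields equals the number of enum constants.
    public int getFieldsCount() {
        return Field.values().length;
    }

    public static void main(String[] args) {
        FieldsCountSketch bean = new FieldsCountSketch();
        System.out.println("fields: " + bean.getFieldsCount());
    }
}
```

Since the count comes from the enum, it only matches the stored record if the enum itself matches the schema the class was generated from.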

