Hi,

For about a month I have been trying to install Nutch 2.3 in this configuration (subj). We chose Nutch 2, initially with Hadoop 1, a few months ago, and some of the coding is already done. We chose Amazon AWS Elastic MapReduce (EMR) as the platform. Unfortunately, the EMR Hadoop 1 version, which runs on an old Debian, does not suit us. Therefore we need to get Nutch 2 working in exactly the above configuration: Hadoop 2.4.0 + HBase 0.94.18 (Amazon Linux: AMI version: 3.2.1, Hadoop distribution: Amazon 2.4.0, Applications: HBase 0.94.18).
Information about other experiences with Nutch 2 + Hadoop 2 would also be welcome.

What has been done for the latest iteration of the installation on a local computer:

1. Nutch 2.x
   1.1 Checked out the current 2.x version from svn
   1.2 Prepared the build scripts:
       1.2.1 ivy:
             <dependency org="org.apache.hadoop" name="hadoop-common" rev="2.4.0">..
             <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.4.0">..
             <dependency org="org.apache.gora" name="gora" rev="0.5" conf="*->default" />
             <dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
             etc.
       1.2.2 default.properties:
             hadoop.version=2.4.0
             version=2.3-SNAPSHOT
             etc.
   1.3 Added
           public int getFieldsCount() { return Field.values().length; }
       to ProtocolStatus.java, ParseStatus.java, Host.java, WebPage.java.

2. HBase
   2.1 Checked out HBase 0.94.18 from svn
   2.2 Prepared it for Protobuf 2.5.0 [1], also thanks to Dobromyslov [5]
   2.3 Also generated hbase-0.94.18-hadoop-2.4.0.jar

3. Gora 0.5 (also tested with versions 0.4, 0.6-SNAPSHOT, and 0.5.3 from com.argonio.gora)

4. Avro 1.7.6 (also tried versions 1.7.4 and 1.7.7)
   4.1 Checked out from svn
   4.2 Patched for AVRO-813 [2]
   4.3 Patched for AVRO-882 [3], then rolled back
   4.4 Patched as mentioned in [4]: commented out the throwing of EOFException to get past org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473), etc.

After many weeks of investigating numerous exceptions, we made a number of changes to the Nutch 2.x and Avro 1.7.6 code to suppress the exceptions and get a little further. We had some success: Nutch more or less runs, but it is unstable and behaves incorrectly. All the stages we need (inject, generate, fetch, parse, updatedb) complete in a cycle, but some functionality is broken and is being ignored for now. It seems that, because of our limited experience with Nutch/Hadoop/HBase, we broke the normal data exchange between Nutch and HBase (and also with Gora and Avro). Perhaps some of the fields (and/or some of the data formats) are read and written incorrectly.
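
To make step 1.3 concrete, here is a minimal, self-contained sketch of the pattern. `RecordBase` and `PageSketch` are hypothetical stand-ins, not the real Gora or Nutch classes, and the three enum constants are illustrative only; the body of `getFieldsCount()` is the only part taken from the step above.

```java
public class FieldsCountSketch {

    /** Hypothetical stand-in for a base class that demands a field count. */
    abstract static class RecordBase {
        public abstract int getFieldsCount();
    }

    /** Hypothetical stand-in for a data bean such as WebPage. */
    static class PageSketch extends RecordBase {

        // Illustrative field list; the real beans have many more fields.
        enum Field { BASE_URL, STATUS, BATCH_ID }

        // Same one-liner as in step 1.3: the count is derived from the enum,
        // so it follows the enum automatically when fields are added or removed.
        @Override
        public int getFieldsCount() {
            return Field.values().length;
        }
    }

    public static void main(String[] args) {
        // Prints 3, the number of constants in PageSketch.Field.
        System.out.println(new PageSketch().getFieldsCount());
    }
}
```

Whether a count derived this way is what the serialization layer actually expects for these classes is an open question, not something this sketch establishes.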
The symptoms so far: many markers are lost and are temporarily emulated in code so that the steps can proceed; data in the batchId field is lost; scoring is also broken.

Please help us! Perhaps working builds and/or scripts and patches already exist somewhere. Maybe someone has had a positive experience with this. I'm ready to publish all my diffs and exception traces.

Also, I would be very grateful if someone could tell me when we can expect the Nutch 2.3 release; it seems that it will be Hadoop 2 compatible.

[1] http://hbase.apache.org/book/configuration.html
[2] https://issues.apache.org/jira/browse/AVRO-813
[3] https://issues.apache.org/jira/browse/AVRO-882
    http://mail-archives.apache.org/mod_mbox/avro-user/201108.mbox/%3ccaanh3_9_cqqbmt4vqyzg8-ikfo4nnlpcuzbbwd4kqoavpek...@mail.gmail.com%3E
[4] http://mail-archives.apache.org/mod_mbox/nutch-user/201409.mbox/%3cCAEmTxX9HrRM00SxerFAdRdZy=wVAd9xCchDTuLaxPQ=wi0q...@mail.gmail.com%3e
[5] http://stackoverflow.com/questions/13946725/configuring-hbase-standalone-mode-with-apache-nutch-java-lang-illegalargumente
    https://github.com/dobromyslov

BR,
Alex Median