Can't run Nutch2 on Hadoop2 (Nutch 2.x + Hadoop 2.4.0 + HBase 0.94.18 + Gora 0.5 + Avro 1.7.6)

Alex Median Fri, 03 Oct 2014 06:04:19 -0700

Hi,

For about a month I have been trying to install Nutch 2.3 in the
configuration given in the subject line.
We chose Nutch 2, initially with Hadoop 1, a few months ago, and some of the
coding is already done.
We chose Amazon AWS Elastic MapReduce (EMR) as a platform.
Unfortunately, the EMR Hadoop 1 version, which runs on an old Debian, does
not suit us.
Therefore, we need to set up Nutch 2 in exactly the above configuration:
Hadoop 2.4.0 + HBase 0.94.18 (Amazon Linux: AMI version: 3.2.1, Hadoop
distribution: Amazon 2.4.0, Applications: HBase 0.94.18).


Information about other experiences with Nutch 2 + Hadoop 2 would also be
welcome.

Here is what was done in the latest iteration of the installation on a local
computer:

1. Nutch 2.x
1.1 checked out the current 2.x version from svn
1.2 prepared scripts:
1.2.1 ivy:
<dependency org="org.apache.hadoop" name="hadoop-common" rev="2.4.0">..
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core"
rev="2.4.0">..
<dependency org="org.apache.gora" name="gora" rev="0.5" conf="*->default" />
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5"
conf="*->default" />
etc.
1.2.2 default.properties:
hadoop.version=2.4.0
version=2.3-SNAPSHOT
etc.
1.3. added public int getFieldsCount() { return Field.values().length; } to
ProtocolStatus.java, ParseStatus.java, Host.java, WebPage.java.

2. HBase
2.1 checked out HBase 0.94.18 from svn
2.2 prepared it for Protobuf 2.5.0 [1], also thanks to Dobromyslov [5]
2.3 also built hbase-0.94.18-hadoop-2.4.0.jar

3. Gora 0.5 (versions 0.4, 0.6-SNAPSHOT, and 0.5.3 from com.argonio.gora
were also tested)

4. Avro 1.7.6 (also tried versions 1.7.4 and 1.7.7)
4.1 checked out from svn
4.2 patched for AVRO-813 [2]
4.3 patched for AVRO-882 [3], then rolled back
4.4 patched as mentioned in [4]: commented out the code that throws
EOFException in
org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473), etc.
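
For illustration, the accessor added in step 1.3 amounts to roughly the
following minimal sketch. The class name WebPageSketch and its four Field
constants are hypothetical stand-ins; the real ProtocolStatus, ParseStatus,
Host and WebPage classes are far larger, and only the one-line method itself
is taken from the step above.

```java
// Minimal sketch of the accessor described in step 1.3.
// WebPageSketch and its Field constants are hypothetical stand-ins,
// not the real Nutch classes.
class WebPageSketch {

    // Hypothetical subset of fields; a real class would list every schema field.
    enum Field {
        BASE_URL, STATUS, FETCH_TIME, BATCH_ID
    }

    // The method from step 1.3: report how many fields the enum declares.
    public int getFieldsCount() {
        return Field.values().length;
    }

    public static void main(String[] args) {
        WebPageSketch page = new WebPageSketch();
        System.out.println("fields: " + page.getFieldsCount());
    }
}
```

Because the count is derived from Field.values(), it follows the enum
automatically if fields are added or removed.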

After investigating numerous exceptions over many weeks, we made a number of
changes to the Nutch 2.x and Avro 1.7.6 code to suppress exceptions and get
a little further. We had some success: Nutch appears to run, but it is
unstable and produces incorrect results. All the stages we need (inject,
generate, fetch, parse, updatedb) pass in a cycle, but some functionality is
broken or bypassed.
It seems that, because of our limited Nutch/Hadoop/HBase experience, we broke
the normal data exchange between Nutch and HBase (which also involves Gora
and Avro). Perhaps some of the fields (and/or some of the data formats) are
read and written incorrectly. For example, many markers are lost and are
temporarily emulated in code to get through the steps; data in the batchId
field is lost; scoring is also broken.

Please help us! Perhaps working builds and/or scripts and patches exist
somewhere. Maybe someone has had a positive experience with this. I'm ready
to publish all my diffs and exception traces.
Also, I would be very grateful if someone could tell me when we can expect
the new Nutch 2.3 release; it seems that it will be Hadoop 2-compatible.

[1] http://hbase.apache.org/book/configuration.html
[2] https://issues.apache.org/jira/browse/AVRO-813
[3] https://issues.apache.org/jira/browse/AVRO-882
    http://mail-archives.apache.org/mod_mbox/avro-user/201108.mbox/%3ccaanh3_9_cqqbmt4vqyzg8-ikfo4nnlpcuzbbwd4kqoavpek...@mail.gmail.com%3E
[4] http://mail-archives.apache.org/mod_mbox/nutch-user/201409.mbox/%3cCAEmTxX9HrRM00SxerFAdRdZy=wVAd9xCchDTuLaxPQ=wi0q...@mail.gmail.com%3e
[5] http://stackoverflow.com/questions/13946725/configuring-hbase-standalone-mode-with-apache-nutch-java-lang-illegalargumente
    https://github.com/dobromyslov

BR,
Alex Median
