Hi Alex,

> But info about other experiences with Nutch2+Hadoop2 would also be good.

I set up Nutch 2.3 + CDH 4.7 (HBase 0.94, Hadoop 2.0 etc) a few months
ago, and it's working fine.

I used the latest code from svn with no modifications, and followed
the tutorial below:
http://wiki.apache.org/nutch/Nutch2Tutorial

HTH,
Kaz

2014-10-03 22:03 GMT+09:00 Alex Median <[email protected]>:
>
> Hi,
>
> For about a month I have been trying to install Nutch 2.3 in the
> configuration named in the subject line.
> We chose Nutch 2 with Hadoop 1 a few months ago, and some of the coding
> is already done.
> We chose Amazon AWS Elastic MapReduce (EMR) as a platform.
> Unfortunately, the EMR Hadoop 1 version, which runs on an old Debian,
> does not suit us.
> Therefore, we need to run Nutch 2 in exactly this configuration:
> Hadoop 2.4.0 + HBase 0.94.18 (Amazon Linux: AMI version:3.2.1, Hadoop
> distribution:Amazon 2.4.0, Applications:HBase 0.94.18)
>
> But info about other experiences with Nutch2+Hadoop2 would also be good.
>
> Here is what was done in the latest iteration of the installation on a
> local computer:
>
> 1. Nutch 2.x
> 1.1 svn current 2.x version
> 1.2. prepared scripts:
> 1.2.1 ivy:
> <dependency org="org.apache.hadoop" name="hadoop-common" rev="2.4.0">..
> <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core"
> rev="2.4.0">..
> <dependency org="org.apache.gora" name="gora" rev="0.5" conf="*->default" />
> <dependency org="org.apache.gora" name="gora-hbase" rev="0.5"
> conf="*->default" />
> etc.
> 1.2.2 default.properties:
> hadoop.version=2.4.0
> version=2.3-SNAPSHOT
> etc.
> 1.3. added public int getFieldsCount() { return Field.values().length; } to
> ProtocolStatus.java, ParseStatus.java, Host.java, WebPage.java.
>
> 2. HBase
> 2.1 svn HBase 0.94.18
> 2.2 prepared for Protobuf 2.5.0 [1], also thanks to Dobromyslov [5]
> 2.3 also generated hbase-0.94.18-hadoop-2.4.0.jar
>
> 3. Gora 0.5 (also tested with versions 0.4, 0.6-SNAPSHOT, and 0.5.3 from
> com.argonio.gora)
>
> 4. Avro 1.7.6 (also played with versions 1.7.4, 1.7.7)
> 4.1 svn
> 4.2 patched for AVRO-813[2]
> 4.3 patched for AVRO-882[3] and rolled back
> 4.4 patched as mentioned in [4] - commented out the throwing of EOFException at
> org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473), etc.
>
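
For readers following step 1.3 above, here is a minimal sketch of what
that added method looks like in context. PersistentSketch and its three
field names are hypothetical stand-ins, not the real Nutch/Gora classes
or schema; only the getFieldsCount() body is taken from the message.

```java
// Hypothetical stand-in for a generated bean such as WebPage or Host.
// The class name and the field names are illustrative assumptions.
class PersistentSketch {
    // Enum listing the bean's persistent fields (assumed shape).
    enum Field {
        BASE_URL("baseUrl"),
        STATUS("status"),
        FETCH_TIME("fetchTime");

        private final String name;

        Field(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    // The method from step 1.3: the number of fields the bean declares.
    public int getFieldsCount() {
        return Field.values().length;
    }

    public static void main(String[] args) {
        System.out.println(new PersistentSketch().getFieldsCount());
    }
}
```

Because the count is derived from the enum, it stays in step with the
field list without further edits.
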
> After many weeks of investigating numerous exceptions, we made a number of
> changes to the Nutch 2.x and Avro 1.7.6 code to suppress the exceptions and
> get a little further. We had some success: Nutch appears to run, but it is
> unstable and produces incorrect results. All the stages we need (inject,
> generate, fetch, parse, updatedb) pass in a cycle, but some functionality
> is broken or ignored.
> It seems that, because of our limited experience with Nutch/Hadoop/HBase,
> we broke the normal data exchange between Nutch and HBase (also with Gora
> and Avro). Perhaps some of the fields (and/or some of the data formats) are
> read and written incorrectly. For example, many markers are lost and are
> temporarily emulated in code to get through the steps; data in the batchId
> field is lost; scoring is also broken.
>
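
One possible reading of the symptoms above: commenting out an
EOFException, as in step 4.4, can let a run continue while handing back
wrong values. The toy decoder below is only a sketch of that general
effect under assumed behaviour. It is not Avro's BinaryDecoder, and the
class name, the strict flag and the describe helper are all made up for
illustration.

```java
import java.io.EOFException;

// Toy fixed-width decoder; NOT Avro's BinaryDecoder, only an illustration.
class BoundsCheckSketch {
    private final byte[] buf;
    private final boolean strict;
    private int pos = 0;

    BoundsCheckSketch(byte[] buf, boolean strict) {
        this.buf = buf;
        this.strict = strict;
    }

    // Reads a big-endian 32-bit int. With strict == true a short buffer
    // raises EOFException; with strict == false (the check "commented
    // out") missing bytes are silently read as zero.
    int readInt() throws EOFException {
        if (strict && buf.length - pos < 4) {
            throw new EOFException("need 4 bytes, have " + (buf.length - pos));
        }
        int value = 0;
        for (int i = 0; i < 4; i++) {
            int b = pos < buf.length ? (buf[pos] & 0xFF) : 0;
            value = (value << 8) | b;
            pos++;
        }
        return value;
    }

    // Helper that turns the outcome into a string for easy comparison.
    static String describe(byte[] data, boolean strict) {
        try {
            return Integer.toString(new BoundsCheckSketch(data, strict).readInt());
        } catch (EOFException e) {
            return "EOFException";
        }
    }

    public static void main(String[] args) {
        byte[] truncated = {0x00, 0x00, 0x01}; // one byte short of a full int
        System.out.println("strict:  " + describe(truncated, true));
        System.out.println("lenient: " + describe(truncated, false));
    }
}
```

In strict mode the truncated input fails loudly; in lenient mode it
returns a plausible-looking number, which is the kind of silent damage
that might show up later as lost markers or an empty batchId.
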
> Please help us! Perhaps working builds and/or scripts and patches already
> exist somewhere. Maybe someone has had a positive experience with this.
> I'm ready to publish all my diffs and exception traces.
> Also, I would be very grateful if someone could tell me when we can expect
> a new Nutch 2.3 release; it seems that it will be Hadoop 2-compatible.
>
> [1] http://hbase.apache.org/book/configuration.html
> [2] https://issues.apache.org/jira/browse/AVRO-813
> [3] https://issues.apache.org/jira/browse/AVRO-882
> http://mail-archives.apache.org/mod_mbox/avro-user/201108.mbox/%3ccaanh3_9_cqqbmt4vqyzg8-ikfo4nnlpcuzbbwd4kqoavpek...@mail.gmail.com%3E
> [4]
> http://mail-archives.apache.org/mod_mbox/nutch-user/201409.mbox/%3cCAEmTxX9HrRM00SxerFAdRdZy=wVAd9xCchDTuLaxPQ=wi0q...@mail.gmail.com%3e
> [5]
> http://stackoverflow.com/questions/13946725/configuring-hbase-standalone-mode-with-apache-nutch-java-lang-illegalargumente
> https://github.com/dobromyslov
>
> BR,
> Alex Median
