Hi,

Now I'll post all my diffs and exception traces. Note that I didn't use a
common build environment for all the projects because I had problems with
it: after trouble with the default build of Nutch 2.x I built separate
projects for Nutch, Gora, HBase and Avro, and after compiling I simply
copied the jars into the appropriate lib folders. Maybe that approach is
wrong.
Below is a short addition to my first message; more details are in the
attachment.
Note that the initially injected URL is http://www.paulgraham.com/lispart.html

1. After installing the first build of Nutch 2.3 (p. 1 of my first
message: fm1.) with Hadoop 2.4.0 and HBase 0.94.18 prepared as in fm2.,
Nutch hangs. So I put hbase-0.94.18-hadoop-2.4.0.jar (from fm2.3.) into
the Nutch lib instead of hbase-0.94.18.jar.

2. After that the inject phase fails with:

java.lang.Exception: java.lang.IncompatibleClassChangeError: Found
interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was
expected

So I put gora-core-0.5.jar, gora-hbase-0.5.jar and
gora-shims-distribution-0.5.jar, externally compiled against Hadoop 2.4.0,
into the Nutch lib (the jars in the default Maven repository seem to be
compiled against Hadoop 1, where TaskAttemptContext was still a class
rather than an interface, which is exactly what this error complains
about).
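Since most of these failures trace back to mixed Hadoop-1/Hadoop-2 artifacts on the classpath, one quick sanity check is to scan a lib folder for multiple versions of the same artifact. This is a hypothetical stdlib-only helper I sketched for illustration, not part of Nutch; the version-stripping regex is a heuristic:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LibCheck {
    // Strip a trailing version, e.g. "avro-1.7.6.jar" -> "avro" (heuristic).
    static String artifact(String jarName) {
        return jarName.replaceAll("-\\d[\\w.\\-]*\\.jar$", "")
                      .replaceAll("\\.jar$", "");
    }

    public static void main(String[] args) {
        File lib = new File(args.length > 0 ? args[0] : "lib");
        File[] jars = lib.listFiles((dir, name) -> name.endsWith(".jar"));
        if (jars == null) {
            System.err.println("no such directory: " + lib);
            return;
        }
        // Group jar file names by artifact; more than one name per artifact
        // means two versions of the same library sit on the classpath.
        Map<String, List<String>> byArtifact = new TreeMap<>();
        for (File jar : jars) {
            byArtifact.computeIfAbsent(artifact(jar.getName()), k -> new ArrayList<>())
                      .add(jar.getName());
        }
        for (Map.Entry<String, List<String>> e : byArtifact.entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println("multiple versions of " + e.getKey() + ": " + e.getValue());
            }
        }
    }
}
```

For example, a lib folder containing both hbase-0.94.18.jar and hbase-0.94.18-hadoop-2.4.0.jar would be flagged under the "hbase" artifact.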

3. After that the inject phase passes, but the generate phase fails with:

java.lang.Exception: java.io.EOFException
    at ...
Caused by: java.io.EOFException
    at org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473)
...

A number of other exceptions were thrown one after another during my Avro
1.7.6 code modifications.
So finally I changed the Avro code significantly, as shown in the attached
diffs (as in fm4.2. - fm4.4. and more). Probably I'm wrong there.
I put the compiled jars into the Nutch lib. I also tested both scenarios,
with and without the newly compiled Avro 1.7.6 jars in the HBase 0.94.18
lib folder in place of the original avro-1.5.3.jar; the result seems to be
the same.

4. After that the inject, generate, fetch, parse and updatedb phases
pass, but parse finishes with the message "Skipping
http://www.paulgraham.com/lispart.html; not fetched yet" and updatedb with
"Skipping http://www.paulgraham.com/lispart.html; not generated yet". This
seems to be caused by the absence of the previous phase's marker in the
HBase table, as shown in the attached HBase shell output.
So I changed the marker checks in the Nutch code: when a marker is null,
the code now substitutes "1" (which has to equal the batchId); see the
attached diffs.
Of course, this means I'm losing crawling control by batchId.
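In sketch form, the marker workaround amounts to the following. This is a standalone mimic for illustration only: the real code checks Mark constants on a Gora WebPage, which are replaced here by a plain map, and the marker key name is made up:

```java
import java.util.HashMap;
import java.util.Map;

public class MarkerFallback {
    static final String BATCH_ID = "1"; // the batchId hard-coded in the workaround

    // Stand-in for the real marker lookup: the stored marker value, or null.
    static String checkMark(Map<String, String> page, String markKey) {
        return page.get(markKey);
    }

    // The workaround: when the expected marker is missing, pretend it equals
    // the current batchId so the phase does not skip the page.
    static String markOrBatchId(Map<String, String> page, String markKey) {
        String value = checkMark(page, markKey);
        return value != null ? value : BATCH_ID;
    }

    public static void main(String[] args) {
        Map<String, String> page = new HashMap<>(); // a row with no markers at all
        System.out.println(markOrBatchId(page, "gnmrk")); // prints 1
    }
}
```

The downside is visible in the sketch itself: every unmarked row is treated as belonging to the current batch, which is why batchId-based crawl control is lost.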

5. After that the inject, generate, fetch and parse phases pass, but the
updatedb phase fails with:

java.io.EOFException
    at
org.apache.avro.util.ByteBufferInputStream.getBuffer(ByteBufferInputStream.java:86)
    at
org.apache.avro.util.ByteBufferInputStream.read(ByteBufferInputStream.java:48)
...

It seems that a number of new URLs are generated, but the job actually
fails while processing the first page.
Playing with ByteBufferInputStream just produced new exceptions.
So, in order to get a bit further, I simply commented out all processing
of the first page, as shown in the attached diffs. Of course, this loses
a main part of Nutch functionality, such as scoring.
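For what it's worth, the EOFException in this trace is Avro's way of signalling that the underlying buffer list ran out of bytes mid-record, i.e. the serialized data is truncated or mis-framed, rather than the stream class itself being broken. A standalone mimic of that failure mode (a stand-in class I wrote for illustration, not Avro's actual implementation):

```java
import java.io.EOFException;
import java.nio.ByteBuffer;
import java.util.List;

public class BufferListReader {
    private final List<ByteBuffer> buffers;
    private int current = 0;

    public BufferListReader(List<ByteBuffer> buffers) {
        this.buffers = buffers;
    }

    // Return the next buffer that still has bytes; throw EOFException when
    // the list is exhausted -- the same contract the stack trace shows for
    // ByteBufferInputStream.getBuffer.
    public ByteBuffer getBuffer() throws EOFException {
        while (current < buffers.size()) {
            ByteBuffer b = buffers.get(current);
            if (b.hasRemaining()) {
                return b;
            }
            current++;
        }
        throw new EOFException();
    }

    public static void main(String[] args) throws Exception {
        BufferListReader r = new BufferListReader(List.of(ByteBuffer.wrap(new byte[]{42})));
        System.out.println(r.getBuffer().get()); // prints 42
        try {
            r.getBuffer(); // no bytes left: a decoder expecting more fields lands here
        } catch (EOFException e) {
            System.out.println("EOF: record data ran out mid-read");
        }
    }
}
```

Which would suggest the record written to (or read from) HBase is shorter than the schema the decoder expects, consistent with the suspicion below that fields are being read and written incorrectly.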

6. After that all phases pass for one cycle; 55 row(s) generated.

7. On the second cycle the inject phase passes, but the generate phase
fails with:

java.lang.ArrayIndexOutOfBoundsException: 8192
    at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:131)
    at org.apache.avro.io.BinaryDecoder.readIndex(BinaryDecoder.java:431)
 ...

I just repeated the generate phase and it passed.

8. After that all phases pass for the second cycle; 580 row(s)
generated. Further cycles pass too, but I haven't tested them thoroughly
yet.

Clearly this is not normal operation of Nutch 2.3 on Hadoop 2.4.0.

Where am I wrong? Can anybody help me with the right build scripts and
instructions?

BR,
Alex Median


On Fri, Oct 3, 2014 at 2:44 PM, Alex Median <[email protected]> wrote:

>
> Hi,
>
> For a month now I've been installing Nutch 2.3 in this configuration
> (subj).
> Nutch 2, initially with Hadoop 1, was chosen a few months ago, and some
> of the coding is already done.
> We chose Amazon AWS Elastic MapReduce (EMR) as a platform.
> Unfortunately the EMR Hadoop 1 version on an old Debian does not suit us.
> Therefore, we need to set up Nutch 2 in exactly the above
> configuration:
> Hadoop 2.4.0 + HBase 0.94.18 (Amazon Linux, AMI version 3.2.1, Hadoop
> distribution: Amazon 2.4.0, Applications: HBase 0.94.18)
>
> But info about other experiences with Nutch 2 + Hadoop 2 would also be
> good.
>
> What has been done in the last iteration of the installation on a local
> computer:
>
> 1. Nutch 2.x
> 1.1 checked out the current 2.x version from svn
> 1.2. prepared scripts:
> 1.2.1 ivy:
> <dependency org="org.apache.hadoop" name="hadoop-common" rev="2.4.0">..
> <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core"
> rev="2.4.0">..
> <dependency org="org.apache.gora" name="gora" rev="0.5" conf="*->default"
> />
> <dependency org="org.apache.gora" name="gora-hbase" rev="0.5"
> conf="*->default" />
> etc.
> 1.2.2 default.properties:
> hadoop.version=2.4.0
> version=2.3-SNAPSHOT
> etc.
> 1.3. added public int getFieldsCount() { return Field.values().length; }
> to ProtocolStatus.java, ParseStatus.java, Host.java, WebPage.java.
>
> 2. HBase
> 2.1 checked out HBase 0.94.18 from svn
> 2.2 prepared for Protobuf 2.5.0 [1], also thanks to Dobromyslov [5]
> 2.3 also generated hbase-0.94.18-hadoop-2.4.0.jar
>
> 3. Gora 0.5 (also tested versions 0.4, 0.6-SNAPSHOT, and 0.5.3 from
> com.argonio.gora)
>
> 4. Avro 1.7.6 (also played with versions 1.7.4, 1.7.7)
> 4.1 checked out from svn
> 4.2 patched for AVRO-813[2]
> 4.3 patched for AVRO-882 [3] and rolled back
> 4.4 patched as mentioned in [4]: commented out the throwing of
> EOFException in
> org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473), etc.
>
> After investigating numerous exceptions over many weeks, a number of
> changes have been made in the Nutch 2.x and Avro 1.7.6 code to suppress
> exceptions and get a little further. We had some success: Nutch more or
> less runs, but it is unstable and incorrect. All the stages we need pass
> in a cycle (inject, generate, fetch, parse, updatedb), but some
> functionality is broken or ignored.
> It seems that, because of our limited Nutch/Hadoop/HBase experience, we
> broke the normal data exchange between Nutch and HBase (and with Gora
> and Avro). Perhaps some of the fields (and/or some of the data formats)
> are read and written incorrectly. For example, many markers are lost and
> are temporarily emulated in code to get through the steps; data in the
> batchId field is lost; scoring is broken as well.
>
> Please help us! Perhaps the necessary working builds and/or scripts and
> patches exist somewhere, or someone has positive experience with this.
> I'm ready to publish all my diffs and exception traces.
> Also, I would be very grateful if someone could tell me when we can
> expect the new Nutch 2.3 release; it seems it will be Hadoop 2
> compatible.
>
> [1] http://hbase.apache.org/book/configuration.html
> [2] https://issues.apache.org/jira/browse/AVRO-813
> [3] https://issues.apache.org/jira/browse/AVRO-882
> http://mail-archives.apache.org/mod_mbox/avro-user/201108.mbox/%3ccaanh3_9_cqqbmt4vqyzg8-ikfo4nnlpcuzbbwd4kqoavpek...@mail.gmail.com%3E
> [4]
> http://mail-archives.apache.org/mod_mbox/nutch-user/201409.mbox/%3cCAEmTxX9HrRM00SxerFAdRdZy=wVAd9xCchDTuLaxPQ=wi0q...@mail.gmail.com%3e
> [5]
> http://stackoverflow.com/questions/13946725/configuring-hbase-standalone-mode-with-apache-nutch-java-lang-illegalargumente
> https://github.com/dobromyslov
>
> BR,
> Alex Median
>
