Sergey Weiss created GORA-392:
---------------------------------

             Summary: Move PersistentSerialization to the top of serializations list
                 Key: GORA-392
                 URL: https://issues.apache.org/jira/browse/GORA-392
             Project: Apache Gora
          Issue Type: Improvement
          Components: gora-core
    Affects Versions: 0.5
            Reporter: Sergey Weiss


While getting Nutch2 to run on Hadoop 2.3.0 + HBase 0.98.1, we encountered 
java.io.EOFExceptions like the ones described in this mail thread: 
http://www.mail-archive.com/user%40nutch.apache.org/msg12644.html
We applied the patch mentioned there and got our setup running, but it was very 
unstable: it failed with an ArrayIndexOutOfBounds exception whenever we tried 
to generate a batch of some 50 or more pages to fetch.

We investigated the problem and discovered that in a working setup of Nutch2 + 
Hadoop 1.2.0 + HBase 0.94.14, PersistentDeserializer is used for 
deserialization during the reduce phase, not 
AvroSerialization.AvroDeserializer. The reason for this swap of 
deserializers lies in the GoraMapReduceUtils#setIOSerializations method. It uses 
StringUtils.joinStringArrays, which uses a HashSet under the hood, so the order 
of the resulting list is not guaranteed. Two more serializations were added to 
the io.serializations property in Hadoop 2.3.0 compared to Hadoop 1.2.0, and 
this results in AvroSpecificSerialization being placed at the top of the 
serializations list.
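
To illustrate the ordering issue, here is a minimal sketch, not the actual 
Gora code. The class JoinOrderSketch and its two helper methods are 
hypothetical stand-ins for a HashSet-based join, and the class names are used 
only as plain strings; the exact default contents of io.serializations are an 
assumption. A HashSet makes no promise about iteration order, so adding 
entries can change which serialization ends up first, whereas a LinkedHashSet 
keeps insertion order.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class JoinOrderSketch {

    // Hypothetical stand-in for a HashSet-based join: duplicates are dropped,
    // but the resulting order depends on hashing, not on insertion order.
    static String[] joinUnordered(String[] first, String[] second) {
        Set<String> merged = new HashSet<>();
        merged.addAll(Arrays.asList(first));
        merged.addAll(Arrays.asList(second));
        return merged.toArray(new String[0]);
    }

    // Same merge backed by a LinkedHashSet: duplicates are dropped and
    // entries stay in the order in which they were first added.
    static String[] joinOrdered(String[] first, String[] second) {
        Set<String> merged = new LinkedHashSet<>();
        merged.addAll(Arrays.asList(first));
        merged.addAll(Arrays.asList(second));
        return merged.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Serializations Gora wants to register (names used as plain strings).
        String[] gora = {
            "org.apache.gora.mapreduce.PersistentSerialization",
            "org.apache.gora.mapreduce.StringSerialization"
        };
        // Assumed pre-existing value of io.serializations.
        String[] existing = {
            "org.apache.hadoop.io.serializer.WritableSerialization",
            "org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization",
            "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization"
        };

        System.out.println("HashSet order:       "
            + Arrays.toString(joinUnordered(gora, existing)));
        System.out.println("LinkedHashSet order: "
            + Arrays.toString(joinOrdered(gora, existing)));
    }
}
```

The order matters because, as far as we can tell, Hadoop's 
SerializationFactory picks the first registered serialization that accepts a 
given class.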

After we patched GoraMapReduceUtils#setIOSerializations to explicitly place 
PersistentSerialization at the top of the list, the instability went away. 
Moreover, we no longer need to patch Avro: one simple change in Gora and 
everything works.

So we propose moving PersistentSerialization to the top of the serializations list.
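
A rough sketch of the idea follows. It is not the patch itself: the class 
SerializationOrderSketch and the method putFirst are hypothetical, and the 
code works on a plain comma-separated string instead of a Hadoop 
Configuration so that it stays self-contained. It builds the new value with 
the preferred serialization first, followed by the existing entries in their 
original order, without duplicates.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class SerializationOrderSketch {

    // Class name used as a plain string; no Gora dependency is needed here.
    static final String PERSISTENT =
        "org.apache.gora.mapreduce.PersistentSerialization";

    // Returns a comma-separated list with `preferred` first, followed by the
    // entries of `existingCsv` in their original order, without duplicates.
    static String putFirst(String preferred, String existingCsv) {
        Set<String> ordered = new LinkedHashSet<>();
        ordered.add(preferred);
        if (existingCsv != null) {
            for (String name : existingCsv.split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty()) {
                    // Re-adding `preferred` is a no-op and keeps it in front.
                    ordered.add(trimmed);
                }
            }
        }
        return String.join(",", ordered);
    }

    public static void main(String[] args) {
        // Assumed existing value of the io.serializations property.
        String existing =
            "org.apache.hadoop.io.serializer.WritableSerialization,"
            + "org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,"
            + "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization";

        System.out.println(putFirst(PERSISTENT, existing));
    }
}
```

In a real job, the resulting string would presumably be written back to the 
io.serializations property of the job configuration.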



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
