[ 
https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532684
 ] 

Vivek Ratan commented on HADOOP-1986:
-------------------------------------

I'm reading this discussion in a slightly different way. [I'm deviating a bit 
from the discussion that has been going on, but I will address that at the 
end.] The main issue seems to be (or perhaps should be): how do we use 
different serializers/deserializers within Hadoop (Thrift, Record I/O, 
java.io.Serializable, whatever - I refer to them as serialization platforms)? 
One clear benefit of supporting various options is that people no longer have 
to write their own serialization/deserialization code; it gets handled 
automatically, no matter what object is being serialized/deserialized. Seems 
to me like everyone agrees that we need (or at least should think about) a way 
for Hadoop to support multiple serialization platforms and automatic 
serialization/deserialization. 

How do we do it? There are typically two steps, or two issues, to handle when 
serializing/deserializing - how does one walk through the member variables of 
an arbitrary object (which themselves can be objects) till we're down to basic 
types, and how does one serialize/deserialize the basic types. 

The former is typically done in one of two ways: you use a DDL/IDL to 
statically generate stubs containing code that walks through objects of a 
particular type, or you use a generic 'walker' that can dynamically walk 
through any object (I can only think of using reflection to do this, so this 
approach seems restricted to languages that support reflection). Thrift and 
Record I/O both use DDLs, mostly because they need to support languages that 
do not have reflection, while Java's serialization uses reflection. 
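
Just to make the second approach concrete, a bare-bones reflection walker 
might look something like the sketch below. This is illustrative only, not 
existing Hadoop code; java.io.DataOutput stands in for whatever basic-type 
writer a platform provides.

{code}
import java.io.DataOutput;
import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Illustrative sketch: recursively walk an object's fields via reflection,
// handing basic types off to a writer. A real walker would also handle
// arrays, collections, nulls, cycles, transient fields, etc.
public class ReflectionWalker {
  public void walk(Object obj, DataOutput out)
      throws IOException, IllegalAccessException {
    if (obj == null) {
      return;
    }
    for (Field f : obj.getClass().getDeclaredFields()) {
      if (Modifier.isStatic(f.getModifiers())) {
        continue;                          // skip statics
      }
      f.setAccessible(true);
      Class<?> type = f.getType();
      if (type == int.class) {
        out.writeInt(f.getInt(obj));
      } else if (type == long.class) {
        out.writeLong(f.getLong(obj));
      } else if (type == String.class) {
        out.writeUTF((String) f.get(obj));
      } else {
        walk(f.get(obj), out);             // nested object: recurse
      }
    }
  }
}
{code}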

For the second step (serializing/deserializing basic types), serialization 
platforms like Thrift and Record I/O provide support for reading/writing basic 
types using a protocol (binary, text, etc.) and a transport or stream (file, 
socket, etc.). Same with Java. This is typically invoked through an interface 
that contains methods to read/write basic types from/to a stream. 
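
In Java terms, the shape of such an interface is roughly the following (the 
name and method list here are illustrative, not an existing Hadoop, Thrift, or 
Record I/O API):

{code}
import java.io.IOException;

// Illustrative sketch: the kind of per-basic-type read/write interface a
// serialization platform exposes, independent of the protocol (binary, text)
// and the transport (file, socket) it was configured with.
public interface BasicTypeSerializer {
  void writeBoolean(boolean b) throws IOException;
  void writeInt(int i) throws IOException;
  void writeLong(long l) throws IOException;
  void writeDouble(double d) throws IOException;
  void writeString(String s) throws IOException;

  boolean readBoolean() throws IOException;
  int readInt() throws IOException;
  long readLong() throws IOException;
  double readDouble() throws IOException;
  String readString() throws IOException;
}
{code}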

Now, how does all this apply to Hadoop? I'm thinking about serialization not 
just for key-value pairs in Map/Reduce, but also in other places - Hadoop RPC, 
reading and writing data in HDFS, etc. For walking through an arbitrary object 
in Hadoop, one option is to modify the Record I/O or Thrift compilers to spit 
out Hadoop-compatible classes (which implement Writable, for example). A 
better option, since Hadoop is written in Java, is to have a generic walker 
that uses reflection to walk through any object. A user would invoke a generic 
Hadoop serializer class, which in turn would call a walker object, or would 
itself walk through any object using reflection. 
{code}
class HadoopSerializer {
  // initialize with the serialization platform you prefer
  public static void init(...);
  public static void serialize(Object obj, OutputStream out);
  public static void deserialize(Object obj, InputStream in);
}
{code}
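
Usage would then look something like this (the init arguments, MyKey, and the 
stream variables are hypothetical, just to show the calling pattern):

{code}
// Hypothetical usage: pick a platform once, then serialize anything.
HadoopSerializer.init("recordio", "binary");     // platform + format, say
MyKey key = new MyKey();
HadoopSerializer.serialize(key, outputStream);   // walks key via reflection
HadoopSerializer.deserialize(key, inputStream);  // fills key back in
{code}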

This walker object (for lack of a better name) now needs to link to a 
serialization platform of choice - Record I/O, Thrift, whatever. All it does 
is invoke the serialize/deserialize for individual types. In 
HadoopSerializer:init(), the user can say what platform they want to use, 
along with what format and what transport, or some such thing. Or you could 
make HadoopSerializer an interface, have implementations for each 
serialization platform we support, and let the user configure that when 
running a job. 

Now, for the walker object to invoke any of the serialization platforms, the 
platforms probably need to implement a common interface (which contains 
methods for serializing/deserializing individual types such as ints or longs 
into a stream). The other option is to register classes for each type (similar 
to what has been discussed earlier), but this, IMO, is a pain, as most people 
probably want to use the same platform for all types. You probably would not 
want to serialize ints using Record I/O and strings using Thrift. So, in order 
to integrate a serialization platform into Hadoop, we'll need a wrapper for 
each platform which implements the generic interface invoked by the walker. 
This is quite easy to do - the wrapper is likely to be thin and will simply 
delegate to the appropriate Thrift or Record I/O object. Having a base class 
for Thrift would make the Thrift wrapper for Hadoop quite simple, but we can 
probably still implement a wrapper without that base class. 
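
To make the shape of such a wrapper concrete, here is a sketch that implements 
the illustrative BasicTypeSerializer interface from above by delegating to 
plain java.io.DataOutputStream/DataInputStream. A Thrift or Record I/O wrapper 
would look the same, delegating each call to that platform's protocol/record 
writer instead. None of this is existing Hadoop code.

{code}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative wrapper: binds the generic walker-facing interface to one
// concrete "platform" (here just Java's DataOutput/DataInput binary format).
public class JavaBinarySerializer implements BasicTypeSerializer {
  private final DataOutputStream out;
  private final DataInputStream in;

  public JavaBinarySerializer(DataOutputStream out, DataInputStream in) {
    this.out = out;
    this.in = in;
  }

  public void writeBoolean(boolean b) throws IOException { out.writeBoolean(b); }
  public void writeInt(int i) throws IOException { out.writeInt(i); }
  public void writeLong(long l) throws IOException { out.writeLong(l); }
  public void writeDouble(double d) throws IOException { out.writeDouble(d); }
  public void writeString(String s) throws IOException { out.writeUTF(s); }

  public boolean readBoolean() throws IOException { return in.readBoolean(); }
  public int readInt() throws IOException { return in.readInt(); }
  public long readLong() throws IOException { return in.readLong(); }
  public double readDouble() throws IOException { return in.readDouble(); }
  public String readString() throws IOException { return in.readUTF(); }
}
{code}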

Let me get back to the existing discussion now. Seems to me like we haven't 
addressed the bigger issue of how to make it easy for a user to 
serialize/deserialize data, and that we're instead just moving around the 
functionality (and I don't mean that in a dismissive way). I don't think you 
want a serializer/deserializer per class. Someone still needs to implement the 
code for serializing/deserializing that class, and I don't see any discussion 
of Hadoop-side support for Thrift or Record I/O that the user can just invoke. 
Plus, if you think of using this mechanism for Hadoop RPC, we will end up with 
a great many instances of the Serializer<T> interface. You're far better off 
having a HadoopSerializer class that takes in any object and automatically 
serializes/deserializes it. All a user has to do is decide which serialization 
platform to use. There is a bit of one-time work in integrating a platform 
into Hadoop (writing the wrapper class that implements the interface called by 
HadoopWalker), but it's not much. What I'm suggesting also matches, I think, 
Joydeep's experience with using Thrift with HDFS. 
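
For contrast, the per-class approach reads to me as roughly the following 
shape (sketched from this discussion, not from any committed Hadoop 
interface), with one implementation needed per key/value type:

{code}
// Rough sketch of the per-type approach under discussion: every key/value
// class needs its own Serializer<T> implementation (or a registered binding),
// which someone has to write and wire up.
public interface Serializer<T> {
  void serialize(T t, java.io.OutputStream out) throws java.io.IOException;
  T deserialize(java.io.InputStream in) throws java.io.IOException;
}
{code}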

Or maybe I've completely gone on a tangent. I've been neck deep in Record I/O 
code and everything looks like a serializable Jute Record to me. 



> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
>                 Key: HADOOP-1986
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Tom White
>             Fix For: 0.16.0
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable 
> key-value pairs. While it's possible to write Writable wrappers for other 
> serialization frameworks (such as Thrift), this is not very convenient: it 
> would be nicer to be able to use arbitrary types directly, without explicit 
> wrapping and unwrapping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
