[
https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532684
]
Vivek Ratan commented on HADOOP-1986:
-------------------------------------
I'm reading this discussion in a slightly different way. [I'm deviating a bit
from the discussion that has been going on, but I will address that at the
end.] The main issue seems to be (or perhaps, should be): how do we use
different serializers/deserializers within Hadoop (Thrift, Record I/O,
java.io.Serializable, whatever - I refer to them as serialization platforms)?
One clear benefit of supporting various options is that people no longer have
to write their own serialization/deserialization code; it gets handled
automatically, no matter what object is being serialized/deserialized. Seems to
me like everyone agrees that we need (or at least should think about) a way for
Hadoop to support multiple serialization platforms and automatic
serialization/deserialization.
How do we do it? There are typically two steps, or two issues, to handle when
serializing/deserializing: how does one walk through the member variables of
an arbitrary object (which themselves can be objects) until we're down to basic
types, and how does one serialize/deserialize the basic types?
The former is typically done in one of two ways: you use a DDL/IDL to
statically generate stubs containing code that walks through objects of a
particular type, or you use a generic 'walker' that can dynamically walk
through any object (I can only think of using reflection to do this, so this
approach seems restricted to languages that support reflection). Thrift and
Record I/O both use DDLs, mostly because they need to support languages that do
not have reflection, while Java's serialization uses reflection.
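To make the reflection approach concrete, here's a rough sketch of what such a
generic walker might look like (purely illustrative - the class and method
names are made up, not an existing Hadoop API):
{code}
// Purely illustrative sketch of a reflection-based walker (not existing Hadoop
// code). It recurses through an object's fields until it reaches basic types,
// which it hands to a visitor; superclass fields are omitted for brevity.
import java.lang.reflect.Field;

public class ReflectionWalker {

  // Callback that a serialization platform would implement for basic types.
  public interface Visitor {
    void visitBasic(String name, Object value);
  }

  public void walk(Object obj, Visitor visitor) throws IllegalAccessException {
    for (Field field : obj.getClass().getDeclaredFields()) {
      field.setAccessible(true);
      Object value = field.get(obj);
      Class<?> type = field.getType();
      if (type.isPrimitive() || type == String.class) {
        visitor.visitBasic(field.getName(), value);  // basic type: let the platform serialize it
      } else if (value != null) {
        walk(value, visitor);                        // nested object: recurse
      }
    }
  }
}
{code}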
For the second step (that of serializing/deserializing basic types),
serialization platforms like Thrift and Record I/O provide support for doing so
using a protocol (binary, text, etc.) and a stream (file, socket, etc.). Same
with Java. This is typically exposed through an interface which contains
methods to read/write basic types to and from a stream.
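As an illustration, such an interface might look something like this (again a
hypothetical sketch, not the actual Thrift or Record I/O API; only the write
side is shown):
{code}
// Hypothetical interface for the second step: writing basic types to a stream
// using some protocol (binary, text, ...). Names are illustrative only; the
// read side would mirror this with readInt(), readString(), and so on.
import java.io.IOException;

public interface BasicTypeWriter {
  void writeBoolean(boolean b) throws IOException;
  void writeInt(int i) throws IOException;
  void writeLong(long l) throws IOException;
  void writeFloat(float f) throws IOException;
  void writeDouble(double d) throws IOException;
  void writeString(String s) throws IOException;
  void writeBytes(byte[] bytes) throws IOException;
}
{code}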
Now, how does all this apply to Hadoop? I'm thinking about serialization not
just for key-value pairs for Map/Reduce, but also in other places - Hadoop RPC,
reading and writing data in HDFS, etc. For walking through an arbitrary object
in Hadoop, one option is to modify the Record I/O or Thrift compilers to spit
out Hadoop-compatible classes (which implement Writable, for example). A better
option, since Hadoop's written in Java, is to have a generic walker that uses
reflection to walk through any object. A user would invoke a generic Hadoop
serializer class, which in turn would call a walker object or itself walk
through any object using reflection.
{code}
class HadoopSerializer {
  // initialize with the serialization platform you prefer
  public static void init(...);
  public static void serialize(Object obj, OutputStream out);
  public static void deserialize(Object obj, InputStream in);
}
{code}
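Usage would then be something like this (again hypothetical - MyRecord, the
platform name, and the streams are made up for illustration):
{code}
// Hypothetical usage of the HadoopSerializer sketched above.
HadoopSerializer.init("recordio");             // pick a platform once: recordio, thrift, java, ...
MyRecord record = new MyRecord(42, "hello");   // any plain Java object
HadoopSerializer.serialize(record, out);       // 'out' is whatever OutputStream you're writing to
// ... and on the reading side ...
MyRecord copy = new MyRecord();
HadoopSerializer.deserialize(copy, in);        // 'in' is the corresponding InputStream
{code}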
This walker object (for lack of a better name) now needs to link to a
serialization platform of choice - Record I/O, Thrift, whatever. All it does is
invoke the serialize/deserialize methods for individual types. In
HadoopSerializer.init(), the user can say what platform they want to use, along
with what format and what transport, or some such thing. Or you could make
HadoopSerializer an interface, have implementations for each serialization
platform we support, and let the user configure that when running a job. Now,
for the walker object to be able to invoke any of the serialization platforms, the
platforms probably need to implement a common interface (which contains methods
for serializing/deserializing individual types such as ints or longs into a
stream). The other option is to register classes for each type (similar to what
has been discussed earlier), but this, IMO, is a pain as most people probably
want to use the same platform for all types. You probably would not want to
serialize ints using Record I/O and strings using Thrift. So, in order to
integrate a serialization platform into Hadoop, we'll need a wrapper for each
platform which implements the generic interface invoked by the walker. This is
quite easy to do - the wrapper is likely to be thin and will simply delegate to
the appropriate Thrift or Record I/O object. Having a base class for Thrift
would make the Thrift wrapper for Hadoop quite simple, but we can probably
still implement a wrapper without that base class.
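For example, a wrapper over plain Java binary output (written against the
hypothetical BasicTypeWriter interface sketched earlier; the real interface
would be whatever we settle on) could just delegate to a DataOutputStream:
{code}
// Hypothetical thin wrapper: adapts plain java.io.DataOutputStream to the
// common interface the walker calls. A Thrift or Record I/O wrapper would
// delegate to that platform's writer in the same way.
import java.io.DataOutputStream;
import java.io.IOException;

public class JavaBinaryWriter implements BasicTypeWriter {
  private final DataOutputStream out;

  public JavaBinaryWriter(DataOutputStream out) {
    this.out = out;
  }

  public void writeBoolean(boolean b) throws IOException { out.writeBoolean(b); }
  public void writeInt(int i) throws IOException { out.writeInt(i); }
  public void writeLong(long l) throws IOException { out.writeLong(l); }
  public void writeFloat(float f) throws IOException { out.writeFloat(f); }
  public void writeDouble(double d) throws IOException { out.writeDouble(d); }
  public void writeString(String s) throws IOException { out.writeUTF(s); }
  public void writeBytes(byte[] bytes) throws IOException {
    out.writeInt(bytes.length);  // length-prefix the raw bytes
    out.write(bytes);
  }
}
{code}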
Let me get back to the existing discussion now. Seems to me like we haven't
addressed the bigger issue of how to make it easy on a user to
serialize/deserialize data, and that we're instead just moving around the
functionality (and I don't mean that in a dismissive way). I don't think you
want a serializer/deserializer per class. Someone still needs to implement the
code for serializing/deserializing that class, and I don't see any discussion of
Hadoop support for Thrift or Record I/O which the user can just invoke. Plus, if
you think of using this mechanism for Hadoop RPC, we will have so many
instances of the Serializer<T> interface. You're far better off having a
HadoopSerializer class that takes in any object and automatically
serializes/deserializes it. All a user has to do is decide which serialization
platform to use. There is a bit of one-time work in integrating a platform into
Hadoop (writing the wrapper class that implements the interface called by
HadoopWalker), but it's not much. What I'm suggesting also matches, I think,
Joydeep's experience with using Thrift with HDFS.
Or maybe I've completely gone on a tangent. I've been neck deep in Record I/O
code and everything looks like a serializable Jute Record to me.
> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
> Key: HADOOP-1986
> URL: https://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Fix For: 0.16.0
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable
> key-value pairs. While it's possible to write Writable wrappers for other
> serialization frameworks (such as Thrift), this is not very convenient: it
> would be nicer to be able to use arbitrary types directly, without explicit
> wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.