[ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539065 ]
Vivek Ratan commented on HADOOP-1986:
-------------------------------------

>> Why must you have a singleton serializer instance that handles more than one
>> class?

For many reasons. An easy one I can think of is that serializer instances can have state (an input or output stream that they keep open across serializations, for example). We've been talking about stateful serializers earlier in this discussion, and it seems quite possible we'll associate state with serializers for performance. Say you have a key class and a value class, both generated by the Record I/O compiler, so that both inherit from the Record base class, and say you want to serialize a number of keys and values into a file: a key, followed by a value, followed by another key, and so on. If you have a separate serializer instance for each of the key and value classes, they need to share the same OutputStream object for the file you serialize them to. Having one serializer instance that handles both keys and values (since they're both Records) is cleaner and easier; a minimal sketch of such a shared serializer follows at the end of this comment. It's also quite possible that we'll have serialization platforms that carry other state (maybe they use libraries that need to be initialized once, for example). So forcing people not to create serializers that handle more than one class seems restrictive. The choice of whether a serialization platform shares an instance across multiple classes should be left to the platform.

>> So would clients like SequenceFile and the mapreduce shuffle require
>> different code to deserialize different classes? We need to have generic
>> client code.

Yes, and that is the fundamental tradeoff. The flip side of what I'm suggesting is that the client has to write separate code for the two kinds of serializers. That's not great, but I'm arguing that it is better than restricting the kinds of serialization platforms we can use, or how we use them. The client would have to write something like this (a sketch of the two-method interface it implies also follows below):

{code}
if (serializer.acceptObjectReference()) {
  <some Class> o = new <some Class>();
  serializer.deserialize(o);
  ...
} else {
  <some Class> o = serializer.deserialize();
  ...
}
{code}

Yeah, it's not great, but it's not so bad either, compared to forcing serialization platforms not to create shared serializers. But it is a tradeoff. If folks think we're OK with forcing serialization platforms not to share serializer instances across classes, resulting in cleaner client code, then that's fine. I personally would choose the opposite. But I hope the tradeoff and the pros and cons are clear.

>> Again, I don't see why Record I/O, where we control the code generation from
>> an IDL, cannot generate a no-arg ctor. Similarly for Thrift. The ctor does
>> not have to be public. We already bypass protections when we create
>> instances.

Well, yes for Thrift and Record I/O, but maybe not for some other platform we may want to support in the future (and whose code we cannot control); a sketch of the reflection trick involved is also at the end of this comment. And besides, no-arg constructors are not the main reason for supporting a single deserialize method; singleton serializers are.
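
To make the shared-state argument concrete, here's a minimal sketch of a single stateful serializer instance handling every Record subclass while keeping one stream open across calls. The RecordSerializer class and the Record.write(DataOutput) method are illustrative assumptions for this sketch, not an existing Hadoop API:

{code}
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Assumed base class: stands in for the common ancestor the Record I/O
// compiler gives every generated type.
abstract class Record {
  public abstract void write(DataOutput out) throws IOException;
}

// One serializer instance for *all* Record subclasses; the open stream
// is the shared state that makes a singleton attractive.
class RecordSerializer {
  private final DataOutputStream out;

  RecordSerializer(OutputStream stream) {
    this.out = new DataOutputStream(stream);
  }

  // Works for keys and values alike, since both extend Record.
  void serialize(Record r) throws IOException {
    r.write(out);
  }

  void close() throws IOException {
    out.close();
  }
}
{code}

With one such instance, writing key, value, key, value to a file is just repeated serialize() calls; with a separate per-class serializer for each of the key and value classes, the two instances would instead have to coordinate around the shared stream.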
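The client-side branch above assumes a deserializer interface along these lines. The names (acceptObjectReference in particular) are placeholders carried over from the snippet, not a committed API:

{code}
import java.io.IOException;

interface Deserializer<T> {
  // True if this platform can fill in a caller-supplied instance
  // (the Writable style), letting clients reuse objects.
  boolean acceptObjectReference();

  // Style 1: deserialize into an existing object; requires that the
  // caller can construct one, e.g. via a no-arg constructor.
  void deserialize(T object) throws IOException;

  // Style 2: the deserializer creates and returns the object itself,
  // so the platform controls instantiation.
  T deserialize() throws IOException;
}
{code}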
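And regarding "we already bypass protections when we create instances": a non-public no-arg constructor is no obstacle to a framework, as this standard-reflection sketch shows (the Instantiator helper is hypothetical):

{code}
import java.lang.reflect.Constructor;

final class Instantiator {
  // Creates an instance through the no-arg constructor even if it is
  // private or package-private.
  static <T> T newInstance(Class<T> cls) throws Exception {
    Constructor<T> ctor = cls.getDeclaredConstructor();
    ctor.setAccessible(true); // bypass the access modifier
    return ctor.newInstance();
  }
}
{code}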
> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
>                 Key: HADOOP-1986
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.16.0
>
>         Attachments: SerializableWritable.java, serializer-v1.patch
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable
> key-value pairs. While it's possible to write Writable wrappers for other
> serialization frameworks (such as Thrift), this is not very convenient: it
> would be nicer to be able to use arbitrary types directly, without explicit
> wrapping and unwrapping.