[ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539065 ]
Vivek Ratan commented on HADOOP-1986:
-------------------------------------

>> Why must you have a singleton serializer instance that handles more than one
>> class?

For many reasons. An easy one I can think of is that serializer instances can have state (an input or output stream that they keep open across serializations, for example). We've been talking about stateful serializers earlier in this discussion, and it seems quite possible we'll associate state with serializers for performance. Say you have a key class and a value class, both generated by the Record I/O compiler, so that both inherit from the Record base class, and say you want to serialize a number of keys and values into a file: a key, followed by a value, followed by another key, and so on. If you have a separate serializer instance for each of the key and value classes, they need to share the same OutputStream object for the file you serialize them to. Having one serializer instance that handles both keys and values (since they're both Records) is cleaner and easier; a minimal sketch of such a shared serializer follows at the end of this comment. It's also quite possible that we'll have serialization platforms that carry other state (maybe they use libraries that need to be initialized once, for example). So forcing people not to create serializers that handle more than one class seems restrictive. The choice of whether a serialization platform shares an instance across multiple classes should be left to the platform.

>> So would clients like SequenceFile and the mapreduce shuffle require
>> different code to deserialize different classes? We need to have generic
>> client code.

Yes, and that is the fundamental tradeoff. The flip side of what I'm suggesting is that the client has to write separate code for the two kinds of serializers. That's not great, but I'm arguing that it is better than restricting the kinds of serialization platforms we can use, or how we use them. The client would have to write something like this (a sketch of the two-method interface it implies also follows below):

{code}
if (serializer.acceptObjectReference()) {
  <some Class> o = new <some Class>();
  serializer.deserialize(o);
  ...
} else {
  <some Class> o = serializer.deserialize();
  ...
}
{code}

Yeah, it's not great, but it's not so bad either, compared to forcing serialization platforms not to create shared serializers. But it is a tradeoff. If folks think we're OK with forcing serialization platforms not to share serializer instances across classes, resulting in cleaner client code, then that's fine. I personally would choose the opposite. But I hope the tradeoff and the pros and cons are clear.

>> Again, I don't see why Record I/O, where we control the code generation from
>> an IDL, cannot generate a no-arg ctor. Similarly for Thrift. The ctor does
>> not have to be public. We already bypass protections when we create
>> instances.

Well, yes for Thrift and Record I/O, but maybe not for some other platform we may want to support in the future (and whose code we cannot control); a sketch of the reflection trick involved is also at the end of this comment. And besides, no-arg constructors are not the main reason for supporting a single deserialize method; singleton serializers are.
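
To make the shared-state argument concrete, here's a minimal sketch of a single stateful serializer instance handling every Record subclass while keeping one stream open across calls. The RecordSerializer class and the Record.write(DataOutput) method are illustrative assumptions for this sketch, not an existing Hadoop API:

{code}
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Assumed base class: stands in for the common ancestor the Record I/O
// compiler gives every generated type.
abstract class Record {
  public abstract void write(DataOutput out) throws IOException;
}

// One serializer instance for *all* Record subclasses; the open stream
// is the shared state that makes a singleton attractive.
class RecordSerializer {
  private final DataOutputStream out;

  RecordSerializer(OutputStream stream) {
    this.out = new DataOutputStream(stream);
  }

  // Works for keys and values alike, since both extend Record.
  void serialize(Record r) throws IOException {
    r.write(out);
  }

  void close() throws IOException {
    out.close();
  }
}
{code}

With one such instance, writing key, value, key, value to a file is just repeated serialize() calls; with a separate per-class serializer for each of the key and value classes, the two instances would instead have to coordinate around the shared stream.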
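The client-side branch above assumes a deserializer interface along these lines. The names (acceptObjectReference in particular) are placeholders carried over from the snippet, not a committed API:

{code}
import java.io.IOException;

interface Deserializer<T> {
  // True if this platform can fill in a caller-supplied instance
  // (the Writable style), letting clients reuse objects.
  boolean acceptObjectReference();

  // Style 1: deserialize into an existing object; requires that the
  // caller can construct one, e.g. via a no-arg constructor.
  void deserialize(T object) throws IOException;

  // Style 2: the deserializer creates and returns the object itself,
  // so the platform controls instantiation.
  T deserialize() throws IOException;
}
{code}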
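And regarding "we already bypass protections when we create instances": a non-public no-arg constructor is no obstacle to a framework, as this standard-reflection sketch shows (the Instantiator helper is hypothetical):

{code}
import java.lang.reflect.Constructor;

final class Instantiator {
  // Creates an instance through the no-arg constructor even if it is
  // private or package-private.
  static <T> T newInstance(Class<T> cls) throws Exception {
    Constructor<T> ctor = cls.getDeclaredConstructor();
    ctor.setAccessible(true); // bypass the access modifier
    return ctor.newInstance();
  }
}
{code}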
> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
>                 Key: HADOOP-1986
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.16.0
>
>         Attachments: SerializableWritable.java, serializer-v1.patch
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable
> key-value pairs. While it's possible to write Writable wrappers for other
> serialization frameworks (such as Thrift), this is not very convenient: it
> would be nicer to be able to use arbitrary types directly, without explicit
> wrapping and unwrapping.