Serialization framework should use SequenceFile/TFile/other metadata to
instantiate the deserializer
-----------------------------------------------------------------------------------------
Key: HADOOP-4243
URL: https://issues.apache.org/jira/browse/HADOOP-4243
Project: Hadoop Core
Issue Type: Improvement
Components: contrib/serialization
Reporter: Pete Wyckoff
SequenceFile metadata is useful for storing additional information about the
serialized data. For RecordIO, for example, it can record whether the data is
CSV or Binary. The same applies to Thrift: Binary, JSON, ...
For Hive, this may be especially important because it has a dynamic, generic
serializer/deserializer that takes its DDL at runtime. RecordIO and Thrift, by
contrast, require pre-compilation into a specific class whose name can be
stored as the sequence file's key or value class. In Hive's case, the class
name is like Record.java in RecordIO: it tells you nothing without the DDL.
One way to address this would be to pass the sequence file metadata to the
getDeserializer call in the Serialization interface. The API would then be
something like getDeserializer(Class<?>, Map<Text, Text> metadata), or take a
Properties object for the metadata instead.
I am open to other proposals, though.
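To make the proposal concrete, here is a rough sketch of what a metadata-aware getDeserializer might look like. It is only an illustration under stated assumptions, not the Hadoop API: the interface names, the "record.format" key, and the toy binary layout are all hypothetical, and Map<String, String> stands in for Map<Text, Text> so the example compiles without Hadoop on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

final class MetadataDeserializerSketch {

    /** Hypothetical stand-in for a deserializer; not Hadoop's Deserializer interface. */
    interface RecordDeserializer<T> {
        T deserialize(byte[] raw);
    }

    /** Hypothetical variant of Serialization whose factory method also receives file metadata. */
    interface MetadataAwareSerialization<T> {
        RecordDeserializer<T> getDeserializer(Class<T> c, Map<String, String> metadata);
    }

    /** Toy serialization for String[] records, stored either as CSV or as a simple binary layout. */
    static final class FieldsSerialization implements MetadataAwareSerialization<String[]> {
        static final String FORMAT_KEY = "record.format"; // hypothetical metadata key

        @Override
        public RecordDeserializer<String[]> getDeserializer(Class<String[]> c,
                                                            Map<String, String> metadata) {
            // Binary is treated as the default when the file carries no format entry.
            String format = metadata.getOrDefault(FORMAT_KEY, "binary");
            if (format.equals("csv")) {
                return raw -> new String(raw, StandardCharsets.UTF_8).split(",", -1);
            }
            if (format.equals("binary")) {
                return FieldsSerialization::readBinary;
            }
            throw new IllegalArgumentException("Unknown format: " + format);
        }

        // Toy layout: one count byte, then per field one length byte followed by UTF-8 bytes.
        private static String[] readBinary(byte[] raw) {
            int count = raw[0] & 0xFF;
            String[] fields = new String[count];
            int pos = 1;
            for (int i = 0; i < count; i++) {
                int len = raw[pos++] & 0xFF;
                fields[i] = new String(raw, pos, len, StandardCharsets.UTF_8);
                pos += len;
            }
            return fields;
        }
    }
}
```

With this shape, the same record class can be read from files written in different encodings, because the choice is driven by what the file's metadata says rather than by the class name alone.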
This also means that knowing a class implements Writable is not necessarily
enough to deserialize it, since the deserializer may behave differently
depending on the metadata. For example, RecordIO might use the metadata to
choose CSV rather than the default Binary deserialization.
There is also the separate issue of getSerializer returning the metadata to
be written to the SequenceFile/TFile.
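The issue leaves open how the serializer side would hand metadata back. One possible shape, sketched here purely as an assumption, is for the serializer itself to expose the entries that the file writer should store. The interface name, the getMetadata method, and the "record.format" key are hypothetical, and Map<String, String> again stands in for Map<Text, Text>.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

final class MetadataSerializerSketch {

    /** Hypothetical serializer that also reports the metadata a reader will need. */
    interface MetadataProvidingSerializer<T> {
        byte[] serialize(T record);

        /** Entries the file writer would copy into the file's metadata block. */
        Map<String, String> getMetadata();
    }

    /** Toy CSV serializer for String[] records (no quoting or escaping of commas). */
    static final class CsvFieldsSerializer implements MetadataProvidingSerializer<String[]> {
        @Override
        public byte[] serialize(String[] fields) {
            return String.join(",", fields).getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public Map<String, String> getMetadata() {
            // Declares the encoding so a reader can select a matching deserializer.
            return Collections.singletonMap("record.format", "csv");
        }
    }
}
```

A writer would call getMetadata() once when creating the file and serialize() for each record, so the format declaration and the encoded bytes come from the same object.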