[ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931486#action_12931486 ]
Doug Cutting commented on HADOOP-6685:
--------------------------------------

> It includes support for Avro, Thrift, ProtocolBuffers, Writables, Java
> serialization, and an adaptor for the old style serializations.
> All of the types can be put into SequenceFiles, MapFiles,
> BloomFilterMapFiles, SetFile, and ArrayFile.

Could you please explain the motivation for extending these file formats to support all of these serialization systems? The patch changes the APIs for these classes, deprecating methods and adding new methods to support new serializations. We know from experience that changing APIs has a cost, so we ought to justify that cost.

To my thinking, a priority for the project is to support file formats that can be processed by other programming languages. Avro, Thrift and ProtocolBuffers are implemented in other languages, but SequenceFile, MapFile, BloomFilterMapFile, SetFile, ArrayFile and TFile are not. Unless we intend to implement these formats in a variety of other programming languages, I don't see a big advantage of supporting so many different serialization systems from Java only. It doesn't greatly increase the expressive power available to Java developers, and the added variety introduces more potential support issues.

It would be useful if the shuffle could process things besides Writable (MAPREDUCE-1126) and it would be useful to have InputFormats and OutputFormats for language-independent file formats like Avro's (MAPREDUCE-815). Much of this patch seems like it could help implement these, but parts of it (e.g., the metadata serialization, enhancements to SequenceFile, etc.) don't seem relevant to these goals. I don't see supporting multiple Java serialization APIs as a goal in and of itself.

> Change the generic serialization framework API to use serialization-specific
> bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-6685
> URL: https://issues.apache.org/jira/browse/HADOOP-6685
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Attachments: serial.patch
>
>
> Currently, the generic serialization framework uses Map<String,String> for
> the serialization specific configuration. Since this data is really internal
> to the specific serialization, I think we should change it to be an opaque
> binary blob. This will simplify the interface for defining specific
> serializations for different contexts (MAPREDUCE-1462). It will also move us
> toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.