[ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532269 ]
Joydeep Sen Sarma commented on HADOOP-1986:
-------------------------------------------
I have been working on putting Thrift structs into HDFS, and I have been a happy
camper so far (at least as far as Hadoop/HDFS are concerned). Just for
reference, this is what it ended up looking like:
- use BytesWritable to wrap Thrift structs (and store them in SequenceFiles)
- for writing structs, I haven't had to allocate TTransport and TProtocol
objects every time; resetting the buffer in a ByteArrayOutputStream works (a
rough sketch follows below). I expect a similar strategy to work for reading
(it might need extending ByteArrayInputStream)
- as far as invoking the right serializer/deserializer goes, it's easy to do
this with reflection:
 * When loading data into HDFS, the name of the Thrift class is encoded
before the serialized struct (this is a property of the TTransport). The
function signatures for serialize/deserialize are constant, allowing easy use
of reflection.
 * Data is loaded into HDFS in a way that also lets us know the class name
for any serialized struct, so again we use reflection to deserialize while
processing. (There are different ways of arranging this.)
Of course, if the data is homogeneous, then reflection is not required. But
that's not the case with our data set.
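To make the write path concrete, here is a minimal sketch of what the above
ends up looking like, assuming current org.apache.thrift package names and a
hypothetical generated struct called LogEntry (the real code differs in
detail):
{code:java}
// Sketch only: LogEntry is a hypothetical Thrift-generated struct, and the
// package names assume a current Thrift/Hadoop release.
import java.io.ByteArrayOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TIOStreamTransport;

public class ThriftSequenceFileWriter {
  // Allocated once and reused for every struct; only the underlying
  // ByteArrayOutputStream is reset between records.
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final TBinaryProtocol protocol =
      new TBinaryProtocol(new TIOStreamTransport(buffer));

  public void write(Configuration conf, Path out, Iterable<LogEntry> entries)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, NullWritable.class, BytesWritable.class);
    BytesWritable value = new BytesWritable();
    try {
      for (LogEntry entry : entries) {
        buffer.reset();
        // encode the class name ahead of the struct so a reader can pick
        // the right deserializer later
        protocol.writeString(entry.getClass().getName());
        entry.write(protocol);  // Thrift-generated serializer
        value.set(buffer.toByteArray(), 0, buffer.size());
        writer.append(NullWritable.get(), value);
      }
    } finally {
      writer.close();
    }
  }
}
{code}
The same transport and protocol objects are reused across records; only the
stream is reset, which is what avoids the per-struct allocations mentioned
above.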
I haven't found the Writable interface to be a serious constraint in any way.
There are some inefficiencies in the above process:
1. the extra length field that BytesWritable adds
2. the use of reflection (I don't know what the overhead is like)
But I am not sure these are big burdens. (I don't even see how #2 could
conceivably be avoided.)
One of the good things about Thrift is cross-language code generation. What we
would do at some point is allow Python (or Perl) code to work on
binary-serialized data. The Streaming library already seems to allow this (it
will pass BytesWritable keys/values to external map-reduce handlers), so the
byte array can be deserialized by the Thrift-generated Python deserializer in
the Python mapper.
As far as the discussions on this thread go, I am not sure what the final
proposal is. One thing I would opine is that the map-reduce job is best placed
to know which serialization library to invoke (rather than the map-reduce
infrastructure). For example, we segregate different data types into
different files, and the Thrift class is implicit in the path name (which is
made available as a key). If I understand it correctly, Pig takes a similar
stance (the input path is mapped to a serialization library).
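To make the reflection point concrete, here is a rough sketch of the read side
under the same assumptions as the write sketch above (the class name is read
from the prefix ahead of each struct; it could just as well be derived from
the input path):
{code:java}
// Sketch only: assumes the class name was written ahead of the struct as in
// the write sketch, and that generated structs implement org.apache.thrift.TBase.
import java.io.ByteArrayInputStream;

import org.apache.hadoop.io.BytesWritable;
import org.apache.thrift.TBase;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TIOStreamTransport;

public class ReflectiveThriftReader {
  public static TBase deserialize(BytesWritable value) throws Exception {
    ByteArrayInputStream in =
        new ByteArrayInputStream(value.getBytes(), 0, value.getLength());
    TBinaryProtocol protocol = new TBinaryProtocol(new TIOStreamTransport(in));
    // the class name precedes the serialized struct
    String className = protocol.readString();
    // every generated struct has a no-arg constructor and read(TProtocol),
    // so one reflective lookup covers all of them
    TBase struct = (TBase) Class.forName(className).newInstance();
    struct.read(protocol);
    return struct;
  }
}
{code}
Caching the Class lookup by name would presumably keep the per-record
reflection cost small, though I haven't measured it.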
> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
> Key: HADOOP-1986
> URL: https://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Fix For: 0.16.0
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable
> key-value pairs. While it's possible to write Writable wrappers for other
> serialization frameworks (such as Thrift), this is not very convenient: it
> would be nicer to be able to use arbitrary types directly, without explicit
> wrapping and unwrapping.