[ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532269 ]

Joydeep Sen Sarma commented on HADOOP-1986:
-------------------------------------------

I have been working on putting Thrift structs into HDFS, and have been a happy 
camper so far (at least as far as Hadoop/HDFS are concerned). Just for 
reference, this is what it ended up looking like:

- use BytesWritable to wrap Thrift structs (and store them in SequenceFiles)
- for writing structs, I haven't had to allocate TTransport and TProtocol 
objects every time. Resetting the buffer in a ByteArrayOutputStream works; I 
expect a similar strategy to work for reading (it might need extending 
ByteArrayInputStream). See the first sketch after this list.
- as far as invoking the right serializer/deserializer goes, it's easy to do 
this with reflection (see the second sketch below):
   * when loading data into HDFS, the name of the Thrift class is encoded 
before the serialized struct (this is a property of TTransport). The 
serialize/deserialize signatures are constant, which makes reflection easy to 
use.
   * data is loaded into HDFS in a way that also lets us know the class name 
for any serialized struct, so again we use reflection to deserialize while 
processing. (There are different ways of arranging this.)
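
For concreteness, here is a minimal sketch of the write path described above. 
It assumes the present-day org.apache.thrift package names (the classes lived 
under a different package when this comment was written) and a generated 
struct implementing TBase; it is an illustration, not the exact code we use.

import java.io.ByteArrayOutputStream;

import org.apache.hadoop.io.BytesWritable;
import org.apache.thrift.TBase;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TIOStreamTransport;

public class ThriftWritableWriter {
  // Allocated once and reused for every record; only the byte buffer is reset.
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final TProtocol protocol;

  public ThriftWritableWriter() throws TException {
    protocol = new TBinaryProtocol(new TIOStreamTransport(buffer));
  }

  // Serialize one Thrift struct and wrap the bytes in a BytesWritable, which
  // can then be written as a SequenceFile record.
  public BytesWritable toWritable(TBase<?, ?> struct) throws TException {
    buffer.reset();            // reuse the same underlying buffer
    struct.write(protocol);    // Thrift binary serialization
    return new BytesWritable(buffer.toByteArray());
  }
}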

Of course, if the data is homogeneous then reflection is not required. But 
that's not the case with our data set.
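
And a corresponding sketch of the reflection-based read path: the class name 
stored alongside the bytes picks the generated struct class, and the common 
TBase.read() signature does the rest. How the name is laid out next to the 
bytes is our own arrangement, so treat the method signature as illustrative.

import java.io.ByteArrayInputStream;

import org.apache.hadoop.io.BytesWritable;
import org.apache.thrift.TBase;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TIOStreamTransport;

public class ThriftReflectionReader {
  // Deserialize a struct whose concrete class is known only by name at runtime.
  public static TBase<?, ?> deserialize(String className, BytesWritable value)
      throws Exception {
    // Every generated Thrift struct has a public no-arg constructor and
    // implements TBase, so only the instantiation needs reflection.
    TBase<?, ?> struct =
        (TBase<?, ?>) Class.forName(className).getDeclaredConstructor().newInstance();

    ByteArrayInputStream in =
        new ByteArrayInputStream(value.getBytes(), 0, value.getLength());
    struct.read(new TBinaryProtocol(new TIOStreamTransport(in)));
    return struct;
  }
}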

I haven't found the Writable interface to be a serious constraint in any way. 
There are some inefficiencies in the above process:
1. the extra length field that comes from using BytesWritable
2. the use of reflection (I don't know what the overhead is like)

But I am not sure these are big burdens. (I don't even see how #2 could 
conceivably be avoided.)

One of the good things about Thrift is cross-language code generation. What we 
would do at some point is allow Python (or Perl) code to work on 
binary-serialized data. The Streaming library already seems to allow this (it 
will pass BytesWritable keys/values to external map-reduce handlers), and the 
byte array can then be deserialized by the Thrift-generated Python deserializer 
in the Python mapper.

As far as the discussion on this thread goes, I am not sure what the final 
proposal is. One thing I would opine is that the map-reduce job is best placed 
to know what serialization library to invoke (rather than the map-reduce 
infrastructure). For example, we segregate different data types into different 
files, and the Thrift class is implicit in the path name (which is made 
available as a key). If I understand it correctly, Pig takes a similar stance 
(an input path is mapped to a serialization library).
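
To illustrate that last point, here is a hypothetical helper showing how a job 
(rather than the framework) might map an input path to the Thrift class it 
should deserialize with. The directory layout and package name are made up for 
the example; the returned name would feed the reflection-based reader sketched 
earlier.

import org.apache.hadoop.fs.Path;

public class PathToThriftClass {
  // Hypothetical package holding the generated Thrift structs.
  private static final String STRUCT_PACKAGE = "com.example.thrift.";

  // e.g. /warehouse/ClickEvent/2007-10-01/part-00000 -> com.example.thrift.ClickEvent
  // (the layout is an assumption; the point is only that the job, which knows
  // its own path conventions, picks the class rather than the framework).
  public static String structClassFor(Path file) {
    Path typeDir = file.getParent().getParent();
    return STRUCT_PACKAGE + typeDir.getName();
  }
}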

> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
>                 Key: HADOOP-1986
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Tom White
>             Fix For: 0.16.0
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable 
> key-value pairs. While it's possible to write Writable wrappers for other 
> serialization frameworks (such as Thrift), this is not very convenient: it 
> would be nicer to be able to use arbitrary types directly, without explicit 
> wrapping and unwrapping.
