Hi Jay,

This may be off-topic to you, but I feel it's related: use Avro
DataFiles. Python support is already available, as is support for
several other languages.
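
For example, reading one back in Java is only a few lines (a minimal
sketch; the file path is hypothetical, and I'm assuming the
generic-record API here -- the schema and codec are read from the file
itself):

  import java.io.File;
  import org.apache.avro.file.DataFileReader;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericRecord;

  public class AvroRead {
    public static void main(String[] args) throws Exception {
      // DataFileReader handles the codec (deflate/snappy) and reads the
      // writer's schema from the file header for you.
      DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
          new File("part-00000.avro"), new GenericDatumReader<GenericRecord>());
      for (GenericRecord rec : reader) {
        System.out.println(rec);
      }
      reader.close();
    }
  }

The Python side is similarly small, using the avro package's
DataFileReader.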

On Tue, Sep 25, 2012 at 10:57 PM, Jay Vyas <jayunit...@gmail.com> wrote:
> Hi guys!
>
> I'm trying to read some Hadoop-outputted thrift files in plain old Java
> (without using SequenceFile.Reader). The reason for this is that I
>
> (1) want to understand the sequence file format better, and
> (2) would like to be able to port this code to a language which doesn't
> have robust Hadoop sequence file I/O / thrift support (Python). My code is
> linked below.
>
> So, before reading further, if anyone has:
>
> 1) some general hints on how to create a sequence file with thrift-encoded
> key/values in Python, or
> 2) some tips on the generic approach to reading a sequence file (the
> comments describing the SequenceFile header seem a bit underspecified),
>
> I'd appreciate it!
>
> Now, here is my adventure into thrift/hdfs sequence file i/o :
>
> I've written a simple stub which, I think, should be the start of a
> sequence file reader (it just tries to skip the header and get straight to
> the data). It doesn't handle compression, though:
>
> http://pastebin.com/vyfgjML9
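>
> For reference, here is my understanding of the header layout as a plain
> Java sketch (the field order is from my reading of SequenceFile.java, the
> path is hypothetical, and version 6 files are assumed):
>
>   import java.io.DataInputStream;
>   import java.io.FileInputStream;
>   import org.apache.hadoop.io.Text;
>
>   public class HeaderDump {
>     public static void main(String[] args) throws Exception {
>       DataInputStream in =
>           new DataInputStream(new FileInputStream("part-00000"));
>       byte[] magic = new byte[3];
>       in.readFully(magic);                  // the literal bytes "SEQ"
>       byte version = in.readByte();         // 6 in current releases
>       System.out.println("key class   = " + Text.readString(in));
>       System.out.println("value class = " + Text.readString(in));
>       boolean compressed      = in.readBoolean();
>       boolean blockCompressed = in.readBoolean();
>       if (compressed) {                     // codec name, only if compressed
>         System.out.println("codec = " + Text.readString(in));
>       }
>       int metaCount = in.readInt();         // user metadata pairs
>       for (int i = 0; i < metaCount; i++) {
>         Text.readString(in);
>         Text.readString(in);
>       }
>       byte[] sync = new byte[16];           // sync marker, repeated in the body
>       in.readFully(sync);
>       in.close();
>     }
>   }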
>
> So, the pastebin stub above fails with a cryptic error: "don't know what
> type: 15" (the full message is "fail 127 don't know what type: 15").
>
> The error comes from a case statement in Thrift's TCompactProtocol, which
> attempts to determine what type of thrift field is being read:
>
>   private byte getTType(byte type) throws TProtocolException {
>     switch ((byte)(type & 0x0f)) {
>       case TType.STOP:
>         return TType.STOP;
>       case Types.BOOLEAN_FALSE:
>       case Types.BOOLEAN_TRUE:
>         return TType.BOOL;
>       ........
>       case Types.STRUCT:
>         return TType.STRUCT;
>       default:
>         throw new TProtocolException("don't know what type: " +
>             (byte)(type & 0x0f));
>     }
>   }
>
> Upon further investigation, I found that the Configuration object is (of
> course) heavily used by the SequenceFile reader, in particular to
> determine the codec. That corroborates my hypothesis that the data needs
> to be decompressed before thrift can deserialize it (an invalid type
> nibble like 15 is just what you'd expect when the compact protocol is
> pointed at still-compressed bytes).
>
> So... what I'm missing is how to manually reproduce the codec/gzip logic
> inside SequenceFile.Reader in plain old Java (i.e., without cheating and
> using the SequenceFile.Reader class that is configured in our MapReduce
> source code).
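>
> To make the question concrete, here is the kind of thing I imagine for
> one record (a sketch only: it assumes a record-compressed file with
> GzipCodec, and openPastHeader() and MyThriftValue are hypothetical
> stand-ins for the header-reading code above and my generated thrift
> class):
>
>   import java.io.ByteArrayInputStream;
>   import java.io.DataInputStream;
>   import java.util.zip.GZIPInputStream;
>   import org.apache.thrift.TDeserializer;
>   import org.apache.thrift.protocol.TCompactProtocol;
>
>   public class OneRecord {
>     public static void main(String[] args) throws Exception {
>       DataInputStream in = openPastHeader(); // hypothetical helper
>
>       int recordLen = in.readInt();     // key length + value length
>       if (recordLen == -1) {            // -1 escapes a 16-byte sync marker
>         in.skipBytes(16);
>         recordLen = in.readInt();
>       }
>       int keyLen = in.readInt();
>       byte[] key = new byte[keyLen];
>       in.readFully(key);
>       byte[] rawValue = new byte[recordLen - keyLen];
>       in.readFully(rawValue);
>
>       // Step 1: decompress. With record compression only the value is
>       // compressed; GzipCodec output should be readable as plain gzip.
>       byte[] value = new GZIPInputStream(
>           new ByteArrayInputStream(rawValue)).readAllBytes();
>
>       // Step 2: deserialize with thrift. (If the value is a BytesWritable
>       // wrapping the blob, its 4-byte length prefix must be skipped first.)
>       MyThriftValue obj = new MyThriftValue(); // hypothetical generated class
>       new TDeserializer(new TCompactProtocol.Factory()).deserialize(obj, value);
>       System.out.println(obj);
>     }
>
>     static DataInputStream openPastHeader() throws Exception {
>       // Hypothetical: open the file and consume the header exactly as in
>       // the sketch above, leaving the stream at the first record.
>       throw new UnsupportedOperationException("fill in from header sketch");
>     }
>   }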
>
> Since my end goal is to read the file in Python, I think it would be nice
> to first be able to read the sequence file in plain Java and use that as a
> template (I know my thrift objects and serialization work correctly in my
> current Java codebase when read through the SequenceFile.Reader API).
>
> Any suggestions on how I can distill the logic of the SequenceFile.Reader
> class into a simplified version specific to my data, which I can then port
> to a Python script capable of scanning a few real sequence files off of
> HDFS, would be much appreciated!
>
> In general, what are the core steps for doing I/O with sequence files that
> are compressed and/or serialized in different formats? Do we decompress
> first and then deserialize, or do both at the same time? Thanks!
>
> PS: I've added an issue on GitHub
> (https://github.com/matteobertozzi/Hadoop/issues/5) for a Python
> SequenceFile reader. If I get some helpful hints on this thread, maybe I
> can implement an example directly on matteobertozzi's python hadoop trunk.
>
> --
> Jay Vyas
> MMSB/UCHC



-- 
Harsh J
