Re: Python + hdfs written thrift sequence files: lots of moving parts!
Thanks, Harsh. In any case, I'm really curious about how sequence file headers are formatted, as the documentation in the SequenceFile javadocs seems to be very generic. To make my questions more concrete:

1) I notice that the FileSplit class has a getStart() function. It is documented as returning the place to start "processing". Does that imply that a FileSplit does, or does not, include a header? http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileSplit.html#getStart%28%29

2) Also, it's not clear to me how compression and serialization are related. These are two intricately coupled aspects of HDFS file writing, and I'm not sure what the idiom is for coordinating the compression of records with their deserialization.
Re: Python + hdfs written thrift sequence files: lots of moving parts!
Hi Jay, This may be off-topic to you, but I feel it's related: use Avro DataFiles. There's Python support already available, as well as for several other languages.

On Tue, Sep 25, 2012 at 10:57 PM, Jay Vyas wrote:
> [... original message snipped; it appears in full below ...]

-- Harsh J
Python + hdfs written thrift sequence files: lots of moving parts!
Hi guys! I'm trying to read some Hadoop-output thrift files in plain old Java (without using SequenceFile.Reader). The reason for this is that I

(1) want to understand the sequence file format better, and
(2) would like to be able to port this code to a language which doesn't have robust Hadoop sequence file I/O / thrift support (Python).

My code is linked below. So, before reading forward, if anyone has:

1) some general hints on how to create a sequence file with thrift-encoded key/values in Python (that would be very useful), or
2) some tips on the generic approach for reading a sequence file (the comments in the SequenceFile header seem to be a bit underspecified),

I'd appreciate it!

Now, here is my adventure into thrift/HDFS sequence file I/O. I've written a simple stub which, I think, should be the start of a sequence file reader (it just tries to skip the header and get straight to the data), but it doesn't handle compression:

http://pastebin.com/vyfgjML9

This code ^^ appears to fail with a cryptic error: "don't know what type: 15". The error comes from a case statement which attempts to determine what type of thrift record is being read in:

    "fail 127 don't know what type: 15"

    private byte getTType(byte type) throws TProtocolException {
      switch ((byte)(type & 0x0f)) {
        case TType.STOP:
          return TType.STOP;
        case Types.BOOLEAN_FALSE:
        case Types.BOOLEAN_TRUE:
          return TType.BOOL;
        case Types.STRUCT:
          return TType.STRUCT;
        default:
          throw new TProtocolException("don't know what type: " + (byte)(type & 0x0f));
      }
    }

Upon further investigation, I have found that the Configuration object is (of course) heavily utilized by the SequenceFile reader, in particular to determine the codec. That corroborates my hypothesis that the data needs to be decompressed or decoded before it can be deserialized by thrift.

So... I guess what's missing here is that I don't know how to manually reproduce the codec/gzip logic inside of SequenceFile.Reader in plain old Java (i.e. without cheating and using the SequenceFile.Reader class that is configured in our mapreduce source code).

With my end goal being to read the file in Python, I think it would be nice to be able to read the sequence file in Java first and use that as a template (since I know that my thrift objects and serialization work correctly in my current Java codebase when read through the SequenceFile.Reader API).

Any suggestions on how I can distill the logic of the SequenceFile.Reader class into a simplified version specific to my data, so that I can start porting it into a Python script capable of scanning a few real sequence files off of HDFS, would be much appreciated!

In general: what are the core steps for doing I/O with sequence files that are compressed and/or serialized in different formats? Do we decompress first, and then deserialize? Or do both at the same time? Thanks!

PS: I've added an issue on github here, https://github.com/matteobertozzi/Hadoop/issues/5, for a Python SequenceFile reader. If I get some helpful hints on this thread, maybe I can directly implement an example on matteobertozzi's python hadoop trunk.

--
Jay Vyas
MMSB/UCHC