So after squinting at this a bit, I believe this is the format:

Length of string: 00 23
String: 68 64 66 73 3A 2F 2F 6E 79 63 2D 71 77 73 2D 30 32 39 2F 69 6E 2D 64 69 72 2F 77 6F 72 64 73 2E 74 78 74
Start offset: 00 00 00 00 00 00 00 00
Size: 00 00 00 00 00 02 C4 AC

This should be the split for the file "hdfs://nyc-qws-029/in-dir/words.txt" from offset 0 for 181420 bytes (0x2C4AC).

That said, is there some reason why this is the format? I don't want the deserialiser I write to break from one version of Hadoop to the next.

Roshan

On Tue, Jun 16, 2009 at 9:41 AM, Roshan James <
roshan.james.subscript...@gmail.com> wrote:

> Why don't we convert input split information into the same string format
> that is displayed in the web UI? Something like this -
> "hdfs://nyc-qws-029/in-dir/words86ac4a.txt:0+184185". It's a simple format
> and we can always parse such a string in C++.
>
> Is there some reason for the current binary format? If there is a good
> reason for it, I am game to write such a deserialiser class. Is there some
> reference for this binary format that I can use to write the deserialiser?
>
> Roshan
>
>
> On Mon, Jun 15, 2009 at 5:40 PM, Owen O'Malley <omal...@apache.org> wrote:
>
>> *Sigh* We need Avro for input splits.
>>
>> That is the expected behavior. It would be great if someone wrote a C++
>> FileInputSplit class that took a binary string and converted it back to a
>> filename, offset, and length.
>>
>> -- Owen
>>
>