Thanks, it looks like I can write a line reader in C++ that does roughly what the Java version does. It also means I can deserialise my own custom formats. Thanks!
Roshan

On Tue, Jun 16, 2009 at 12:22 PM, Owen O'Malley <omal...@apache.org> wrote:
> Sorry, I forget how much isn't clear to people who are just starting.
>
> FileInputFormat creates FileSplits. The serialization is very stable and
> can't be changed without breaking things. The reason that pipes can't
> stringify it is that the string form of input splits is ambiguous (and
> since it is user code, we really can't make assumptions about it). The
> format of FileSplit is:
>
> <16 bit filename byte length>
> <filename in bytes>
> <64 bit offset>
> <64 bit length>
>
> Technically the filename uses a funky utf-8 encoding, but in practice as
> long as the filename has only ascii characters, the bytes are plain ascii.
> Look at org.apache.hadoop.io.UTF8.writeString for the precise definition.
>
> -- Owen
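
For reference, here is a rough sketch of decoding the layout Owen describes from C++. It assumes the serialized split arrives as a std::string of raw bytes, that the integers are big-endian (Java's DataOutput writes them that way), and that the filename is plain ASCII so the "funky" UTF-8 cases never come up. The names FileSplitInfo, readBigEndian and parseFileSplit are made up for illustration and are not part of the pipes API.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical holder for the three fields of a serialized FileSplit.
struct FileSplitInfo {
    std::string filename;  // raw filename bytes, assumed to be ASCII
    std::int64_t offset;   // start of the split within the file
    std::int64_t length;   // number of bytes in the split
};

// Read an n-byte big-endian unsigned integer at `pos` and advance `pos`.
// Throws if fewer than n bytes remain in the buffer.
static std::uint64_t readBigEndian(const std::string& buf, std::size_t& pos,
                                   std::size_t n) {
    if (n > buf.size() - pos) {
        throw std::runtime_error("truncated FileSplit");
    }
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < n; ++i) {
        value = (value << 8) | static_cast<unsigned char>(buf[pos + i]);
    }
    pos += n;
    return value;
}

// Decode <16 bit length><filename><64 bit offset><64 bit length>.
FileSplitInfo parseFileSplit(const std::string& buf) {
    std::size_t pos = 0;
    FileSplitInfo info;

    // 16-bit filename byte length, then the filename bytes themselves.
    std::size_t nameLen = static_cast<std::size_t>(readBigEndian(buf, pos, 2));
    if (nameLen > buf.size() - pos) {
        throw std::runtime_error("truncated FileSplit filename");
    }
    info.filename = buf.substr(pos, nameLen);
    pos += nameLen;

    // Two 64-bit fields: offset into the file, then length of the split.
    info.offset = static_cast<std::int64_t>(readBigEndian(buf, pos, 8));
    info.length = static_cast<std::int64_t>(readBigEndian(buf, pos, 8));
    return info;
}
```

A record reader could call parseFileSplit on the split bytes it is handed, open info.filename, seek to info.offset and read info.length bytes from there. Filenames containing non-ASCII characters would need the encoding rules from UTF8.writeString instead of the straight byte copy used here.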