Re: MapContext.getInputSplit() returns nothing
Thanks, it looks like I can write a line reader in C++ that roughly does what
the Java version does. This also means that I can deserialise my own custom
formats as well. Thanks!

Roshan

On Tue, Jun 16, 2009 at 12:22 PM, Owen O'Malley wrote:
> Sorry, I forget how much isn't clear to people who are just starting.
>
> FileInputFormat creates FileSplits. The serialization is very stable and
> can't be changed without breaking things. The reason that pipes can't
> stringify it is that the string form of input splits is ambiguous (and
> since it is user code, we really can't make assumptions about it). The
> format of FileSplit is:
>
> <16 bit filename byte length>
> <filename bytes>
> <64 bit offset>
> <64 bit length>
>
> Technically the filename uses a funky utf-8 encoding, but in practice as
> long as the filename has ascii characters they are ascii. Look at
> org.apache.hadoop.io.UTF8.writeString for the precise definition.
>
> -- Owen
Re: MapContext.getInputSplit() returns nothing
Sorry, I forget how much isn't clear to people who are just starting.

FileInputFormat creates FileSplits. The serialization is very stable and
can't be changed without breaking things. The reason that pipes can't
stringify it is that the string form of input splits is ambiguous (and since
it is user code, we really can't make assumptions about it). The format of
FileSplit is:

<16 bit filename byte length>
<filename bytes>
<64 bit offset>
<64 bit length>

Technically the filename uses a funky utf-8 encoding, but in practice as long
as the filename has ascii characters they are ascii. Look at
org.apache.hadoop.io.UTF8.writeString for the precise definition.

-- Owen
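The layout above maps to a straightforward deserialiser. Below is a minimal C++ sketch, assuming Java DataOutput conventions (big-endian integers) and an ASCII-only filename; the struct and function names are illustrative, not part of the pipes API:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Sketch of a deserialiser for the FileSplit layout described above.
// Assumes big-endian integers (Java DataOutput) and an ASCII-only
// filename; the UTF8.writeString edge cases are ignored here.
struct FileSplit {
    std::string filename;
    uint64_t offset;
    uint64_t length;
};

// Read `nbytes` bytes starting at `pos` as a big-endian integer.
static uint64_t readBigEndian(const std::string& data,
                              size_t pos, size_t nbytes) {
    uint64_t value = 0;
    for (size_t i = 0; i < nbytes; ++i)
        value = (value << 8) | static_cast<unsigned char>(data[pos + i]);
    return value;
}

FileSplit parseFileSplit(const std::string& raw) {
    if (raw.size() < 2)
        throw std::runtime_error("split too short");
    size_t nameLen = readBigEndian(raw, 0, 2);     // 16-bit filename byte length
    if (raw.size() < 2 + nameLen + 16)
        throw std::runtime_error("split truncated");
    FileSplit split;
    split.filename = raw.substr(2, nameLen);       // filename bytes
    split.offset = readBigEndian(raw, 2 + nameLen, 8);      // 64-bit offset
    split.length = readBigEndian(raw, 2 + nameLen + 8, 8);  // 64-bit length
    return split;
}
```

Fed the hex dump from elsewhere in this thread, this yields "hdfs://nyc-qws-029/in-dir/words.txt", offset 0, length 181420 (0x2C4AC).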
Re: MapContext.getInputSplit() returns nothing
So after squinting at this a bit, I believe this is the format:

Length of string: 00 23
String: 68 64 66 73 3A 2F 2F 6E 79 63 2D 71 77 73 2D 30 32 39 2F 69 6E 2D 64
69 72 2F 77 6F 72 64 73 2E 74 78 74
Start Offset: 00 00 00 00 00 00 00 00
Size: 00 00 00 00 00 02 C4 AC

And this should be the split for the file
"hdfs://nyc-qws-029/in-dir/words.txt" from offset 0 to 181420.

That said, is there some reason why this is the format? I don't want the
deserialiser I write to break from one version of Hadoop to the next.

Roshan

On Tue, Jun 16, 2009 at 9:41 AM, Roshan James <
roshan.james.subscript...@gmail.com> wrote:
> Why don't we convert input split information into the same string format
> that is displayed in the web UI? Something like this:
> "hdfs://nyc-qws-029/in-dir/words86ac4a.txt:0+184185". It's a simple format
> and we can always parse such a string in C++.
>
> Is there some reason for the current binary format? If there is good
> reason for it, I am game to write such a deserialiser class. Is there some
> reference for this binary format that I can use to write the deserialiser?
>
> Roshan
>
> On Mon, Jun 15, 2009 at 5:40 PM, Owen O'Malley wrote:
>> *Sigh* We need Avro for input splits.
>>
>> That is the expected behavior. It would be great if someone wrote a C++
>> FileInputSplit class that took a binary string and converted it back to a
>> filename, offset, and length.
>>
>> -- Owen
Re: MapContext.getInputSplit() returns nothing
Why don't we convert input split information into the same string format that
is displayed in the web UI? Something like this:
"hdfs://nyc-qws-029/in-dir/words86ac4a.txt:0+184185". It's a simple format
and we can always parse such a string in C++.

Is there some reason for the current binary format? If there is good reason
for it, I am game to write such a deserialiser class. Is there some reference
for this binary format that I can use to write the deserialiser?

Roshan

On Mon, Jun 15, 2009 at 5:40 PM, Owen O'Malley wrote:
> *Sigh* We need Avro for input splits.
>
> That is the expected behavior. It would be great if someone wrote a C++
> FileInputSplit class that took a binary string and converted it back to a
> filename, offset, and length.
>
> -- Owen
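For comparison, parsing the proposed webUI-style string really is simple. Here is a hypothetical sketch (names are made up), which also illustrates the ambiguity Owen points out: it has to guess that the *last* ':' separates path from offset, which breaks if the path itself contains ':' or '+':

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical parser for a "path:offset+length" split string, e.g.
// "hdfs://nyc-qws-029/in-dir/words.txt:0+184185". Splitting on the
// last ':' is a best effort; a ':' or '+' in the path itself makes
// the format ambiguous, which is why pipes avoids stringifying splits.
struct SplitDesc {
    std::string path;
    uint64_t offset;
    uint64_t length;
};

SplitDesc parseSplitString(const std::string& s) {
    size_t colon = s.rfind(':');
    size_t plus = s.find('+', colon);
    if (colon == std::string::npos || plus == std::string::npos)
        throw std::runtime_error("malformed split string");
    SplitDesc d;
    d.path = s.substr(0, colon);
    d.offset = std::stoull(s.substr(colon + 1, plus - colon - 1));
    d.length = std::stoull(s.substr(plus + 1));
    return d;
}
```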
Re: MapContext.getInputSplit() returns nothing
*Sigh* We need Avro for input splits.

That is the expected behavior. It would be great if someone wrote a C++
FileInputSplit class that took a binary string and converted it back to a
filename, offset, and length.

-- Owen
Re: MapContext.getInputSplit() returns nothing
After some debugging I see that the "string" returned by getInputSplit() has
several non-text characters in it. When dumped as hex it looks like this:

00 23 68 64 66 73 3A 2F 2F 6E 79 63 2D 71 77 73 2D 30 32 39 2F 69 6E 2D 64
69 72 2F 77 6F 72 64 73 2E 74 78 74 00 00 00 00 00 00 00 00 00 00 00 00 00
02 C4 AC

In text, this is roughly ")hdfs://nyc-qws-029/in-dir/words912415.txt"
followed by non-printable bytes. Now I was expecting a human-readable string,
something like "hdfs://nyc-qws-029/in-dir/words86ac4a.txt:0+184185", i.e. a
description of the split that I can parse out.

After a couple of quick glances at the pipes code, it looks like the Java
InputSplit object is passed to the C++ wrapper as is, without any explicit
conversion to a string. Since I am new to Hadoop, I am not sure if this is a
bug or something I am doing wrong.

Please advise,
Roshan

On Fri, Jun 12, 2009 at 7:02 PM, Roshan James <
roshan.james.subscript...@gmail.com> wrote:
> I am working with the wordcount example of Hadoop Pipes (0.20.0). I have a
> 7 machine cluster.
>
> When I look at MapContext.getInputSplit() in my map function, I see that
> it returns the empty string. I was expecting to see a filename and some
> sort of range specification or so. I am using the default Java record
> reader right now. Is this a known bug or am I missing something?
>
> best,
> Roshan
MapContext.getInputSplit() returns nothing
I am working with the wordcount example of Hadoop Pipes (0.20.0). I have a
7 machine cluster.

When I look at MapContext.getInputSplit() in my map function, I see that it
returns the empty string. I was expecting to see a filename and some sort of
range specification or so. I am using the default Java record reader right
now. Is this a known bug or am I missing something?

best,
Roshan