After some debugging I see that the "string" returned by getInputSplit() has
several non-text characters in it. When dumped as hex it looks like this -

00 23 68 64 66 73 3A 2F 2F 6E 79 63 2D 71 77 73 2D 30 32 39 2F 69 6E 2D 64
69 72 2F 77 6F 72 64 73 2E 74 78 74 00 00 00 00 00 00 00 00 00 00 00 00 00
02 C4 AC

In text, this is roughly
"#hdfs://nyc-qws-029/in-dir/words.txt" preceded by a NUL byte and followed
by a run of null bytes plus two non-printable trailing bytes. Decoding it,
the layout appears to be a 2-byte big-endian length prefix (0x0023 = 35),
the path itself, and then two big-endian 64-bit longs - a start offset of
0 and a length of 0x2C4AC (181420).

Now I was expecting a human-readable string, something like
"hdfs://nyc-qws-029/in-dir/words86ac4a.txt:0+184185" (the format that
Java's FileSplit.toString() produces), i.e. a description of the split
that I can parse out.

After a couple of quick glances at the Pipes code, it looks like the Java
InputSplit object is passed to the C++ wrapper as-is, in its serialized
form, without any explicit conversion to a string.
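
If it helps anyone else, here is a minimal sketch (my own names, untested
against the Pipes runtime) that parses those raw bytes on the C++ side,
assuming they really are a serialized FileSplit with the layout seen in the
dump above: 2-byte big-endian length prefix, path bytes, then two
big-endian 64-bit longs for start offset and length. It treats the path as
plain ASCII/UTF-8, which holds for paths like the one in my dump.

#include <cstdint>
#include <stdexcept>
#include <string>

struct ParsedSplit {
  std::string path;   // e.g. "hdfs://nyc-qws-029/in-dir/words.txt"
  int64_t start;      // byte offset of the split within the file
  int64_t length;     // number of bytes in the split
};

// Read a big-endian 64-bit integer starting at raw[pos].
static int64_t readLongBE(const std::string& raw, size_t pos) {
  int64_t v = 0;
  for (int i = 0; i < 8; ++i)
    v = (v << 8) | static_cast<uint8_t>(raw[pos + i]);
  return v;
}

// Parse the raw bytes returned by MapContext::getInputSplit(),
// assuming they hold a serialized FileSplit.
ParsedSplit parseFileSplit(const std::string& raw) {
  if (raw.size() < 2)
    throw std::runtime_error("input split too short");
  size_t pathLen = (static_cast<uint8_t>(raw[0]) << 8) |
                    static_cast<uint8_t>(raw[1]);
  if (raw.size() < 2 + pathLen + 16)
    throw std::runtime_error("input split too short");
  ParsedSplit s;
  s.path   = raw.substr(2, pathLen);
  s.start  = readLongBE(raw, 2 + pathLen);
  s.length = readLongBE(raw, 2 + pathLen + 8);
  return s;
}

Inside the map function that would be something like
ParsedSplit s = parseFileSplit(context.getInputSplit());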

Since I am new to Hadoop, I am not sure if this is a bug or something I am
doing wrong.

Please advise,
Roshan

On Fri, Jun 12, 2009 at 7:02 PM, Roshan James <
roshan.james.subscript...@gmail.com> wrote:

> I am working with the wordcount example of Hadoop Pipes (0.20.0). I have a
> 7-machine cluster.
>
> When I look at MapContext.getInputSplit() in my map function, I see that it
> returns the empty string. I was expecting to see a filename and some sort of
> range specification or so. I am using the default Java record reader right
> now. Is this a known bug or am I missing something?
>
> best,
> Roshan
>
