Re: MapContext.getInputSplit() returns nothing

2009-06-17 Thread Roshan James
Thanks, it looks like I can write a line reader in C++ that roughly does
what the Java version does. This also means that I can deserialise my own
custom formats as well. Thanks!

Roshan

On Tue, Jun 16, 2009 at 12:22 PM, Owen O'Malley  wrote:

> Sorry, I forget how much isn't clear to people who are just starting.
>
> FileInputFormat creates FileSplits. The serialization is very stable and
> can't be changed without breaking things. The reason that pipes can't
> stringify it is that the string form of input splits is ambiguous (and
> since it is user code, we really can't make assumptions about it). The
> format of FileSplit is:
>
> <16 bit filename byte length>
> <filename bytes>
> <64 bit offset>
> <64 bit length>
>
> Technically the filename uses a funky UTF-8 encoding, but in practice, as
> long as the filename contains only ASCII characters, the bytes are plain
> ASCII. Look at org.apache.hadoop.io.UTF8.writeString for the precise
> definition.
>
> -- Owen
>


Re: MapContext.getInputSplit() returns nothing

2009-06-16 Thread Owen O'Malley

Sorry, I forget how much isn't clear to people who are just starting.

FileInputFormat creates FileSplits. The serialization is very stable and
can't be changed without breaking things. The reason that pipes can't
stringify it is that the string form of input splits is ambiguous (and
since it is user code, we really can't make assumptions about it). The
format of FileSplit is:


<16 bit filename byte length>
<filename bytes>
<64 bit offset>
<64 bit length>

Technically the filename uses a funky UTF-8 encoding, but in practice, as
long as the filename contains only ASCII characters, the bytes are plain
ASCII. Look at org.apache.hadoop.io.UTF8.writeString for the precise
definition.


-- Owen


Re: MapContext.getInputSplit() returns nothing

2009-06-16 Thread Roshan James
So after squinting at this a bit, I believe this is the format:

Length of string: 00 23

String (35 bytes):
68 64 66 73 3A 2F 2F 6E 79 63 2D 71 77 73 2D 30 32 39
2F 69 6E 2D 64 69 72 2F 77 6F 72 64 73 2E 74 78 74

Start offset: 00 00 00 00 00 00 00 00
Size: 00 00 00 00 00 02 C4 AC

And this should be the split for the file "hdfs://nyc-qws-029/in-dir/words.txt",
starting at offset 0 with length 181420 (0x2C4AC).

That said, is there some reason why this is the format? I don't want the
deserialiser I write to break from one version of Hadoop to the next.

Roshan


On Tue, Jun 16, 2009 at 9:41 AM, Roshan James <
roshan.james.subscript...@gmail.com> wrote:

> Why don't we convert input split information into the same string format
> that is displayed in the web UI? Something like this:
> "hdfs://nyc-qws-029/in-dir/words86ac4a.txt:0+184185". It's a simple format
> and we can always parse such a string in C++.
>
> Is there some reason for the current binary format? If there is good reason
> for it, I am game to write such a deserialiser class. Is there some
> reference for this binary format that I can use to write the deserialiser?
>
> Roshan
>
>
> On Mon, Jun 15, 2009 at 5:40 PM, Owen O'Malley  wrote:
>
>> *Sigh* We need Avro for input splits.
>>
>> That is the expected behavior. It would be great if someone wrote a C++
>> FileInputSplit class that took a binary string and converted it back to a
>> filename, offset, and length.
>>
>> -- Owen
>>
>
>


Re: MapContext.getInputSplit() returns nothing

2009-06-16 Thread Roshan James
Why don't we convert input split information into the same string format that
is displayed in the web UI? Something like this:
"hdfs://nyc-qws-029/in-dir/words86ac4a.txt:0+184185". It's a simple format
and we can always parse such a string in C++.

Is there some reason for the current binary format? If there is good reason
for it, I am game to write such a deserialiser class. Is there some
reference for this binary format that I can use to write the deserialiser?

Roshan

On Mon, Jun 15, 2009 at 5:40 PM, Owen O'Malley  wrote:

> *Sigh* We need Avro for input splits.
>
> That is the expected behavior. It would be great if someone wrote a C++
> FileInputSplit class that took a binary string and converted it back to a
> filename, offset, and length.
>
> -- Owen
>


Re: MapContext.getInputSplit() returns nothing

2009-06-15 Thread Owen O'Malley

*Sigh* We need Avro for input splits.

That is the expected behavior. It would be great if someone wrote a C++
FileInputSplit class that took a binary string and converted it back to
a filename, offset, and length.


-- Owen


Re: MapContext.getInputSplit() returns nothing

2009-06-15 Thread Roshan James
After some debugging I see that the "string" returned by getInputSplit() has
several non-text characters in it. When dumped as hex it looks like this -

00 23 68 64 66 73 3A 2F 2F 6E 79 63 2D 71 77 73 2D 30 32 39 2F 69 6E 2D 64
69 72 2F 77 6F 72 64 73 2E 74 78 74 00 00 00 00 00 00 00 00 00 00 00 00 00
02 C4 AC

In text, this is roughly -
")hdfs://nyc-qws-029/in-dir/words912415.txt�Р".

Now I was expecting a human readable string, something like
"hdfs://nyc-qws-029/in-dir/words86ac4a.txt:0+184185". i.e. a description of
the split that I can parse out.

After a couple of quick glances at the pipes code, it looks like the Java
InputSplit object is passed to the C++ wrapper as is, without any explicit
conversion to a string.

Since I am new to Hadoop, I am not sure if this is a bug or something I am
doing wrong.

Please advise,
Roshan

On Fri, Jun 12, 2009 at 7:02 PM, Roshan James <
roshan.james.subscript...@gmail.com> wrote:

> I am working with the wordcount example of Hadoop Pipes (0.20.0). I have a
> 7 machine cluster.
>
> When I look at MapContext.getInputSplit() in my map function, I see that it
> returns the empty string. I was expecting to see a filename and some sort of
> range specification or so. I am using the default Java record reader right
> now. Is this a known bug or am I missing something?
>
> best,
> Roshan
>


MapContext.getInputSplit() returns nothing

2009-06-12 Thread Roshan James
I am working with the wordcount example of Hadoop Pipes (0.20.0). I have a 7
machine cluster.

When I look at MapContext.getInputSplit() in my map function, I see that it
returns the empty string. I was expecting to see a filename and some sort of
range specification or so. I am using the default Java record reader right
now. Is this a known bug or am I missing something?

best,
Roshan