Doing MapReduce over Har files

2009-06-23 Thread Roshan James
When I run a map reduce task over a har file as the input, I see that the input splits refer to 64 MB byte boundaries inside the part file. My mappers only know how to process the contents of each logical file inside the har file. Is there some way by which I can take the offset range specified by th
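
A minimal sketch of the mapping the question asks about: given a split's byte range inside a HAR part file, and an index of where each logical file lives in that part file, find the logical files the split should process. The `HarEntry` layout and the "process files that start inside the split" convention are assumptions for illustration; the real HAR `_index` file format is different.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory form of a HAR index entry: where one logical
// file sits inside the concatenated part file.
struct HarEntry {
  std::string name;   // logical file name inside the archive
  uint64_t offset;    // start offset within the part file
  uint64_t length;    // length of the logical file in bytes
};

// One common convention (as with line records) is that a split owns
// every logical file that *starts* inside [splitStart, splitEnd), so
// each file is processed by exactly one mapper even when it crosses
// a split boundary.
std::vector<HarEntry> filesInSplit(const std::vector<HarEntry>& index,
                                   uint64_t splitStart, uint64_t splitEnd) {
  std::vector<HarEntry> out;
  for (const HarEntry& e : index) {
    if (e.offset >= splitStart && e.offset < splitEnd) out.push_back(e);
  }
  return out;
}
```

With this rule, a mapper handed a split at a 64 MB boundary would look up the index once and then read only the whole logical files assigned to it.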

Can a hadoop pipes job be given multiple input directories?

2009-06-18 Thread Roshan James
In the documentation for Hadoop Streaming it says that the "-input" option can be specified multiple times for multiple input directories. The same does not seem to work with Pipes. Is there some way to specify multiple input directories for pipes jobs? Roshan ps. With multiple input dirs this i

Re: Data replication and moving computation

2009-06-18 Thread Roshan James
Further, look at the namenode file system browser for your cluster to see the chunking in action. http://wiki.apache.org/hadoop/WebApp%20URLs Roshan On Thu, Jun 18, 2009 at 6:28 AM, Harish Mallipeddi < harish.mallipe...@gmail.com> wrote: > On Thu, Jun 18, 2009 at 3:43 PM, rajeev gupta wrote: >

Re: JobControl for Pipes?

2009-06-18 Thread Roshan James
> with > either. I do not know how pipes interacts with either. > > On Wed, Jun 17, 2009 at 12:43 PM, Roshan James < > roshan.james.subscript...@gmail.com> wrote: > > > Hello, Is there any way to express dependencies between map-reduce jobs > > (such as in org.

Re: Pipes example wordcount-nopipe.cc failed when reading from input splits

2009-06-18 Thread Roshan James
I did get this working. InputSplit information is not returned clearly. You may want to look at this thread - http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3cee216d470906121602k7f914179u5d9555e7bb080...@mail.gmail.com%3e On Thu, Jun 18, 2009 at 12:49 AM, Jianmin Woo wrot

JobControl for Pipes?

2009-06-17 Thread Roshan James
Hello, Is there any way to express dependencies between map-reduce jobs (such as in org.apache.hadoop.mapred.jobcontrol) for pipes jobs? The provided header Pipes.hh does not seem to reflect any such capabilities. best, Roshan

Re: MapContext.getInputSplit() returns nothing

2009-06-17 Thread Roshan James
Thanks, it looks like I can write a line reader in C++ that roughly does what the Java version does. This also means that I can deserialise my own custom formats as well. Thanks! Roshan On Tue, Jun 16, 2009 at 12:22 PM, Owen O'Malley wrote: > Sorry, I forget how much isn't clear to people who a
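
A C++ line reader that "roughly does what the Java version does" would need the split-boundary handling from Hadoop's Java `LineRecordReader`: a reader starting at a non-zero offset skips its first (partial) line, and reads one line past the end of its split, so every line is consumed exactly once across adjacent splits. This is a minimal sketch over `std::istream`, not the actual Pipes reader:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read the lines belonging to the split [start, end] of a stream,
// following the LineRecordReader convention: if start != 0, discard the
// first line (it belongs to the previous split, which reads past its
// own end to finish it); then read lines while the current position is
// still <= end.
std::vector<std::string> readLinesInSplit(std::istream& in,
                                          std::streamoff start,
                                          std::streamoff end) {
  std::vector<std::string> lines;
  in.seekg(start);
  std::string line;
  if (start != 0) {
    std::getline(in, line);  // partial line owned by the previous split
  }
  while (static_cast<std::streamoff>(in.tellg()) <= end &&
         std::getline(in, line)) {
    lines.push_back(line);
  }
  return lines;
}
```

The same skeleton works for a custom binary format: replace `std::getline` with your own record deserialiser and keep the boundary rules unchanged.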

Re: MapContext.getInputSplit() returns nothing

2009-06-16 Thread Roshan James
for file "hdfs://nyc-qws-029/in-dir/words.txt" from offset 0 to 181420. That said, is there some reason why this is the format? I don't want the deserialiser I write to break from one version of Hadoop to the next. Roshan On Tue, Jun 16, 2009 at 9:41 AM, Roshan James < rosh

Re: MapContext.getInputSplit() returns nothing

2009-06-16 Thread Roshan James
Why don't we convert input split information into the same string format that is displayed in the web UI? Something like this - "hdfs://nyc-qws-029/in-dir/words86ac4a.txt:0+184185". It's a simple format and we can always parse such a string in C++. Is there some reason for the current binary format?
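
To illustrate the point that the web-UI string is trivially parseable in C++: a small parser for the assumed `path:offset+length` layout. The `SplitInfo` struct and `parseSplit` name are made up for this sketch; this is not a Hadoop Pipes API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Parsed form of a split string like
// "hdfs://nyc-qws-029/in-dir/words.txt:0+184185".
struct SplitInfo {
  std::string path;   // full HDFS path
  uint64_t offset;    // starting byte offset within the file
  uint64_t length;    // number of bytes in the split
};

// Parse from the right so the "hdfs://" colon in the path is not
// mistaken for the offset separator.
bool parseSplit(const std::string& s, SplitInfo& out) {
  std::string::size_type plus = s.rfind('+');
  if (plus == std::string::npos) return false;
  std::string::size_type colon = s.rfind(':', plus);
  if (colon == std::string::npos) return false;
  out.path = s.substr(0, colon);
  out.offset = std::stoull(s.substr(colon + 1, plus - colon - 1));
  out.length = std::stoull(s.substr(plus + 1));
  return true;
}
```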

Re: MapContext.getInputSplit() returns nothing

2009-06-15 Thread Roshan James
out. After a couple of quick glances at the pipes code, it looks like the Java InputSplit object is passed to the C++ wrapper as is, without any explicit conversion to string. Since I am new to Hadoop, I am not sure if this is a bug or something I am doing wrong. Please advise, Roshan On Fri, J

MapContext.getInputSplit() returns nothing

2009-06-12 Thread Roshan James
I am working with the wordcount example of Hadoop Pipes (0.20.0). I have a 7 machine cluster. When I look at MapContext.getInputSplit() in my map function, I see that it returns the empty string. I was expecting to see a filename and some sort of range specification of so. I am using the default j

Where in the WebUI do we see setStatus and stderr output?

2009-06-10 Thread Roshan James
Hi, I am new to Hadoop and am using Pipes and Hadoop ver 0.20.0. Can someone tell me where in the web UI we see status messages set by TaskContext::setStatus and the stderr? Also is stdout captured somewhere? Thanks in advance, Roshan

Chaining Pipes Tasks

2009-06-08 Thread Roshan James
Hi, I am trying to get started with Hadoop Pipes. Is there an example of chaining tasks (with Pipes) somewhere? If not, can someone tell me how I can specify the input and output directories for the second task. I was expecting to be able to set these values in JobConf, but Pipes seems to provide