Re: Map/Reduce with XML files ..

2008-04-29 Thread Prasan Ary
Here's the code. If folks are interested, I can submit it as a patch as well. Prasan Ary wrote: Colin, is it possible that you share some of the code with us? thx, Prasan. Colin Evans [EMAIL PROTECTED] wrote: We ended up subclassing TextInputFormat and adding
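The approach Colin describes — subclassing TextInputFormat — typically pairs with a record reader that scans the input for a fixed start/end tag and hands each complete element to a map call as one record. A minimal, stdlib-only sketch of that boundary-scanning logic (the Hadoop wrapper classes are omitted, and the `<record>` tag name is an illustrative assumption, not from the original thread):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the record-boundary logic a custom XML record reader would use:
// scan the input for START..END tag pairs and emit each complete element
// as one "record" (i.e., one map input value).
public class XmlRecordScanner {
    private static final String START = "<record>";   // illustrative tag names
    private static final String END   = "</record>";

    public static List<String> scan(String input) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = input.indexOf(START, pos);
            if (start < 0) break;
            int end = input.indexOf(END, start);
            if (end < 0) break;                        // incomplete record: stop
            end += END.length();
            records.add(input.substring(start, end));
            pos = end;                                 // continue after this record
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<doc><record>a</record>junk<record>b</record></doc>";
        System.out.println(scan(xml));                 // two records: a and b
    }
}
```

In the real subclass, the same scan would run inside a RecordReader's next() method over the split's byte range instead of an in-memory string.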

Re: on number of input files and split size

2008-04-04 Thread Prasan Ary
are I/O bound, they will be able to read 100MB of data in just a few seconds at most. Startup time for a hadoop job is typically 10 seconds or more. On 4/4/08 12:58 PM, Prasan Ary wrote: I have a question on how input files are split before they are given out to Map functions. Say I have
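The point about job startup dominating small inputs can be made concrete with the split arithmetic: each split becomes one map task, and the number of splits is roughly the file size divided by the split size (Hadoop's FileInputFormat actually allows a small slop factor on the last split, ignored here). A sketch, assuming the era's default 64 MB block size:

```java
public class SplitMath {
    // Approximate number of input splits (= map tasks) Hadoop creates for
    // one file: ceil(fileSize / splitSize), ignoring the last-split slop rule.
    public static long numSplits(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 100 MB input with a 64 MB split size yields only 2 map tasks,
        // so per-job startup overhead (~10s) dwarfs the few seconds of I/O.
        System.out.println(numSplits(100 * mb, 64 * mb));
    }
}
```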

Re: distcp fails :Input source not found

2008-04-02 Thread Prasan Ary
Anybody? Any thoughts on why this might be happening? Here is what is happening, directly from the ec2 screen. The ID and Secret Key are the only things changed. I'm running hadoop 0.15.3 from the public ami. I launched a 2-machine cluster using the ec2 scripts in src/contrib/ec2/bin

distcp fails :Input source not found

2008-04-01 Thread Prasan Ary
Hi, I am running hadoop 0.15.3 on 2 EC2 instances from a public ami (ami-381df851). Our input files are on S3. When I try to do a distcp for an input file from S3 onto hdfs on EC2, the copy fails with an error that the file does not exist. However, if I run copyToLocal from S3

Re: distcp fails :Input source not found

2008-04-01 Thread Prasan Ary
That was a typo in my email. I do have s3:// in my command when it fails. --- [EMAIL PROTECTED] wrote: bin/hadoop distcp s3//:@/fileone.txt /somefolder_on_hdfs/fileone.txt : Fails - Input source doesn't exist. Should s3//... be s3://...? Nicholas

Re: Map/reduce with input files on S3

2008-03-26 Thread Prasan Ary
and accessed from there. -- Owen O'Malley [EMAIL PROTECTED] wrote: On Mar 25, 2008, at 1:07 PM, Prasan Ary wrote: I am running hadoop on EC2. I want to run a jar MR application on EC2 such that input and output

Re: Map/reduce with input files on S3

2008-03-26 Thread Prasan Ary
I changed the configuration a little so that the MR jar file now runs on my local hadoop cluster, but takes input files from S3. I get the following output: 08/03/26 17:32:39 INFO mapred.FileInputFormat: Total input paths to process : 1 08/03/26 17:32:44 INFO mapred.JobClient: Running

Map/reduce with input files on S3

2008-03-25 Thread Prasan Ary
I am running hadoop on EC2. I want to run a jar MR application on EC2 such that input and output files are on S3. I configured hadoop-site.xml so that the fs.default.name property points to my S3 bucket with all required credentials (e.g. s3://ID:SECRET@bucket). I created an input
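The follow-up in this thread ("I changed the configuration a little so that the MR jar file now runs on my local hadoop cluster, but takes input files from S3") suggests the usual fix: keep HDFS as the default filesystem and supply the S3 credentials separately, then pass the s3:// path as the job input rather than making it fs.default.name. A hedged hadoop-site.xml sketch of that setup (hostnames, ID, and SECRET are placeholders):

```xml
<!-- Sketch for the 0.15.x-era configuration; namenode host and port,
     ID, and SECRET are placeholders, not values from the thread. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode:9000</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>
```

With the credentials in the config, the job can then be invoked with an s3:// input path and an HDFS output path, avoiding embedding the secret key (which often contains a `/`) in the URL itself.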

naming output files from Reduce

2008-03-12 Thread Prasan Ary
I have two Map/Reduce jobs and both of them output a file each. Is there a way I can name these output files differently from the default part-NNNNN names? thanks.

displaying intermediate results of map/reduce

2008-03-06 Thread Prasan Ary
Hi All, I am using Eclipse to write a map/reduce Java application that connects to hadoop on a remote cluster. Is there a way I can display intermediate results of map (or reduce), much the same way as I would use System.out.println(variable_name) if I were running any application on a