Re: Map/Reduce with XML files ..

2008-04-29 Thread Prasan Ary
function on xml string Here's the code. If folks are interested, I can submit it as a patch as well. Prasan Ary wrote: Colin, Is it possible that you share some of the code with us? thx, Prasan Colin Evans <[EMAIL PROTECTED]> wrote: We ended up subclassi
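
The attached code is cut off in the archive, but the approach described (subclassing the input format so each map call receives one complete XML record) might look roughly like the sketch below. This is written against the later org.apache.hadoop.mapred API with generics; the <record> tags and class names are placeholders, not the actual patch:

  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // Sketch: hand each map() call one whole <record>...</record> element
  // instead of one line. Tag names are placeholders for the real schema.
  public class XmlRecordInputFormat extends TextInputFormat {

    protected boolean isSplitable(FileSystem fs, Path file) {
      return false; // one file = one split, so records never straddle splits
    }

    public RecordReader<LongWritable, Text> getRecordReader(
        InputSplit split, JobConf job, Reporter reporter) throws IOException {
      final RecordReader<LongWritable, Text> lines =
          super.getRecordReader(split, job, reporter);
      return new RecordReader<LongWritable, Text>() {
        public boolean next(LongWritable key, Text value) throws IOException {
          LongWritable off = new LongWritable();
          Text line = new Text();
          StringBuilder buf = new StringBuilder();
          boolean inRecord = false;
          while (lines.next(off, line)) {
            String s = line.toString();
            if (s.contains("<record>")) { inRecord = true; key.set(off.get()); }
            if (inRecord) buf.append(s).append('\n');
            if (inRecord && s.contains("</record>")) {
              value.set(buf.toString()); // one complete record per map call
              return true;
            }
          }
          return false; // no complete record left in this file
        }
        public LongWritable createKey() { return new LongWritable(); }
        public Text createValue() { return new Text(); }
        public long getPos() throws IOException { return lines.getPos(); }
        public float getProgress() throws IOException { return lines.getProgress(); }
        public void close() throws IOException { lines.close(); }
      };
    }
  }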

RE: streaming + binary input/output data?

2008-04-14 Thread Prasan Ary
John, My meaning didn't come through. If you encode binary data and treat it like any piece of text going through hadoop's default input format, at some point your binary data might have a piece that looks like 1010, in hex it might be 0A, and in ascii, might it not be interpret
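
For what it's worth, the usual way around this is to base64-encode each binary record before it enters the streaming pipe: base64 output is plain ASCII and never contains a 0x0A byte, so line-oriented framing cannot split a record in half. A minimal sketch, assuming commons-codec is on the classpath (the class and method names below are commons-codec's, not hadoop's):

  import java.io.UnsupportedEncodingException;
  import org.apache.commons.codec.binary.Base64;

  public class Base64Framing {
    // one binary record -> one newline-free line of text
    public static String encodeRecord(byte[] raw) throws UnsupportedEncodingException {
      return new String(Base64.encodeBase64(raw), "US-ASCII");
    }

    // one line of text -> the original bytes, exactly
    public static byte[] decodeRecord(String line) throws UnsupportedEncodingException {
      return Base64.decodeBase64(line.getBytes("US-ASCII"));
    }
  }

The cost is roughly 33% size inflation, but the round trip is lossless, which addresses the \n worry raised in this thread.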

RE: streaming + binary input/output data?

2008-04-14 Thread Prasan Ary
John, That's an interesting approach, but isn't it possible that an equivalent \n might get encoded in the binary data? John Menzer <[EMAIL PROTECTED]> wrote: so you mean you changed the hadoop streaming source code? actually i am not really willing to change the source code if it's not

consolidating files on S3 using distcp

2008-04-07 Thread Prasan Ary
I have hundreds of files in an S3 bucket and I am trying to consolidate them into a smaller number of large files. I was working with the code for hadoop's 'distcp', trying to see if it is at all possible to consolidate files as they are being copied from S3 onto HDFS. So far, I haven't had any luck. Has
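
distcp itself only mirrors files one-to-one, so a separate pass is needed for the merge. One hedged sketch: after the distcp, run an identity map/reduce job whose reducer count is the number of output files wanted. Paths and the reducer count below are placeholders, and note the caveat in the comments:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class Consolidate {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(Consolidate.class);
      conf.setJobName("consolidate");
      conf.setInputPath(new Path("copied-from-s3"));  // the many small files
      conf.setOutputPath(new Path("consolidated"));   // ends up with 10 part files
      conf.setOutputKeyClass(LongWritable.class);
      conf.setOutputValueClass(Text.class);
      conf.setMapperClass(IdentityMapper.class);
      conf.setReducerClass(IdentityReducer.class);
      conf.setNumReduceTasks(10);
      // caveat: lines come out keyed (and re-sorted) by their byte offsets;
      // if exact line content or ordering matters, swap in a tiny custom
      // mapper that emits the line itself as the key.
      JobClient.runJob(conf);
    }
  }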

Re: on number of input files and split size

2008-04-04 Thread Prasan Ary
startup time. If your jobs are I/O bound, they will be able to read 100MB of data in just a few seconds at most. Startup time for a hadoop job is typically 10 seconds or more. On 4/4/08 12:58 PM, "Prasan Ary" wrote: > I have a question on how input files are split before they are giv

on number of input files and split size

2008-04-04 Thread Prasan Ary
I have a question on how input files are split before they are given out to Map functions. Say I have an input directory containing 1000 files whose total size is 100 MB, and I have 10 machines in my cluster and I have configured 10 mapred.map.tasks in hadoop-site.xml. 1. With this conf
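
As the reply above notes, the 10-task setting is only a hint. With FileInputFormat every file produces at least one split, so 1000 small files means at least 1000 map tasks no matter what the hint says. A small sketch of the relevant knobs (the class name and values are illustrative only):

  JobConf conf = new JobConf(MyJob.class);        // MyJob is a placeholder
  conf.setNumMapTasks(10);                        // a hint, not a cap
  // raising the minimum split size only merges data *within* one file;
  // it never combines separate small files into one split
  conf.set("mapred.min.split.size", "134217728"); // 128 MB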

Re: distcp fails :Input source not found

2008-04-03 Thread Prasan Ary
files from S3 to HDFS on EC2 without having to iterate through each file? [EMAIL PROTECTED] wrote: It might be a bug. Could you try the following? bin/hadoop fs -ls s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml Nicholas - Original Message From: Prasan Ary To: core-user@had
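
On the "without iterating" question: distcp accepts a directory (or the bucket root) as its source and copies the whole tree in one job, so no per-file loop is needed. A hedged example, with the credentials elided the same way as in the thread and a placeholder destination:

  bin/hadoop distcp "s3://ID:SECRET@bucket/" "hdfs://namenode:9000/from-s3/"

Putting fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey in hadoop-site.xml also works, and keeps the secret key out of the command line (useful since a '/' inside the secret key tends to break the URI form).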

Re: distcp fails :Input source not found

2008-04-02 Thread Prasan Ary
Anybody? Any thoughts why this might be happening? Here is what is happening directly from the ec2 screen. The ID and Secret Key are the only things changed. I'm running hadoop 0.15.3 from the public ami. I launched a 2-machine cluster using the ec2 scripts in src/contrib/ec2/bin.

Re: distcp fails :Input source not found

2008-04-01 Thread Prasan Ary
Here is what is happening directly from the ec2 screen. The ID and Secret Key are the only things changed. I'm running hadoop 0.15.3 from the public ami. I launched a 2-machine cluster using the ec2 scripts in src/contrib/ec2/bin . . . The file I try to copy is 9KB (I noticed previous d

Re: distcp fails :Input source not found

2008-04-01 Thread Prasan Ary
That was a typo in my email. I do have s3:// in my command when it fails. --- [EMAIL PROTECTED] wrote: > bin/hadoop distcp s3//:@/fileone.txt /somefolder_on_hdfs/fileone.txt : Fails - Input source doesn't exist. Should "s3//..." be "s3://..."? Nicholas

distcp fails :Input source not found

2008-04-01 Thread Prasan Ary
Hi, I am running hadoop 0.15.3 on 2 EC2 instances from a public ami (ami-381df851). Our input files are on S3. When I try to do a distcp for an input file from S3 onto hdfs on EC2, the copy fails with an error that the file does not exist. However, if I run copyToLocal from S3 onto

Re: Map/reduce with input files on S3

2008-03-26 Thread Prasan Ary
I changed the configuration a little so that the MR jar file now runs on my local hadoop cluster, but takes input files from S3. I get the following output: 08/03/26 17:32:39 INFO mapred.FileInputFormat: Total input paths to process : 1 08/03/26 17:32:44 INFO mapred.JobClient: Running

Re: Map/reduce with input files on S3

2008-03-26 Thread Prasan Ary
to image on EC2 and accessed from there. -- Owen O'Malley <[EMAIL PROTECTED]> wrote: On Mar 25, 2008, at 1:07 PM, Prasan Ary wrote: > I am running hadoop on EC2. I want to run a jar MR application

Map/reduce with input files on S3

2008-03-25 Thread Prasan Ary
I am running hadoop on EC2. I want to run a jar MR application on EC2 such that input and output files are on S3. I configured hadoop-site.xml so that the fs.default.name property points to my s3 bucket with all required identifications (e.g. s3://:@ ). I created an input directory in this buc
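
A couple of hedged notes on this setup. First, the credentials can live in hadoop-site.xml instead of the URI (the property names below are hadoop's; the values are placeholders):

  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_AWS_ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_AWS_SECRET</value>
  </property>

With those set, fs.default.name can stay pointed at HDFS and s3:// paths can simply be passed as the job's input and output directories. Second, and worth checking in this era of hadoop: the s3:// scheme is the S3 *block* filesystem, which only understands data hadoop itself wrote there; files uploaded to the bucket by other tools won't be visible through it.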

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Prasan Ary
oing on. also, setting up FoxyProxy on firefox lets you browse your whole cluster if you set up an ssh tunnel (SOCKS). On Mar 20, 2008, at 10:15 AM, Prasan Ary wrote: > Hi All, > I have been trying to configure Hadoop on EC2 for a large > cluster (100-plus machines). It seems that I have to

Hadoop on EC2 for large cluster

2008-03-20 Thread Prasan Ary
Hi All, I have been trying to configure Hadoop on EC2 for a large cluster (100-plus machines). It seems that I have to copy the EC2 private key to all the machines in the cluster so that they can have SSH connections. For now it seems I have to run a script to copy the key file to each of the E
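
A hedged sketch of such a script, in case it helps; the key name follows the EC2 getting-started guide's convention and the hostnames file is a placeholder (the src/contrib/ec2 scripts automate roughly this same step):

  # push the private key to every node so the master can open ssh
  # connections to the slaves
  for host in $(cat cluster-hosts.txt); do
    scp -i id_rsa-gsg-keypair id_rsa-gsg-keypair root@$host:/root/.ssh/id_rsa
    ssh -i id_rsa-gsg-keypair root@$host 'chmod 600 /root/.ssh/id_rsa'
  done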

naming output files from Reduce

2008-03-12 Thread Prasan Ary
I have two Map/Reduce jobs and both of them output a file each. Is there a way I can name these output files different from the default names of "part-" ? thanks.
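
One workaround that needs no custom output format: let each job write the default part-NNNNN files, then rename them once it finishes. A minimal sketch assuming a single reduce task per job (paths and names are placeholders); later hadoop releases also grew a MultipleTextOutputFormat for naming files by key:

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class RenameOutput {
    // run the job, then give its lone output file a friendlier name
    public static void runAndRename(JobConf conf, String outDir, String name)
        throws Exception {
      JobClient.runJob(conf);  // blocks until the job finishes
      FileSystem fs = FileSystem.get(conf);
      fs.rename(new Path(outDir, "part-00000"), new Path(outDir, name));
    }
  }

With several reducers, loop over part-00000 through part-0000N instead.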

Re: reading input file only once for multiple map functions

2008-03-12 Thread Prasan Ary
ally sounds like you have taken a bit of an odd turn somewhere in porting your algorithm to a parallel form. On 3/12/08 9:24 AM, "Prasan Ary" wrote: > I have a very large xml file as input and a couple of Map/Reduce functions. > Input key/value pair to all of my map fun

reading input file only once for multiple map functions

2008-03-12 Thread Prasan Ary
I have a very large xml file as input and a couple of Map/Reduce functions. Input key/value pair to all of my map functions is the same. I was wondering if there is a way to read the input xml file only once, then create the key/value pairs (also once) and give these k/v pairs as input to my
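
One common pattern for this: fold all the map functions into a single mapper that sees each record once and tags its outputs, so a single pass over the xml feeds every downstream computation. A hedged sketch (the tag scheme and the two stand-in "map functions" are placeholders):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class TaggedMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text record,
                    OutputCollector<Text, Text> out, Reporter r) throws IOException {
      // both "map functions" get the record during the same single read
      out.collect(new Text("A:" + key.get()), record);  // stand-in for map fn 1
      out.collect(new Text("B:" + key.get()), record);  // stand-in for map fn 2
    }
  }

The reducer then branches on the "A:"/"B:" prefix and strips it before writing.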

Map/Reduce Type Mismatch error

2008-03-07 Thread Prasan Ary
Hi All, I am running a Map/Reduce on a textfile. Map takes a (key,value) input pair and outputs a (key,value) output pair. Reduce takes a (key,value) input pair and outputs a (key,value) output pair. I am getting a type mismatch error. Any suggestion?
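
The usual cause of this error (hedged, since the actual types were eaten by the archive here): the map output types differ from the reduce output types, but only setOutputKeyClass/setOutputValueClass were called, which declare both ends at once. Declaring the map side separately typically fixes it; the Writable types below are examples only:

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;

  public class TypeFix {
    public static JobConf configure() {
      JobConf conf = new JobConf(TypeFix.class);
      conf.setMapOutputKeyClass(Text.class);        // what map() actually emits
      conf.setMapOutputValueClass(IntWritable.class);
      conf.setOutputKeyClass(Text.class);           // what reduce() actually emits
      conf.setOutputValueClass(Text.class);
      return conf;
    }
  }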

displaying intermediate results of map/reduce

2008-03-06 Thread Prasan Ary
Hi All, I am using eclipse to write a map/reduce java application that connects to hadoop on a remote cluster. Is there a way I can display intermediate results of map (or reduce) much the same way as I would use System.out.println(variable_name) if I were running any application on a single
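
Two mechanisms work without a local console (a hedged sketch; the counter names are arbitrary): anything a task prints to stdout/stderr lands in that task's log, viewable through the jobtracker web UI, and counters bumped through the Reporter show up live on the job page.

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class DebugMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, LongWritable, Text> {
    enum Debug { RECORDS_SEEN }  // appears as a counter on the job page

    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> out, Reporter reporter)
        throws IOException {
      // goes to the task log (jobtracker web UI -> job -> task -> logs)
      System.err.println("map saw: " + value);
      reporter.incrCounter(Debug.RECORDS_SEEN, 1);
      out.collect(key, value);
    }
  }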

Re: map/reduce function on xml string

2008-03-04 Thread Prasan Ary
erence for this. Prasan Ary wrote: > Hi All, > I am writing a java implementation for my map/reduce function on hadoop. > Input to this is an xml file, and the map function has to process well > formed xml records. So far I have been unable to split the xml file at xml > record bou

map/reduce function on xml string

2008-03-03 Thread Prasan Ary
Hi All, I am writing a java implementation for my map/reduce function on hadoop. Input to this is an xml file, and the map function has to process well-formed xml records. So far I have been unable to split the xml file at xml record boundary to feed into my map function. Can anybody point
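
The reply above is cut off, but the stock answer in this era is the streaming contrib's StreamXmlRecordReader, which is usable from a plain java job too. A hedged sketch (the <record> tags are placeholders for the real element, and the hadoop-streaming jar must be on the job's classpath):

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.streaming.StreamInputFormat;

  public class XmlJobSetup {
    public static JobConf configure() {
      JobConf conf = new JobConf(XmlJobSetup.class);
      conf.setInputFormat(StreamInputFormat.class);
      conf.set("stream.recordreader.class",
               "org.apache.hadoop.streaming.StreamXmlRecordReader");
      conf.set("stream.recordreader.begin", "<record>");  // placeholder tag
      conf.set("stream.recordreader.end", "</record>");
      // each map() call now receives one whole <record>...</record>
      // element as its value
      return conf;
    }
  }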