Re: AWS MapReduce

2012-03-05 Thread John Conwell
AWS MapReduce (EMR) does not use S3 for its HDFS persistence. If it did, your S3 billing would be massive :) EMR reads all input jar files and input data from S3, but it copies these files down to its local disk. It then starts the MR process, doing all HDFS reads and writes to the local disk.
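A minimal sketch of that pattern (the bucket name and paths are hypothetical): the job reads its input from and writes its final output to s3n:// URIs, while all intermediate HDFS traffic stays on the instances' local disks.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EmrPaths {
      public static void configure(Job job) throws Exception {
        // Input is streamed down from S3 when the job starts
        FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/input/"));
        // Final output goes back to S3; intermediate data never touches S3
        FileOutputFormat.setOutputPath(job, new Path("s3n://my-bucket/output/"));
      }
    }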

Re: HADOOP PIPES with CUDA

2012-02-13 Thread John Conwell
Do you mean porting existing CUDA code away from CUDA to some language like Python using Pipes? Or creating a solution that uses Pipes to chain mappers / reducers together, where the mappers and/or reducers invoke CUDA kernels? Or something else entirely? You could do something like the second...

Re: Sorting text data

2012-01-30 Thread John Conwell
If you use TextInputFormat as your MapReduce job's input format, then Hadoop doesn't need your input data to be in a sequence file. It will read your text file and call the mapper for each line in the text file (\n delimited), where the key is the byte offset of that line from the beginning of the file.
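A minimal mapper sketch against the new (org.apache.hadoop.mapreduce) API, showing the types TextInputFormat hands you: the key is the line's byte offset, the value is the line itself.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // offset = byte position of this line from the start of the file
        context.write(line, offset);
      }
    }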

Re: Running a job continuously

2011-12-05 Thread John Conwell
You might also want to take a look at Storm, as that's what it's designed to do: https://github.com/nathanmarz/storm/wiki On Mon, Dec 5, 2011 at 1:34 PM, Mike Spreitzer wrote: > Burak, > Before we can really answer your question, you need to give us some more > information on the processing you want to do...

Re: choices for deploying a small hadoop cluster on EC2

2011-11-29 Thread John Conwell
I'm a big fan of Whirr, though I don't think it supports EBS persistence. My Hadoop deployment strategy has always been: store input data on S3, spin up my Hadoop cluster with either Whirr or Elastic MapReduce, run the job, write the output data back to S3, and kill the cluster. On Tue, Nov 29, ...
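For reference, a sketch of the kind of Whirr recipe that workflow uses (the cluster name, instance counts, and use of environment variables are hypothetical; the roles match Whirr's stock Hadoop recipe):

    whirr.cluster-name=my-hadoop-cluster
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,4 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

Then launch and tear down with:

    whirr launch-cluster --config hadoop.properties
    whirr destroy-cluster --config hadoop.properties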

Re: Matrix multiplication in Hadoop

2011-11-18 Thread John Conwell
I'm not sure, but I would suspect that Mahout has some low-level map/reduce jobs for this. You might start there. On Fri, Nov 18, 2011 at 8:59 AM, Mike Spreitzer wrote: > Who is doing multiplication of large dense matrices using Hadoop? What is > a good way to do that computation using Hadoop?

Re: How to iterate over a hdfs folder with hadoop

2011-10-10 Thread John Conwell
FileStatus[] files = fs.listStatus(new Path(path));
for (FileStatus fileStatus : files) {
  // ...do stuff here
}
On Mon, Oct 10, 2011 at 8:03 AM, Raimon Bosch wrote: > Hi, > > I'm wondering how I can browse an hdfs folder using the classes > in the org.apache.hadoop.fs package. The operation that I...
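A self-contained version of that snippet, with the setup it assumes (the directory path is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHdfsDir {
      public static void main(String[] args) throws Exception {
        // Picks up fs.default.name from the cluster configuration
        FileSystem fs = FileSystem.get(new Configuration());
        // listStatus returns one FileStatus per entry in the directory
        for (FileStatus status : fs.listStatus(new Path("/user/data"))) {
          System.out.println(status.getPath() + (status.isDir() ? " (dir)" : ""));
        }
      }
    }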

What should be in the hosts file on a hadoop cluster?

2011-10-07 Thread John Conwell
In troubleshooting some issues on our Hadoop cluster on EC2, I keep getting pointed back to properly configuring the /etc/hosts file. But the problem is I've found about 5 different conflicting articles about how to configure the hosts file. So I'm hoping to get a definitive answer on how the hosts file should be set up...
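For illustration only, since the thread doesn't settle on one answer: the advice that recurs across those articles is that each node's own hostname must not resolve to 127.0.0.1, or daemons bind to loopback and other nodes can't reach them. The IPs and hostnames below are hypothetical:

    127.0.0.1   localhost
    # One line per node, mapping its private IP to its FQDN and short name.
    # The machine's own hostname must NOT appear on the 127.0.0.1 line.
    10.0.0.1    master.cluster.internal   master
    10.0.0.2    slave1.cluster.internal   slave1
    10.0.0.3    slave2.cluster.internal   slave2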

Re: Running multiple MR Job's in sequence

2011-09-29 Thread John Conwell
After you kick off a job, say JobA, your client doesn't need to sit and ping Hadoop to see if it finished before it starts JobB. You can have the client block until the job is complete with "Job.waitForCompletion(boolean verbose)". Using this you can create a "job driver" that chains jobs together.
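A minimal job-driver sketch of that chaining (mapper/reducer setup elided; the paths and job names are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JobDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job jobA = new Job(conf, "JobA");
        // ...set mapper/reducer/formats for JobA here...
        FileInputFormat.addInputPath(jobA, new Path("/data/input"));
        FileOutputFormat.setOutputPath(jobA, new Path("/data/intermediate"));
        if (!jobA.waitForCompletion(true)) { // blocks until JobA finishes
          System.exit(1);                    // bail out if JobA failed
        }

        Job jobB = new Job(conf, "JobB");
        // JobB consumes JobA's output directory
        FileInputFormat.addInputPath(jobB, new Path("/data/intermediate"));
        FileOutputFormat.setOutputPath(jobB, new Path("/data/output"));
        System.exit(jobB.waitForCompletion(true) ? 0 : 1);
      }
    }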

Re: Hadoop on Ec2

2011-09-07 Thread John Conwell
On Wed, Sep 7, 2011 at 10:03 AM, Shahnawaz Saifi wrote: > Thanks a lot, I will definitely try this. But there are so many blogs about > configuring hadoop/hbase and bundling images to an S3 bucket. Is Whirr faster > or smoother than this approach? > > regards, > Shah > > On Wed, Sep 7, 2011...

Re: Hadoop on Ec2

2011-09-07 Thread John Conwell
I second that. Whirr is an invaluable resource for automagically spinning up resources on EC2. On Wed, Sep 7, 2011 at 4:28 AM, Harsh J wrote: > You are looking for the Apache Whirr project: http://whirr.apache.org/ > > Here's a great article at Phil Whelan's site that covers getting HBase > up in...

Re: How to get JobID in code?

2011-09-06 Thread John Conwell
You can override the configure method in your mapper or reducer, and call JobConf.get("xpatterns.hadoop.content.job.id"). This will return the UUID for the job id. On Mon, Sep 5, 2011 at 10:02 PM, Meghana wrote: > Hi, > > We configure a org.apache.hadoop.mapreduce.Job, and then call job.submit...
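Worth noting: "xpatterns.hadoop.content.job.id" looks like a property set by that particular application rather than stock Hadoop. With the new (org.apache.hadoop.mapreduce) API the question uses, the job id is available straight from the task context; a sketch:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class JobIdMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void setup(Context context) {
        // The framework assigns the JobID; no custom property needed
        String jobId = context.getJobID().toString();
        System.err.println("Running inside job " + jobId);
      }
    }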

Lost task tracker and Could not obtain block errors

2011-07-29 Thread John Conwell
I'm running into a wall with one of my map reduce jobs (actually it's 7 jobs chained together). I get to the 5th MR job, which takes as input the output from the 3rd MR job, and right off the bat I start getting "Lost task tracker" and "Could not obtain block..." errors. Eventually I get enough...

Job fails with Could not obtain block errors

2011-07-13 Thread John Conwell
I have an MR job that repeatedly fails during a join operation in the Mapper, with the error "java.io.IOException: Could not obtain block". I'm running on EC2, on a 12-node cluster provisioned by Whirr. Oddly enough, on a 5-node cluster the MR job runs through without any problems. The repeated...