DistributedCache.addArchiveToClassPath doesn't seem to work

2013-12-17 Thread John Conwell
I've got a tar.gz file that has many 3rd party jars in it that my MR job requires. This tar.gz file is located on hdfs. When configuring my MR job, I call DistributedCache.addArchiveToClassPath(), passing in the hdfs path to the tar.gz file. When the Mapper executes, I get a
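
For context, a minimal sketch of the driver-side call being described, assuming the archive has already been uploaded to HDFS (the path and class name are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    public class DriverSetup {
        public static void addThirdPartyJars(Configuration conf) throws java.io.IOException {
            // The archive must already live on HDFS; its entries are meant to be
            // added to the task-side classpath.
            DistributedCache.addArchiveToClassPath(new Path("/libs/thirdparty.tar.gz"), conf);
        }
    }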

Map/Reduce/Driver jar(s) organization

2013-11-25 Thread John Conwell
I'm curious: what are some best practices for structuring jars for a business framework that uses Map/Reduce? Note: This is assuming you aren't invoking MR manually via the cmd line, but have Hadoop integrated into a larger business framework that invokes MR jobs programmatically. By business

Re: What else can be built on top of YARN.

2013-05-29 Thread John Conwell
Two scenarios I can think of are re-implementations of Twitter's Storm (http://storm-project.net/) and DryadLinq (http://research.microsoft.com/en-us/projects/dryadlinq/). Storm, a distributed realtime computation framework used for analyzing realtime streams of data, doesn't really need to be

Re: unsubscribe

2013-03-20 Thread John Conwell
Totally off topic, but kind'a not. Why the hell are we still using something our ancestors used? I didn't even know listservs were still in existence until I started using Apache open source software. I was like, listservs...really? On Wed, Mar 20, 2013 at 9:23 AM, Fabio Pitzolu

Re:

2013-03-20 Thread John Conwell
From: turboc...@gmail.com [turboc...@gmail.com] on behalf of John Conwell [j...@iamjohn.me] Sent: Wednesday, March 20, 2013 12:31 PM To: user@hadoop.apache.org Subject: Re: unsubscribe Totally off topic, but kind'a not. Why the hell are we still using

Re: Unsubscribe

2013-03-19 Thread John Conwell
No! On Tue, Mar 19, 2013 at 3:23 PM, Bruce Perttunen bruceperttu...@gmail.com wrote: Unsubscribe -- Thanks, John C

Re: Best Practice: How to start and shutdown a complete cluster or adding nodes when needed (Automated with Java API or Rest) (On EC2)

2013-03-04 Thread John Conwell
It depends on a couple of factors. First, are you developing a product where customers will need the freedom to choose which cloud provider to use, or something in-house where you can standardize on one cloud provider (like AWS)? And second, do you only need to spin up Hadoop resources? Or do you

Re: AWS MapReduce

2012-03-05 Thread John Conwell
AWS MapReduce (EMR) does not use S3 for its HDFS persistence. If it did, your S3 billing would be massive :) EMR reads all input jar files and input data from S3, but it copies these files down to its local disk. It then starts the MR process, doing all HDFS reads and writes to the local

Re: HADOOP PIPES with CUDA

2012-02-13 Thread John Conwell
Do you mean porting existing CUDA code away from CUDA to some language like Python using Pipes? Or creating a solution that uses Pipes to chain mappers/reducers together, where the mappers and/or reducers invoke CUDA kernels? Or something else entirely? You could do something like the

Re: Sorting text data

2012-01-30 Thread John Conwell
If you use TextInputFormat as your mapreduce job's input format, then Hadoop doesn't need your input data to be in a sequence file. It will read your text file and call the mapper for each line in the text file (\n delimited), where the key value is the byte offset of that line from the
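
A minimal sketch of that contract, with a hypothetical class name: the mapper receives the line's byte offset as its key and the line text as its value.

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // TextInputFormat (the default input format) feeds the mapper one line per call.
    public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws java.io.IOException, InterruptedException {
            // offset = byte position of this line within the file; line = the raw text.
            context.write(line, offset);
        }
    }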

Re: Running a job continuously

2011-12-05 Thread John Conwell
You might also want to take a look at Storm, as that's what it's designed to do: https://github.com/nathanmarz/storm/wiki On Mon, Dec 5, 2011 at 1:34 PM, Mike Spreitzer mspre...@us.ibm.com wrote: Burak, Before we can really answer your question, you need to give us some more information on the

Re: choices for deploying a small hadoop cluster on EC2

2011-11-29 Thread John Conwell
I'm a big fan of Whirr, though I don't think it supports EBS persistence. My Hadoop deployment strategy has always been: store input data on S3, spin up my Hadoop cluster with either Whirr or Elastic MapReduce, run the job, store output data on S3, and kill the cluster. On Tue, Nov 29,
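
For reference, a Whirr recipe for this kind of transient cluster is just a properties file plus two CLI calls; a rough sketch with hypothetical names and sizes:

    # hadoop.properties (values are illustrative)
    whirr.cluster-name=transient-hadoop
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

    # spin up, run the job, tear down
    whirr launch-cluster --config hadoop.properties
    whirr destroy-cluster --config hadoop.properties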

Re: Matrix multiplication in Hadoop

2011-11-18 Thread John Conwell
I'm not sure, but I would suspect that Mahout has some low level map/reduce jobs for this. You might start there. On Fri, Nov 18, 2011 at 8:59 AM, Mike Spreitzer mspre...@us.ibm.com wrote: Who is doing multiplication of large dense matrices using Hadoop? What is a good way to do that

Re: How to iterate over a hdfs folder with hadoop

2011-10-10 Thread John Conwell
FileStatus[] files = fs.listStatus(new Path(path)); for (FileStatus fileStatus : files) { //...do stuff here } On Mon, Oct 10, 2011 at 8:03 AM, Raimon Bosch raimon.bo...@gmail.com wrote: Hi, I'm wondering how I can browse an hdfs folder using the classes in the org.apache.hadoop.fs package.
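
Fleshed out into a self-contained sketch (the directory path is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListFolder {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // listStatus returns the direct children of the directory.
            FileStatus[] files = fs.listStatus(new Path("/user/data/input"));
            for (FileStatus fileStatus : files) {
                System.out.println(fileStatus.getPath() + " (" + fileStatus.getLen() + " bytes)");
            }
        }
    }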

What should be in the hosts file on a hadoop cluster?

2011-10-07 Thread John Conwell
In troubleshooting some issues on our Hadoop cluster on EC2, I keep getting pointed back to properly configuring the /etc/hosts file. But the problem is I've found about 5 different conflicting articles about how to configure the hosts file. So I'm hoping to get a definitive answer to how the
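
For illustration only, one commonly recommended convention: every node carries the same mappings of private IPs to fully qualified names, and a machine's own hostname is never bound to 127.0.0.1 (all names and addresses below are hypothetical):

    127.0.0.1   localhost
    10.0.0.10   master.cluster.internal   master
    10.0.0.11   slave1.cluster.internal   slave1
    10.0.0.12   slave2.cluster.internal   slave2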

Re: Running multiple MR Job's in sequence

2011-09-29 Thread John Conwell
: turboc...@gmail.com [mailto:turboc...@gmail.com] On Behalf Of John Conwell Sent: Thursday, September 29, 2011 10:50 AM To: common-user@hadoop.apache.org Subject: Re: Running multiple MR Job's in sequence After you kick off a job, say JobA, your client doesn't need to sit and ping Hadoop to see
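
One way to avoid hand-polling when chaining jobs is the org.apache.hadoop.mapreduce.lib.jobcontrol package; a minimal sketch assuming two already-configured Job objects, jobA and jobB:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class JobChain {
        public static void runChain(Job jobA, Job jobB) throws Exception {
            ControlledJob a = new ControlledJob(jobA, null);
            ControlledJob b = new ControlledJob(jobB, null);
            b.addDependingJob(a);            // jobB will not start until jobA succeeds

            JobControl control = new JobControl("chain");
            control.addJob(a);
            control.addJob(b);

            Thread runner = new Thread(control);  // JobControl is a Runnable
            runner.start();
            while (!control.allFinished()) {
                Thread.sleep(1000);
            }
            control.stop();
        }
    }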

Re: Hadoop on Ec2

2011-09-07 Thread John Conwell
I second that. Whirr is an invaluable resource for automagically spinning up resources on EC2 On Wed, Sep 7, 2011 at 4:28 AM, Harsh J ha...@cloudera.com wrote: You are looking for the Apache Whirr project: http://whirr.apache.org/ Here's a great article at Phil Whelan's site that covers

Re: Hadoop on Ec2

2011-09-07 Thread John Conwell
, 2011 at 10:03 AM, Shahnawaz Saifi shahsa...@gmail.com wrote: Thanks a lot, I will definitely try this. But there are so many blogs about configuring hadoop/hbase and bundling images to an S3 bucket. Is Whirr faster or smoother than this approach? regards, Shah On Wed, Sep 7, 2011 at 8:28 PM, John

Re: How to get JobID in code?

2011-09-06 Thread John Conwell
You can override the configure method in your mapper or reducer and call JobConf.get("xpatterns.hadoop.content.job.id"). This will return the UUID for the job id. On Mon, Sep 5, 2011 at 10:02 PM, Meghana meghana.mara...@germinait.com wrote: Hi, We configure a org.apache.hadoop.mapreduce.Job,
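
Note that the property key in that reply looks specific to the author's own framework; with the org.apache.hadoop.mapreduce API named in the question, the id is available straight off the context. A minimal sketch (class name hypothetical):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class IdAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String jobId;

        @Override
        protected void setup(Context context) {
            // JobContext.getJobID() returns the framework-assigned id, e.g. job_201109061022_0004.
            jobId = context.getJobID().toString();
        }
    }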

Lost task tracker and Could not obtain block errors

2011-07-29 Thread John Conwell
I'm running into a wall with one of my map reduce jobs (actually it's 7 jobs chained together). I get to the 5th MR job, which takes as input the output from the 3rd MR job, and right off the bat I start getting "Lost task tracker" and "Could not obtain block..." errors. Eventually I get enough of

Job fails with Could not obtain block errors

2011-07-13 Thread John Conwell
I have a MR job that repeatedly fails during a join operation in the Mapper, with the error java.io.IOException: Could not obtain block. I'm running on EC2, on a 12-node cluster provisioned by Whirr. Oddly enough, on a 5-node cluster the MR job runs through without any problems. The repeated