I've got a tar.gz file that has many 3rd party jars in it that my MR job
requires. This tar.gz file is located on HDFS. When configuring my MR
job, I call DistributedCache.addArchiveToClassPath(), passing in the HDFS
path to the tar.gz file. When the Mapper executes I get a
I'm curious what some best practices are for structuring jars for a
business framework that uses Map/Reduce. Note: this is assuming you aren't
invoking MR manually via the cmd line, but have Hadoop integrated into a
larger business framework that invokes MR jobs programmatically.
By business
Two scenarios I can think of are re-implementations of Twitter's Storm (
http://storm-project.net/) and DryadLinq (
http://research.microsoft.com/en-us/projects/dryadlinq/).
Storm, a distributed realtime computation framework used for analyzing
realtime streams of data, doesn't really need to be
Totally off topic, but kind'a not. Why the hell are we still using
something our ancestors used? I didn't even know listservs were still
in existence until I started using Apache open source software. I was
like, listservs...really?
On Wed, Mar 20, 2013 at 9:23 AM, Fabio Pitzolu
From: turboc...@gmail.com [turboc...@gmail.com] on behalf of John
Conwell [j...@iamjohn.me]
Sent: Wednesday, March 20, 2013 12:31 PM
To: user@hadoop.apache.org
Subject: Re: unsubscribe
No!
On Tue, Mar 19, 2013 at 3:23 PM, Bruce Perttunen
bruceperttu...@gmail.com wrote:
Unsubscribe
--
Thanks,
John C
It depends on a couple of factors. First, are you developing a product where
customers will need the freedom to choose which cloud provider to use, or
something in-house where you can standardize on one cloud provider (like
AWS)? And second, do you only need to spin up Hadoop resources? Or do you
AWS MapReduce (EMR) does not use S3 for its HDFS persistence. If it did,
your S3 billing would be massive :) EMR reads all input jar files and
input data from S3, but it copies these files down to its local disk. It
then starts the MR process, doing all HDFS reads and writes to the local
disks.
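To make that concrete, a job driver in that setup points its input and final
output at S3 while everything in between stays on the cluster; a rough sketch
(bucket names made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EmrStyleDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "emr-style-job");
        // Input comes from S3 and the final output goes back to S3; shuffle
        // and intermediate data stay on the cluster's local disks, so S3 is
        // only touched at the start and the end of the job.
        FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/input/"));
        FileOutputFormat.setOutputPath(job, new Path("s3n://my-bucket/output/"));
        // ...set mapper/reducer classes here and submit...
    }
}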
Do you mean porting existing CUDA code away from CUDA to some language
like Python using pipes? Or creating a solution that uses pipes to chain
mappers / reducers together, where the mappers and/or reducers invoke
CUDA kernels? Or something else entirely?
You could do something like the
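For the second case, one rough sketch (the binary name and mapper are
entirely hypothetical) is a plain Java mapper that shells out to a native
CUDA executable for each record:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CudaInvokingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // "cuda_kernel" is a stand-in for a native binary shipped to each
        // node (e.g. via the distributed cache) that runs the GPU work for
        // one input record and prints its results to stdout.
        ProcessBuilder pb = new ProcessBuilder("./cuda_kernel", record.toString());
        pb.redirectErrorStream(true);
        Process p = pb.start();
        BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = out.readLine()) != null) {
            context.write(new Text(record.toString()), new Text(line));
        }
        p.waitFor();
    }
}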
If you use TextInputFormat as your MapReduce job's input format, then
Hadoop doesn't need your input data to be in a sequence file. It will read
your text file and call the mapper for each line in the text file (\n
delimited), where the key is the byte offset of that line from the start of
the file and the value is the text of the line.
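Concretely, the mapper just receives (byte offset, line) pairs, along these
lines:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // 'offset' is the byte position of this line within the input file;
        // 'line' is the line's text with the line terminator stripped.
        context.write(new Text(line), new LongWritable(1));
    }
}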
You might also want to take a look at Storm, as that's what it's designed to
do: https://github.com/nathanmarz/storm/wiki
On Mon, Dec 5, 2011 at 1:34 PM, Mike Spreitzer mspre...@us.ibm.com wrote:
Burak,
Before we can really answer your question, you need to give us some more
information on the
I'm a big fan of Whirr, though I don't think it supports EBS persistence. My
Hadoop deployment strategy has always been: store the input data on S3, spin
up my Hadoop cluster with either Whirr or Elastic MapReduce, run the job,
store the output data on S3, and kill the cluster.
On Tue, Nov 29,
I'm not sure, but I would suspect that Mahout has some low level map/reduce
jobs for this. You might start there.
On Fri, Nov 18, 2011 at 8:59 AM, Mike Spreitzer mspre...@us.ibm.com wrote:
Who is doing multiplication of large dense matrices using Hadoop? What is
a good way to do that
// List the contents of an HDFS directory using org.apache.hadoop.fs classes
FileStatus[] files = fs.listStatus(new Path(path));
for (FileStatus fileStatus : files)
{
    // ...do stuff here, e.g. fileStatus.getPath() or fileStatus.isDir()
}
On Mon, Oct 10, 2011 at 8:03 AM, Raimon Bosch raimon.bo...@gmail.com wrote:
Hi,
I'm wondering how I can browse an HDFS folder using the classes
in the org.apache.hadoop.fs package.
In troubleshooting some issues on our Hadoop cluster on EC2, I keep getting
pointed back to properly configuring the /etc/hosts file. But the problem
is I've found about 5 different conflicting articles about how to configure
the hosts file. So I'm hoping to get a definitive answer to how the
: turboc...@gmail.com [mailto:turboc...@gmail.com] On Behalf Of John
Conwell
Sent: Thursday, September 29, 2011 10:50 AM
To: common-user@hadoop.apache.org
Subject: Re: Running multiple MR Job's in sequence
After you kick off a job, say JobA, your client doesn't need to sit and
ping Hadoop to see
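One common pattern for running JobA and then JobB, assuming it's fine for
the driver to block on each job in turn (job setup elided, names made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SequentialDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job jobA = new Job(conf, "JobA");
        // ...set JobA's input/output paths, mapper and reducer here...
        if (!jobA.waitForCompletion(true)) {   // blocks until JobA finishes
            System.exit(1);                    // stop the chain if JobA failed
        }

        Job jobB = new Job(conf, "JobB");
        // ...JobB typically reads the output path JobA just wrote...
        System.exit(jobB.waitForCompletion(true) ? 0 : 1);
    }
}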
I second that. Whirr is an invaluable resource for automagically spinning
up resources on EC2
On Wed, Sep 7, 2011 at 4:28 AM, Harsh J ha...@cloudera.com wrote:
You are looking for the Apache Whirr project: http://whirr.apache.org/
Here's a great article at Phil Whelan's site that covers
, 2011 at 10:03 AM, Shahnawaz Saifi shahsa...@gmail.com wrote:
Thanks a lot, I will definitely try this. But there are so many blogs about
configuring Hadoop/HBase and bundling images to an S3 bucket. Is Whirr faster
or smoother than that approach?
regards,
Shah
On Wed, Sep 7, 2011 at 8:28 PM, John
You can override the configure method in your mapper or reducer, and call
JobConf.get("xpatterns.hadoop.content.job.id").
This will return the UUID for the job id.
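In the old mapred API that looks roughly like this; only the property key
comes from this thread, the rest is a placeholder sketch:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class MyMapper extends MapReduceBase /* implements Mapper<...> */ {
    private String jobUuid;

    @Override
    public void configure(JobConf conf) {
        // Reads back the value the driver stored in the job configuration.
        jobUuid = conf.get("xpatterns.hadoop.content.job.id");
    }
}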
On Mon, Sep 5, 2011 at 10:02 PM, Meghana meghana.mara...@germinait.com wrote:
Hi,
We configure an org.apache.hadoop.mapreduce.Job,
I'm running into a wall with one of my MapReduce jobs (actually it's 7
jobs chained together). I get to the 5th MR job, which takes as input the
output from the 3rd MR job, and right off the bat I start getting "Lost task
tracker" and "Could not obtain block..." errors. Eventually I get enough of
I have an MR job that repeatedly fails during a join operation in the Mapper,
with the error "java.io.IOException: Could not obtain block". I'm running
on EC2, on a 12 node cluster, provisioned by whirr. Oddly enough on a 5
node cluster the MR job runs through without any problems.
The repeated