AWS Elastic MapReduce (EMR) does not use S3 for its HDFS persistence. If it
did, your S3 billing would be massive :) EMR reads all input jar files and
input data from S3, but it copies these files down to its local disk. It
then starts the MR process, doing all HDFS reads and writes to the
local disk.
Do you mean porting existing cuda code away from Cuda to just some language
like python using pipes? Or creating a solution that uses pipes to chain
mappers / reducers together, where the mappers and/or reducers invoke
Cuda kernels? Or something else entirely?
You could do something like the sec
If you use TextInputFormat as your mapreduce job's input format, then
Hadoop doesn't need your input data to be in a sequence file. It will read
your text file, and call the mapper for each line in the text file (\n
delimited), where the key value is the byte offset of that line from the
beginning of the file.
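A minimal plain-Java illustration (no Hadoop required) of how those byte-offset keys are derived; the class and method names here are made up for the example, but the offsets match what TextInputFormat hands a mapper as its LongWritable key:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class OffsetDemo {
    // Mimics TextInputFormat's (LongWritable, Text) pairs for
    // \n-delimited input: key = byte offset of the line's first byte.
    static Map<Long, String> offsets(byte[] bytes) {
        Map<Long, String> result = new LinkedHashMap<>();
        int lineStart = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n') {
                result.put((long) lineStart,
                    new String(bytes, lineStart, i - lineStart, StandardCharsets.UTF_8));
                lineStart = i + 1;
            }
        }
        if (lineStart < bytes.length) {  // final line without a trailing \n
            result.put((long) lineStart,
                new String(bytes, lineStart, bytes.length - lineStart, StandardCharsets.UTF_8));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] input = "foo\nbar\nbaz".getBytes(StandardCharsets.UTF_8);
        // prints {0=foo, 4=bar, 8=baz}
        System.out.println(offsets(input));
    }
}
```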
You might also want to take a look at Storm, as that's what it's designed to
do: https://github.com/nathanmarz/storm/wiki
On Mon, Dec 5, 2011 at 1:34 PM, Mike Spreitzer wrote:
> Burak,
> Before we can really answer your question, you need to give us some more
> information on the processing you want
I'm a big fan of Whirr, though I don't think it supports EBS persistence. My
hadoop deployment strategy has always been: store input data on S3, spin up
my hadoop cluster with either whirr or Elastic MapReduce, run the job, store
output data on S3, and kill the cluster.
On Tue, Nov 29,
I'm not sure, but I would suspect that Mahout has some low level map/reduce
jobs for this. You might start there.
On Fri, Nov 18, 2011 at 8:59 AM, Mike Spreitzer wrote:
> Who is doing multiplication of large dense matrices using Hadoop? What is
> a good way to do that computation using Hadoop
// assumes fs is an org.apache.hadoop.fs.FileSystem
FileStatus[] files = fs.listStatus(new Path(path));
for (FileStatus fileStatus : files)
{
//...do stuff here
}
On Mon, Oct 10, 2011 at 8:03 AM, Raimon Bosch wrote:
> Hi,
>
> I'm wondering how can I browse an hdfs folder using the classes
> in org.apache.hadoop.fs package. The operation that I
In troubleshooting some issues on our hadoop cluster on EC2, I keep getting
pointed back to properly configuring the /etc/hosts file. But the problem
is I've found about 5 different conflicting articles about how to config the
hosts file. So I'm hoping to get a definitive answer on how the hosts
file should be configured.
> From: turboc...@gmail.com [mailto:turboc...@gmail.com] On Behalf Of John
> Conwell
> Sent: Thursday, September 29, 2011 10:50 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Running multiple MR Job's in sequence
>
> After you kick off a job, say JobA, your client d
After you kick off a job, say JobA, your client doesn't need to sit and ping
Hadoop to see if it finished before it starts JobB. You can have the client
block until the job is complete with "Job.waitForCompletion(boolean
verbose)". Using this you can create a "job driver" that chains jobs
together.
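A minimal sketch of that job-driver pattern. This assumes the new (org.apache.hadoop.mapreduce) API; MapperA, ReducerA, MapperB, ReducerB and the paths are hypothetical placeholders for your own classes and data:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job jobA = Job.getInstance(conf, "JobA");
        jobA.setJarByClass(JobDriver.class);
        jobA.setMapperClass(MapperA.class);
        jobA.setReducerClass(ReducerA.class);
        FileInputFormat.addInputPath(jobA, new Path("/input"));
        FileOutputFormat.setOutputPath(jobA, new Path("/tmp/jobA-out"));

        // Blocks until JobA finishes; returns false if it failed.
        if (!jobA.waitForCompletion(true)) {
            System.exit(1);
        }

        // JobB consumes JobA's output, so it only starts after JobA succeeds.
        Job jobB = Job.getInstance(conf, "JobB");
        jobB.setJarByClass(JobDriver.class);
        jobB.setMapperClass(MapperB.class);
        jobB.setReducerClass(ReducerB.class);
        FileInputFormat.addInputPath(jobB, new Path("/tmp/jobA-out"));
        FileOutputFormat.setOutputPath(jobB, new Path("/output"));
        System.exit(jobB.waitForCompletion(true) ? 0 : 1);
    }
}
```

Since the driver blocks on each waitForCompletion call, the chain runs strictly in sequence without any polling loop on the client side.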
, 2011 at 10:03 AM, Shahnawaz Saifi wrote:
> Thanks a lot, I will definitely try this. But there are so many blogs about
> configuring hadoop/hbase and bundling images to s3 bucket. Whirr is faster
> or smoother than this concept?
>
> regards,
> Shah
>
> On Wed, Sep 7, 2011
I second that. Whirr is an invaluable resource for automagically spinning
up resources on EC2.
On Wed, Sep 7, 2011 at 4:28 AM, Harsh J wrote:
> You are looking for the Apache Whirr project: http://whirr.apache.org/
>
> Here's a great article at Phil Whelan's site that covers getting HBase
> up i
You can override the configure method in your mapper or reducer, and call
JobConf.get("xpatterns.hadoop.content.job.id")
This will return the UUID for the job id.
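A sketch of what that override looks like, assuming the old (org.apache.hadoop.mapred) API that the configure(JobConf) hook belongs to. The property name comes from this thread; the mapper class and its key/value types are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String jobId;

    @Override
    public void configure(JobConf conf) {
        // Pull the job id UUID that was stored in the job configuration.
        jobId = conf.get("xpatterns.hadoop.content.job.id");
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // jobId is now available to every map() call on this task.
        output.collect(new Text(jobId), value);
    }
}
```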
On Mon, Sep 5, 2011 at 10:02 PM, Meghana wrote:
> Hi,
>
> We configure a org.apache.hadoop.mapreduce.Job, and then call job.submit
I'm running into a wall with one of my map reduce jobs (actually it's 7
jobs, chained together). I get to the 5th MR job, which takes as input the
output from the 3rd MR job, and right off the bat I start getting "Lost task
tracker" and "Could not obtain block..." errors. Eventually I get enough
I have a MR job that repeatedly fails during a join operation in the Mapper,
with the errors "java.io.IOException: Could not obtain block". I'm running
on EC2, on a 12-node cluster provisioned by whirr. Oddly enough, on a
5-node cluster the MR job runs through without any problems.
The repeated