I've been using EMR for the public terabyte dataset project.

In general it's worked for me, with the following caveats:

1. EMR runs Hadoop 0.18.3, which meant I had to rework some of my code that depended on newer (Hadoop 0.19.x) support.

2. It was kind of painful to get it running initially (setting up the right credentials.json file, etc.).

3. You'll need S3 access, of course, which is another series of hoops to jump through.

4. You really want to run in the mode where you create an EMR job with no steps, then add steps to run - otherwise you can waste a lot of time firing up EMR jobs that fail immediately.

5. For bigger clusters, some of the Hadoop configuration parameters aren't set very well.
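
As a rough illustration of point 4, here is a minimal sketch of the "create with no steps, then add steps" pattern. It only builds the request payloads as plain dicts. The field names follow the present-day boto3 EMR client (run_job_flow / add_job_flow_steps), which is an assumption and not the tooling used in this thread; the helper names, bucket, jar path, and instance type are hypothetical, and a real request needs further settings (roles, release label, etc.) that are left out.

```python
# Sketch only: payloads for a keep-alive EMR job flow plus a later step.
# Field names follow boto3's EMR client; all concrete values are hypothetical.

def keep_alive_job_flow_request(name, instance_count, instance_type):
    """Request for a job flow that starts with no steps and stays up."""
    return {
        "Name": name,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running while it has no steps to execute.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "Steps": [],  # deliberately empty; steps are added afterwards
    }


def jar_step(name, jar, args):
    """A single Hadoop jar step to add to an already running job flow."""
    return {
        "Name": name,
        # On failure, cancel pending steps and wait instead of terminating,
        # so a fixed step can be resubmitted to the same cluster.
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {"Jar": jar, "Args": list(args)},
    }


request = keep_alive_job_flow_request("example-flow", 4, "m1.small")
step = jar_step("example-step", "s3://example-bucket/job.jar", ["in", "out"])

# With boto3 installed and credentials configured, the calls would look like:
#   import boto3
#   emr = boto3.client("emr")
#   flow_id = emr.run_job_flow(**request)["JobFlowId"]
#   emr.add_job_flow_steps(JobFlowId=flow_id, Steps=[step])
```

The point of the split is that a step which fails right away costs only the step, not another cluster startup.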

-- Ken

On Jan 10, 2010, at 4:21pm, Benson Margulies wrote:

That's what I meant. I haven't tried it yet, so I've got the same
question Jake has.

On Sun, Jan 10, 2010 at 6:27 PM, Jake Mannix <[email protected]> wrote:
You mean Elastic MapReduce (EMR)? Has anyone here had any luck with that
for this or other projects?

 -jake

On Jan 10, 2010 3:21 PM, "Benson Margulies" <[email protected]> wrote:

Stupid question: I thought there was a way to use the cloud as a
hadoop farm directly without having to configure instances.

On Sun, Jan 10, 2010 at 6:18 PM, Sean Owen <[email protected]> wrote:
> I like the Alestic instances...


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



