On May 19, 2009, at 7:11 AM, Grant Ingersoll wrote:
On May 19, 2009, at 6:59 AM, Tim Bass wrote:
Dear All,
A few months ago (on the developers' list) we briefly touched on the
idea of building a public Mahout AMI on EC2.
Subsequently, Amazon released EMR and a number of folks have
experimented with running sample Mahout jobs on EMR.
What are the pros and cons of creating a public Mahout AMI with
Hadoop and MapReduce configured with the versions that are supported
by the developers, in addition to Amazon's EMR implementation?
AFAICT, one issue seems to be that EMR locks you into a specific
Hadoop version. Not sure if "locks" is too strong; maybe I should
say it "encourages" you to use a specific version?
Actually, I think "locks" is more appropriate. They're using Hadoop
0.18.3 with some feature backports (according to what they said to
me), so if you want features from a newer Hadoop (isn't 0.20 the
current release? It looked like it had a lot of new stuff), you're
pretty much done for.
Also, they charge extra for EMR jobs, which strikes me as a bit crazy
(see Greg Linden's comments about variable pricing), and may strike
some folks as a reason to run their own clusters.
As Ted and others pointed out, I think we would benefit from tools
that make it easy to add Mahout to an AMI.
Perhaps you could base it on one of the Cloudera Hadoop AMIs?
They're publicly available, and they handle all the Hadoop
business. I have no idea what the redistribution license would be,
and I am most definitely not a lawyer!
Steve
--
Stephen Green // [email protected]
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692