2010/1/18 Grant Ingersoll <[email protected]>:
> OK, thanks for all the advice. I'm wondering if this makes sense:
>
> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout
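For what it's worth, on an Ubuntu-based image those five steps roughly
boil down to something like the sketch below. The package names,
Cloudera repo line and SVN URL are taken from my own Karmic boxes and
are untested as an actual AMI bootstrap script, so treat it as a rough
starting point rather than a recipe:

  # rough AMI bootstrap sketch (Ubuntu, run as root); package names,
  # repo line and key URL are approximate, adjust to the target distro
  apt-get update
  apt-get install -y sun-java6-jdk maven2 subversion curl
  # Cloudera CDH2 repo for the hadoop-0.20 packages
  echo "deb http://archive.cloudera.com/debian intrepid-testing contrib" \
    > /etc/apt/sources.list.d/cloudera.list
  curl -s http://archive.cloudera.com/debian/archive.key | apt-key add -
  apt-get update
  apt-get install -y hadoop-0.20 hadoop-0.20-conf-pseudo
  # check out and build Mahout trunk
  svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk mahout-trunk
  cd mahout-trunk && mvn -DskipTests install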
I am currently running CDH2 with Hadoop version 0.20.1+152-1~j (using
Cloudera's intrepid-testing apt repo on a regular Ubuntu Karmic distro)
on my 2 dev boxes (one is a 32-bit dual-core and the other a 64-bit
quad-core) in conf-pseudo (single-node cluster). I could successfully
run mahout-0.3-SNAPSHOT jobs (including those built against
hadoop-0.20.2-SNAPSHOT). I guess this would run exactly the same on a
real EC2 cluster set up with http://archive.cloudera.com/docs/ec2.html .

> I want to be able to run the trunk version of Mahout with little upgrade
> pain, both on an individual node and in a cluster.
>
> Is this the shortest path? I don't have much experience w/ creating AMIs,
> but I want my work to be reusable by the community (remember, committers
> can get credits from Amazon for testing Mahout)
>
> After that, I want to convert some of the public datasets to vector format
> and run some performance benchmarks.

I think we should host sample datasets that are known to be vectorizable
with the Mahout utilities, either on S3 (using s3:// and not s3n:// when
individual files are larger than 5GB) or on a dedicated EBS volume with a
public snapshot. A rough distcp sketch for the S3 option is below the
signature.

--
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
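PS: pushing a vectorized dataset from HDFS to the s3:// block store
with distcp could look something like this; the bucket name and the
HDFS path are made up for illustration, and the AWS credentials can
also go in core-site.xml as fs.s3.awsAccessKeyId /
fs.s3.awsSecretAccessKey instead of being embedded in the URI:

  # copy a vectorized dataset from HDFS to an S3 bucket (block filesystem)
  # ID/SECRET are AWS credentials; mahout-public-datasets is a made-up bucket
  hadoop distcp \
    /user/hadoop/reuters-vectors \
    s3://ID:SECRET@mahout-public-datasets/reuters-vectors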
