Yes, I've already set up a recommender service using EC2. I've copied
my in-progress documentation for it, which explains how it works,
below.

As I think Ted said before, and I agree, simply providing an image
with the libraries installed doesn't add value. A how-to is fine, but
is it much more than the concatenation of "how to get a machine
running on EC2" and "how to run Mahout stuff on a machine", both of
which already exist?

What I think is useful to offer on EC2 (well, at least the most
useful thing to offer) are AMIs that act almost like a big RPC: you
put data in a location, fire up the AMI, it crunches as fast as
possible, saves output, and quits. That's what the AMI I put together
does, and it works quite nicely.

(I agree that it's not a bad idea to still pay attention to the
single-machine case and not just Hadoop. Hadoop is a lot of overhead
but necessary at a certain scale. Below that scale, if you can fit on
one machine, it's obviously a lot quicker. EC2 does offer pretty big
machines...)

Anyway, food for thought on this topic...


-----------
An AMI which employs Apache Mahout's 'Taste' collaborative filtering
engine (of which I am a developer) to efficiently generate
recommendations based on user preferences -- think of Amazon's book
recommendations for an idea of what this does. For example, if your
business sells CDs, this service could determine which CDs to
recommend to your users for purchase, based on the ratings you
already have. This service makes it simple and cost-effective for
businesses to leverage this technology.

This AMI requires that you supply one run-time parameter:

dataBucket: A bucket where your input is stored and output will be stored

This is specified on the command line to ec2-run-instances as "-d
dataBucket=[data bucket name]".
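
For example (the AMI ID here is a placeholder -- substitute the real
AMI ID and your own bucket name):

ec2-run-instances ami-xxxxxxxx -t c1.xlarge -d dataBucket=my-data-bucket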

The bucket must grant both read and write permission, and the
"in.txt.gz" file within it (described next) must grant read access, to
the following canonical user ID:

c8453526c3ec4d3c2d3b7ecc654c8e3e4fbf006d595d7310def17047c28c58ab

This enables the service to read your input and write output back to
the bucket. For security, it is advised that you not store any other
data in this bucket.

The bucket named by dataBucket should contain an input file named
"in.txt.gz". This should be a GZip-compressed text file containing
comma-separated lines specifying user-item preferences. That is, each
line should be of the form:

[user ID],[item ID],[preference value]
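
For example (with made-up IDs and values):

123,456,4.5
123,789,3.0
124,456,5.0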

During operation, a file called iterations.txt in this bucket will be
updated with the number of users processed so far. At completion, the
bucket will contain log.txt, with output from the run, and out.txt.gz,
a GZip-compressed file containing comma-separated values, where each
line is of the form:

[user ID],[item ID],[estimated preference value]

All lines for a given user ID will be grouped together and sorted by
preference value, descending.
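
For example, output for two (made-up) users might look like:

123,1001,4.8
123,1005,4.2
123,2042,3.9
124,1001,4.9
124,3017,4.1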

This AMI is intended for use with 64-bit instance types: m1.large,
m1.xlarge, c1.xlarge

This service is appropriate for small- and medium-sized businesses --
roughly speaking, up to 10M user-item preferences. As a rough guide,
on a c1.xlarge instance, using the GroupLens 10M rating data set,
recommendations can be generated for all users in about 4 hours, at a
cost of about $10.
--------------

On Thu, Feb 26, 2009 at 2:58 PM, deneche abdelhakim <a_dene...@yahoo.fr> wrote:
>
> Hi,
> I'm planning to participate again in GSoC, and I want to do it again with 
> Mahout.
> This year, let's make Mahout run on Amazon EC2. This means building the 
> proper AMIs, running some Mahout projects (the GA examples) on EC2, giving 
> feedback, and writing simple, clear How-Tos about running a Mahout project 
> on EC2.
>
> The Mahout.GA examples (TSP and CDGA) should be good real-world scenarios 
> for how one may need to use Mahout.GA on EC2. The TSP example should be 
> modified to run from a console and to load TSPLIB benchmarks, so that we 
> can tackle more challenging TSP problems with the help of EC2. The CDGA 
> example should run unmodified given, of course, that Hadoop is configured 
> correctly on EC2 and the benchmark is on HDFS.
>
> These two examples will give us three use cases for Mahout on EC2:
>
> 1. TSP can be run on a single High-CPU EC2 instance. In this case, 
> Watchmaker's ConcurrentEvolutionEngine should take care of the 
> multi-threading part (or at least I hope so!) and there will be no need 
> for Hadoop;
>
> 2. TSP can also be run over multiple EC2 instances with the help of Hadoop;
>
> 3. CDGA not only needs Hadoop to run, but its data must also be on HDFS.
>
>
> So what do you think, is the "elephant" ready for a walk on EC2?
>
>
>
>
