Re: Mahout algorithms guide

Olivier Grisel Fri, 08 Jan 2010 02:39:36 -0800

2010/1/7 Bogdan Vatkov <[email protected]>:
> Hi,
>
> I am wondering whether the different algorithms available in Mahout give
> different results and behave differently (e.g. performance: memory, speed,
> etc.) and, if so, whether we could have a short (2-3 sentences per
> algorithm) description of each.
> For example, how they perform under different conditions, e.g. how they
> behave with respect to:
> - number of documents
> - average document size
> - documents of very different sizes (e.g. half of the docs are very small
> and the other half very large - would either of the doc sizes win for some
> reason during clustering)
> - cluster size
> - ratio of number of documents to cluster size
> - memory needed
> - time needed
>
> For example, right now I am interested in clustering documents:
> - of similar size (most of the documents have a size very close to the
> average size)
> - with a desired docs-to-clusters ratio of 23 000 : 80 (or maybe even : 40
> and : 20)
> Which Mahout algorithm, with which parameters, is recommended for my
> case?
>
> Of course I should be able to run my data through all possible algorithms
> and then try to compare results - but it would be good to know if using one
> or another algorithm would lead to one or another flavor of the result -
> especially if this is already known based on the specifics of the
> algorithms.


It would indeed be great to have a global, ready-to-run benchmark /
shootout driver that runs all the available algorithms for a given
task (e.g. document clustering, classification, ...) and that could be
launched on Amazon Elastic MapReduce with a couple of clicks, using
data already available on a public S3 account.

The result would be a comparative shootout report (with performance
measures), published regularly on the Mahout website.
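
To make the idea a bit more concrete, here is a rough local sketch of
what such a driver could look like. It is only an illustration: the
names (run_shootout, round_robin, single_cluster) are hypothetical,
nothing here is Mahout API, and the two "algorithms" are toy stand-ins
rather than real clusterers. A real driver would submit jobs to Hadoop /
Elastic MapReduce and would also report quality measures, not just
wall-clock time.

```python
import time
from typing import Callable, Dict, List, Sequence

# Hypothetical signature: an algorithm takes the points and the desired
# number of clusters k, and returns one cluster label per point.
Clusterer = Callable[[Sequence[Sequence[float]], int], List[int]]


def run_shootout(algorithms: Dict[str, Clusterer],
                 points: Sequence[Sequence[float]],
                 k: int) -> List[dict]:
    """Run every algorithm on the same data and collect simple measures."""
    report = []
    for name, algo in algorithms.items():
        start = time.perf_counter()
        labels = algo(points, k)
        elapsed = time.perf_counter() - start
        report.append({
            "algorithm": name,
            "seconds": elapsed,
            # Number of distinct clusters actually produced.
            "clusters_used": len(set(labels)),
        })
    # Fastest first, so the report reads like a ranking.
    return sorted(report, key=lambda row: row["seconds"])


# Toy stand-ins (assumptions, for illustration only).
def round_robin(points, k):
    return [i % k for i in range(len(points))]


def single_cluster(points, k):
    return [0 for _ in points]


if __name__ == "__main__":
    data = [[float(i)] for i in range(1000)]
    algos = {"round_robin": round_robin, "single_cluster": single_cluster}
    for row in run_shootout(algos, data, k=80):
        print(row)
```

Each real algorithm would be wrapped behind the same call signature, so
adding one to the shootout is a matter of adding one entry to the
dictionary.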

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
