2010/1/7 Bogdan Vatkov <[email protected]>:
> Hi,
>
> I am wondering if the different algorithms available @ Mahout have different
> results and different behavior (e.g. performance - memory, speed, etc.) and
> if yes could we have some short (2-3 sentences per alg.) description of the
> different algs.
> For example how they perform in different conditions: e.g. how they behave
> related to:
> - documents amount
> - documents average size
> - documents of very different sizes (e.g. half of the docs are very small
> and the other half very large - would either of the doc sizes win for some
> reason during clustering)
> - cluster size
> - documents amount to cluster size ratio
> - memory needed
> - time needed
>
> For example I am right now interested in clustering of documents:
> - of close size (most of the documents have size very close to the average
> size)
> - ratio between docs and clusters desired is 23 000 : 80 (or maybe even : 40
> and :20)
> Which Mahout algorithm and using which parameters is recommended for my
> case?
>
> Of course I should be able to run my data through all possible algorithms
> and then try to compare results - but it would be good to know if using one
> or another algorithm would lead to one or another flavor of the result -
> especially if this is already known based on the specifics of the
> algorithms.

It would indeed be great to have a global, ready-to-run benchmark / shootout
driver that runs all the available algorithms for a given task (e.g. document
clustering, classification, ...) and that could be launched on Amazon Elastic
MapReduce with a couple of clicks, using data already available on a public
S3 account. The results would be a comparative shootout report (with
performance measures), published regularly on the Mahout website.

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
