Hi,

I am wondering whether the different algorithms available in Mahout produce different results and behave differently (e.g. performance: memory, speed, etc.). If so, could we have a short description (2-3 sentences per algorithm) of each one? For example, how do they perform under different conditions, i.e. how do they behave with respect to:

- number of documents
- average document size
- documents of very different sizes (e.g. half of the docs are very small and the other half very large; would either size group dominate the clustering for some reason?)
- cluster size
- ratio of number of documents to cluster size
- memory needed
- time needed

Right now, for example, I am interested in clustering documents:

- of similar size (most of the documents are very close to the average size)
- with a desired ratio of docs to clusters of 23 000 : 80 (or maybe even : 40 or : 20)

Which Mahout algorithm, and with which parameters, is recommended for my case? Of course, I could run my data through all the available algorithms and compare the results, but it would be good to know whether choosing one algorithm or another leads to a particular flavor of result, especially if this is already known from the specifics of the algorithms.

Best regards,
Bogdan
