Hello, I started getting to know Apache Mahout clustering by running the quick start guide. I ran the Dirichlet clustering algorithm over the synthetic control data available in the example, but the results are not quite satisfactory. I noticed that the number of clusters is estimated approximately correctly, but the data are very mixed and the clusters are not well separated. What could be the reason for that? Does the proposed vector representation really lead to Gaussian-distributed points?
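If I read the example correctly, each record of the synthetic control data (a series of 60 samples) is simply turned into a 60-dimensional vector of the raw values, along the lines of the sketch below (this is only my approximation of what the conversion does, and the DenseVector package name may differ between Mahout versions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class SyntheticControlReader {

  // Reads synthetic_control.data: one whitespace-separated record of raw samples per line,
  // and turns each record into a DenseVector with one dimension per sample.
  public static List<Vector> readVectors(String path) throws Exception {
    List<Vector> vectors = new ArrayList<Vector>();
    BufferedReader reader = new BufferedReader(new FileReader(path));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] tokens = line.trim().split("\\s+");
      double[] values = new double[tokens.length];
      for (int i = 0; i < tokens.length; i++) {
        values[i] = Double.parseDouble(tokens[i]);
      }
      vectors.add(new DenseVector(values));
    }
    reader.close();
    return vectors;
  }
}

With that representation every coordinate is a single raw sample of the signal, which is why I am not sure the points can really be expected to be Gaussian-distributed in each dimension.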
I then investigated the issue further by putting a little more semantics into the input. I produced 3-dimensional vectors from the data with the following characteristics:

dimension 1: the average angle of the synthetic control signal line, estimated by running a linear regression on each signal
dimension 2: the number of times the synthetic control line crosses its average (a straight line at the mean value)
dimension 3: the largest shift of values in either direction (up or down)

(a small sketch of this extraction is in the P.S. below)

When I ran the algorithm on these vectors, the data were classified correctly using the L1 distance measure (this measure turned out to give the best results). However, the algorithm becomes parametric: the alpha parameter and the initial number of clusters strongly affect the final result. In addition, using Gaussian clusters with this approach leads to bad results.

My conclusion so far is that one should first run the example as it is, to get an overview of the number of clusters, and then add some semantics to the vectors and choose a suitable distance measure in order to produce the correct clusters. In this respect my question is: how should one approach a new clustering problem? Maybe you could recommend me something to read.

In addition, while experimenting I noticed several problems. If they are bugs, I can report them:

1. AsymmetricSampledNormalModel does not work correctly. On line 125 it is enough for a single per-dimension probability to be 0 to make the whole probability 0. This happens because the exponential underflows to 0 when the standard deviation is very small and the distance is much larger (a tiny snippet illustrating this is in the P.P.S.). To fix this, wouldn't it be better to base the initial sd-s in AsymmetricSampledNormalDistribution (the sampleFromPrior method) on the data, for example on the maximum data value? GaussianCluster, for comparison, calculates the pdf in a different way, by summing the per-dimension probabilities.

2. CosineDistanceMeasure does not work correctly, because the initial clusters have 0 centers and the angle between the 0 vector and another vector cannot be determined.

3. MahalanobisDistanceMeasure cannot be used with the Dirichlet clusterer, because its configure method is never called.

Regards,
Vasil
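P.S. For reference, this is roughly how I computed the three features described above (a simplified sketch: the slope comes from an ordinary least-squares fit over the sample index, and I take the "largest shift" to be the largest jump between consecutive samples):

// Sketch of the 3-dimensional feature extraction described above.
// Input: one synthetic control signal as an array of samples.
public class SignalFeatures {

  public static double[] extractFeatures(double[] signal) {
    int n = signal.length;

    // Mean of the signal, used as the "average straight line".
    double mean = 0.0;
    for (double v : signal) {
      mean += v;
    }
    mean /= n;

    // Dimension 1: slope of an ordinary least-squares line fitted over the
    // sample index, converted to an angle.
    double xMean = (n - 1) / 2.0;
    double num = 0.0;
    double den = 0.0;
    for (int i = 0; i < n; i++) {
      num += (i - xMean) * (signal[i] - mean);
      den += (i - xMean) * (i - xMean);
    }
    double angle = Math.atan(num / den);

    // Dimension 2: number of times the signal crosses its mean.
    int crossings = 0;
    for (int i = 1; i < n; i++) {
      if ((signal[i - 1] - mean) * (signal[i] - mean) < 0) {
        crossings++;
      }
    }

    // Dimension 3: largest shift between consecutive samples, in either direction.
    double maxShift = 0.0;
    for (int i = 1; i < n; i++) {
      maxShift = Math.max(maxShift, Math.abs(signal[i] - signal[i - 1]));
    }

    return new double[] { angle, crossings, maxShift };
  }
}

The resulting array is what I wrap into a vector and feed to the clusterer.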

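P.P.S. A tiny snippet illustrating the underflow behind point 1: with a very small standard deviation and a much larger distance, the per-dimension Gaussian pdf already evaluates to exactly 0, so the product over all dimensions is 0 as well (the numbers below are only illustrative):

public class PdfUnderflowDemo {
  public static void main(String[] args) {
    double sd = 1e-3;    // very small sampled standard deviation
    double dist = 50.0;  // observation far away from the sampled mean

    // Gaussian pdf value for a single dimension: the exponent is about -1.25e9,
    // so Math.exp(...) underflows to exactly 0.0.
    double pdf = Math.exp(-dist * dist / (2.0 * sd * sd)) / (sd * Math.sqrt(2.0 * Math.PI));
    System.out.println(pdf); // prints 0.0 -> one such dimension zeroes the whole product
  }
}

Either basing the initial sd-s on the data, as suggested above, or accumulating the per-dimension values in log space would avoid this; I have not checked which of the two fits the Dirichlet code better.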