This definition isn't quite solid. Probably the problem is just writing it down in English without mathematical notation.
On Mon, Jan 4, 2010 at 7:32 PM, Rao, Vaijanath <[email protected]>wrote: > The idea is to have two distances one each for > A) Across the cluster > B) Within the cluster > What do you mean be these, exactly? Are these distances from a point to a cluster? Between clusters? The maximum or minimum of all distances from a point to members of the cluster? In the iter-1 I allow inifite distance for both of these distance and run > the regular logic of clustering. At the end of it calculate the min and max > for both of these distances. > Regular logic? Do you mean k-means? Min and max across all points against the clusters they are in? Or against all clusters? > For the remaining iterations do the following > To each data point? > > A) find the Lowest distance cluster for the input point. If the distance of > the input point increases the distance within the cluster then the > previously assigned then ignore it and create a new cluster with this point. > > What do you mean by increases the distance within the cluster? The sum of all distances from points in a cluster to the centroid? Or do you mean to say that the new assignment is farther than any previous member of the cluster from the centroid? What are the new min and max distances for the new cluster? If it does not voilate the older distance keep it with following probability > 1. 1 if the distance of Input point is less then min distance. > Min over what? for just this cluster? > 2. in the ratio of (max-distance of the input from > center)/max_distance > These two probabilities to not have the same limit for distances just larger and just smaller than the min distance. Do you mean that the second should be (max-distance to center) / (max - min) ? What do you mean be "keep it"? Keep the assignment? What if you don't keep the assignment? Do you then form a new cluster with this point? If you do that, what is the min and max distance for the new cluster? B) If there are two clusters which have cluster distance less then the > minimum cluster distance then merge them and calculate the within cluster > distance and across cluster distance. > What is "cluster distance"? What is "minimum cluster distance" ? How can any particular cluster distance be smaller than the smallest cluster distance? > I have observed that this has helped me in getting good clusters. Currently > I have a single version code which has the above calculation and I am > wokring on getting this into Map-red and should be able to submit a patch > very soon. > Does single version code mean sequential code? If so, don't wait until you have a parallel version! File a JIRA and post what you have! > Let me know If you think this will generally work. > I think that algorithms like this heuristic have a good chance of working. I don't see that this particular algorithm is going to handle high dimensions very well, nor do I think it will necessarily converge because it looks like there is always a good chance that cluster members will be assigned to a new cluster. In order to improve convergence, you might consider annealing this new cluster behavior, possibly taking 1 - Math.pow(1 - ratio-as-you-defined, iteration). This will cause the clusters to become more and more stable. You may not be interested in a stable clustering. The Dirichlet clustering that we have is already a non-deterministic clustering algorithm that produces a distribution over possible numbers of clusters and cluster assignments. Perhaps that is what your algorithm should be creating. I am also curious about whether your algorithm is a variant of something like neural gas clustering.
