Hi Ted, My replies are inside
-----Original Message----- From: Ted Dunning [mailto:[email protected]] Sent: Tuesday, January 05, 2010 12:01 PM To: [email protected] Subject: Re: Clustering techniques, tips and tricks This definition isn't quite solid. Probably the problem is just writing it down in English without mathematical notation. On Mon, Jan 4, 2010 at 7:32 PM, Rao, Vaijanath <[email protected]>wrote: > The idea is to have two distances one each for > A) Across the cluster > B) Within the cluster > Q? What do you mean be these, exactly? Are these distances from a point to a cluster? Between clusters? The maximum or minimum of all distances from a point to members of the cluster? --For Within the cluster distance A cluster will have some points which are near to centroid and some may be farther from the centroid. Distance within cluster implies the distance of the point and the centroid of the cluster in which the point belongs. The min distance represent the minimum distance a point has with the centroid ( which is near to centroid), whereas the max distance represents the maximum distance a point has with the centroid of same cluster. For across the cluster distance This is the distance between the cetroids of the clusters. The min distance here represents the minimum distance between two such cluster centroids and max distance represents the maximum distance between the centroids of two clusters. Ideally we want any clustering algorithm to have smaller intra-cluster distance ( distance between centroid and the point within the cluster ) and larger inter-cluster distance ( distance between two cluster centroids ) In the iter-1 I allow inifite distance for both of these distance and run > the regular logic of clustering. At the end of it calculate the min > and max for both of these distances. > Q? Regular logic? Do you mean k-means? --Regular logic here implies the selection of the cluster based on data point. In case od K-Means it is minimum distance. For Some other algorithm it will be within this distance or threashold and so on. So the assignment of data point to cluster remains as per the Choice of Clustering algorithm > For the remaining iterations do the following > Q? To each data point? --Yes for every data point > > A) find the Lowest distance cluster for the input point. If the > distance of the input point increases the distance within the cluster > then the previously assigned then ignore it and create a new cluster with this point. > > Q? What do you mean by increases the distance within the cluster? The sum of all distances from points in a cluster to the centroid? Or do you mean to say that the new assignment is farther than any previous member of the cluster from the centroid? --Here I meant that when the data point is assigned to this cluster it incrreased the within cluster ( intra-cluster distance ) Q? What are the new min and max distances for the new cluster? For new clusters that are created it will be set to infinite but if the cluster already exisit the new min and max will be claculated based on the points that are present in that cluster. If it does not voilate the older distance keep it with following probability > 1. 1 if the distance of Input point is less then min distance. > Q? Min over what? for just this cluster? Yes, if new point assigned to this cluster is smaller then the prev min, then it's added as is to centroid else this datapoint will carry a smaller weight when added to centroid > 2. in the ratio of (max-distance of the input from > center)/max_distance > Q? These two probabilities to not have the same limit for distances just larger and just smaller than the min distance. Do you mean that the second should be (max-distance to center) / (max - min) ? I haven't played with the weightage, but your eq should give the values in the range of 0-1 which defines the wt of the datapoint carries when the centroid is calculated. I will use this Q? What do you mean be "keep it"? Keep the assignment? What if you don't keep the assignment? Do you then form a new cluster with this point? If you do that, what is the min and max distance for the new cluster? --If the data point is not assigned to any cluster then it becomes a new cluster with min and max distance set to infinite. Which means when B) If there are two clusters which have cluster distance less then the > minimum cluster distance then merge them and calculate the within > cluster distance and across cluster distance. > Q? What is "cluster distance"? What is "minimum cluster distance" ? How can any particular cluster distance be smaller than the smallest cluster distance? --Here what I meant was if after the execution of current iteration, if there are two cluster centroids which have distance less then the last iteration minimum inter-cluster distance, then we can merge these two clusters to get one big cluster. At this point we need to recalculate the inter-cluster distances and intra-cluster distances. > I have observed that this has helped me in getting good clusters. > Currently I have a single version code which has the above calculation > and I am wokring on getting this into Map-red and should be able to > submit a patch very soon. > Q? Does single version code mean sequential code? --Yes Q?If so, don't wait until you have a parallel version! File a JIRA and post what you have! --I will definitely do that > Let me know If you think this will generally work. > Q? I think that algorithms like this heuristic have a good chance of working. I don't see that this particular algorithm is going to handle high dimensions very well, nor do I think it will necessarily converge because it looks like there is always a good chance that cluster members will be assigned to a new cluster. In order to improve convergence, you might consider annealing this new cluster behavior, possibly taking 1 - Math.pow(1 - ratio-as-you-defined, iteration). This will cause the clusters to become more and more stable. --I don't have any such idea on how good the convergernce is, but I will try to add annealing as this will help it to be more stable. Q? You may not be interested in a stable clustering. The Dirichlet clustering that we have is already a non-deterministic clustering algorithm that produces a distribution over possible numbers of clusters and cluster assignments. Perhaps that is what your algorithm should be creating. --I need to look at this. As I wanted to have more stable clusters not sure If I am going in the right direction to get those. Q? I am also curious about whether your algorithm is a variant of something like neural gas clustering. --Assuming the cluster's grows and shrinks over a preroid of time then it may be some form of neural gas. But cannot assume that this is neural gas. As far as I know the neural gas has lot many features than what I have written about. --Thanks and Regards Vaijanath N. Rao
