RE: Clustering techniques, tips and tricks

Rao, Vaijanath Mon, 04 Jan 2010 23:13:16 -0800

Hi Ted, 

My replies are inside

-----Original Message-----
From: Ted Dunning [mailto:[email protected]] 
Sent: Tuesday, January 05, 2010 12:01 PM
To: [email protected]
Subject: Re: Clustering techniques, tips and tricks

This definition isn't quite solid.  Probably the problem is just writing
it down in English without mathematical notation.

On Mon, Jan 4, 2010 at 7:32 PM, Rao, Vaijanath
<[email protected]>wrote:

> The idea is to have two distances one each for
> A) Across the cluster
> B) Within the cluster
>

Q? What do you mean be these, exactly?
Are these distances from a point to a cluster?  Between clusters?  The
maximum or minimum of all distances from a point to members of the
cluster?

--For Within the cluster distance
A cluster will have some points which are near to centroid and some may
be farther from the centroid. Distance within cluster implies the
distance of the point and the centroid of the cluster in which the point
belongs. The min distance represent the minimum distance a point has
with the centroid ( which is near to centroid), whereas the max distance
represents the maximum distance a point has with the centroid of same
cluster.

For across the cluster distance
This is the distance between the cetroids of the clusters. The min
distance here represents the minimum distance between two such cluster
centroids and max distance represents the maximum distance between the
centroids of two clusters.

Ideally we want any clustering algorithm to have smaller intra-cluster
distance ( distance between centroid and the point within the cluster )
and larger inter-cluster distance ( distance between two cluster
centroids )

In the iter-1 I allow inifite distance for both of these distance and
run
> the regular logic of clustering. At the end of it calculate the min 
> and max for both of these distances.
>

Q? Regular logic?  Do you mean k-means?
--Regular logic here implies the selection of the cluster based on data
point. In case od K-Means it is minimum distance. For Some other
algorithm it will be within this distance or threashold and so on. So
the assignment of data point to cluster remains as per the Choice of
Clustering algorithm

>  For the remaining iterations do the following
>

Q? To each data point?
--Yes for every data point

>
> A) find the Lowest distance cluster for the input point. If the 
> distance of the input point increases the distance within the cluster 
> then the previously assigned then ignore it and create a new cluster
with this point.
>
>

Q? What do you mean by increases the distance within the cluster?  The
sum of all distances from points in a cluster to the centroid?  Or do
you mean to
say that   the new assignment is farther than any previous member of the
cluster from the centroid?

--Here I meant that when the data point is assigned to this cluster it
incrreased the within cluster ( intra-cluster distance )

Q? What are the new min and max distances for the new cluster?
For new clusters that are created it will be set to infinite but if the
cluster already exisit the new min and max will be claculated based on
the points that are present in that cluster.

If it does not voilate the older distance keep it with following
probability
>        1. 1 if the distance of Input point is less then min distance.
>

Q? Min over what?  for just this cluster?
Yes, if new point assigned to this cluster is smaller then the prev min,
then it's added as is to centroid else this datapoint will carry a
smaller weight when added to centroid 

>         2. in the ratio of (max-distance of the input from 
> center)/max_distance
>

Q? These two probabilities to not have the same limit for distances just
larger and just smaller than the min distance.  Do you mean that the
second should be (max-distance to center) / (max - min) ?

I haven't played with the weightage, but your eq should give the values
in the range of 0-1 which defines the wt of the datapoint carries when
the centroid is calculated. I will use this

Q? What do you mean be "keep it"?  Keep the assignment?  What if you
don't keep
the assignment?   Do you then form a new cluster with this point?  If
you do
that, what is the min and max distance for the new cluster?

--If the data point is not assigned to any cluster then it becomes a new
cluster with min and max distance set to infinite. Which means when 

B) If there are two clusters which have cluster distance less then the
> minimum cluster distance then merge them and calculate the within 
> cluster distance and across cluster distance.
>

Q? What is "cluster distance"?  What is "minimum cluster distance" ?
How can any particular cluster distance be smaller than the smallest
cluster distance?
--Here what I meant was if after the execution of current iteration, if
there are two cluster centroids which have distance less then the last
iteration minimum inter-cluster distance, then we can merge these two
clusters to get one big cluster. At this point we need to recalculate
the inter-cluster distances and intra-cluster distances.

> I have observed that this has helped me in getting good clusters. 
> Currently I have a single version code which has the above calculation

> and I am wokring on getting this into Map-red and should be able to 
> submit a patch very soon.
>

Q? Does single version code mean sequential code?
--Yes

Q?If so, don't wait until you have a parallel version!   File a JIRA and
post
what you have!

--I will definitely do that

>  Let me know If you think this will generally work.
>

Q? I think that algorithms like this heuristic have a good chance of
working.
I don't see that this particular algorithm is going to handle high
dimensions very well, nor do I think it will necessarily converge
because it looks like there is always a good chance that cluster members
will be assigned to a new cluster.  In order to improve convergence, you
might consider annealing this new cluster behavior, possibly taking 1 -
Math.pow(1
- ratio-as-you-defined, iteration).  This will cause the clusters to
become more and more stable.

--I don't have any such idea on how good the convergernce is, but I will
try to add annealing as this will help it to be more stable.

Q? You may not be interested in a stable clustering.  The Dirichlet
clustering that we have is already a non-deterministic clustering
algorithm that produces a distribution over possible numbers of clusters
and cluster assignments.  Perhaps that is what your algorithm should be
creating.

--I need to look at this. As I wanted to have more stable clusters not
sure If I am going in the right direction to get those.

Q? I am also curious about whether your algorithm is a variant of
something like neural gas clustering.
--Assuming the cluster's grows and shrinks over a preroid of time then
it may be some form of neural gas. But cannot assume that this is neural
gas. As far as I know the neural gas has lot many features than what I
have written about.

--Thanks and Regards
Vaijanath N. Rao

RE: Clustering techniques, tips and tricks

Reply via email to