Re: Clustering techniques, tips and tricks

Ted Dunning Mon, 04 Jan 2010 22:32:08 -0800

This definition isn't quite solid.  Probably the problem is just writing it
down in English without mathematical notation.

On Mon, Jan 4, 2010 at 7:32 PM, Rao, Vaijanath
<[email protected]>wrote:

> The idea is to have two distances one each for
> A) Across the cluster
> B) Within the cluster
>

What do you mean be these, exactly?

Are these distances from a point to a cluster?  Between clusters?  The
maximum or minimum of all distances from a point to members of the cluster?

In the iter-1 I allow inifite distance for both of these distance and run
> the regular logic of clustering. At the end of it calculate the min and max
> for both of these distances.
>

Regular logic?  Do you mean k-means?

Min and max across all points against the clusters they are in?  Or against
all clusters?

>  For the remaining iterations do the following
>

To each data point?

>
> A) find the Lowest distance cluster for the input point. If the distance of
> the input point increases the distance within the cluster then the
> previously assigned then ignore it and create a new cluster with this point.
>
>

What do you mean by increases the distance within the cluster?  The sum of
all distances from points in a cluster to the centroid?  Or do you mean to
say that   the new assignment is farther than any previous member of the
cluster from the centroid?

What are the new min and max distances for the new cluster?

If it does not voilate the older distance keep it with following probability
>        1. 1 if the distance of Input point is less then min distance.
>

Min over what?  for just this cluster?

>         2. in the ratio of (max-distance of the input from
> center)/max_distance
>

These two probabilities to not have the same limit for distances just larger
and just smaller than the min distance.  Do you mean that the second should
be (max-distance to center) / (max - min) ?

What do you mean be "keep it"?  Keep the assignment?  What if you don't keep
the assignment?   Do you then form a new cluster with this point?  If you do
that, what is the min and max distance for the new cluster?

B) If there are two clusters which have cluster distance less then the
> minimum cluster distance then merge them and calculate the within cluster
> distance and across cluster distance.
>

What is "cluster distance"?  What is "minimum cluster distance" ?  How can
any particular cluster distance be smaller than the smallest cluster
distance?

> I have observed that this has helped me in getting good clusters. Currently
> I have a single version code which has the above calculation and I am
> wokring on getting this into Map-red and should be able to submit a patch
> very soon.
>

Does single version code mean sequential code?

If so, don't wait until you have a parallel version!   File a JIRA and post
what you have!

>  Let me know If you think this will generally work.
>

I think that algorithms like this heuristic have a good chance of working.
I don't see that this particular algorithm is going to handle high
dimensions very well, nor do I think it will necessarily converge because it
looks like there is always a good chance that cluster members will be
assigned to a new cluster.  In order to improve convergence, you might
consider annealing this new cluster behavior, possibly taking 1 - Math.pow(1
- ratio-as-you-defined, iteration).  This will cause the clusters to become
more and more stable.

You may not be interested in a stable clustering.  The Dirichlet clustering
that we have is already a non-deterministic clustering algorithm that
produces a distribution over possible numbers of clusters and cluster
assignments.  Perhaps that is what your algorithm should be creating.

I am also curious about whether your algorithm is a variant of something
like neural gas clustering.

Re: Clustering techniques, tips and tricks

Reply via email to