Re: Most Frequently Used Clustering Algorithm

Kurt Watzka Sat, 17 Nov 2001 12:45:00 -0800

"Chia C Chong" <[EMAIL PROTECTED]> writes:

>Hi!


>I am new in this area..I wonder which clustering algorithm is the most
>frequently used and maybe the most robust??

This question has many levels, ranging from the decision between agglomerative
and "seed based" methods, through the choice of an appropriate measure of
distance or similarity, to the decision on a method to
form clusters.

I will try to answer the question of choosing one of the classic
agglomeration methods for agglomerative hierarchical cluster analysis.
The choice may depend on what you are trying to achieve.

If you want to detect outliers, single linkage is the method of choice.
Observations that are joined very late, and at rather high levels of
dissimilarity, are potential candidates for further inspection
(probable outliers). The disadvantage of single linkage is that two
groups can be "joined" at an early stage if there is a single observation
that forms a "bridge" between them.
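
A minimal sketch of the outlier point, not a production implementation:
a naive single-linkage agglomeration in plain Python that records the
height of every merge. The function name single_linkage_merges and the
six data points are made up for illustration; Euclidean distance is
assumed.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage_merges(points):
    """Naive single-linkage agglomeration.

    Returns a list of (height, cluster_a, cluster_b) in merge order,
    where clusters are tuples of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Single linkage: the distance between two clusters is the
        # smallest distance between any pair of their members.
        height, a, b = min(
            (min(dist(points[i], points[j]) for i in a for j in b), a, b)
            for a, b in combinations(clusters, 2)
        )
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((height, a, b))
    return merges


# Hypothetical data: a tight group plus one distant observation (index 5).
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.0, 0.5), (8.0, 8.0)]
merges = single_linkage_merges(points)
for height, a, b in merges:
    print(f"{height:6.2f}  {a} + {b}")
```

In this made-up data set, observation 5 is joined in the final step, at
a height far above all earlier merges, which is the pattern that would
flag it for further inspection.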

The complete linkage method has a tendency to form small homogeneous
clusters at an early stage, but because the distance between clusters
is defined as the distance between their most dissimilar members,
clusters that are in fact quite similar can stay separate until
quite a late stage of the agglomeration process.
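
To make the difference between the two definitions concrete, here is a
small sketch that compares the single-linkage and complete-linkage
distance between the same clusters. The helper names single_link and
complete_link and the one-dimensional clusters a, b and c are invented
for this example.

```python
def single_link(a, b):
    """Distance between the two closest members of clusters a and b."""
    return min(abs(x - y) for x in a for y in b)


def complete_link(a, b):
    """Distance between the two most dissimilar members of a and b."""
    return max(abs(x - y) for x in a for y in b)


# Hypothetical 1-D clusters: a and b are adjacent but elongated,
# c is compact and lies further away from b.
a = [0.0, 1.0, 2.0, 3.0]
b = [3.5, 4.5, 5.5, 6.5]
c = [9.0, 9.2]

for name, link in (("single", single_link), ("complete", complete_link)):
    print(f"{name:8s}  d(a,b) = {link(a, b):.2f}   d(b,c) = {link(b, c):.2f}")
```

With these numbers, single linkage regards a and b as the closer pair,
while complete linkage rates b as closer to c than to its direct
neighbour a, because the far ends of a and b dominate the distance.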

Ward's method will stress the demand for homogeneity within a cluster,
but it will probably not be your tool of choice if you are interested
in detecting structures in your data that go beyond mere "within
cluster homogeneity".
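
As an illustration of what "homogeneity within a cluster" means here:
Ward's method is usually described as merging the pair of clusters whose
union increases the within-cluster sum of squares the least. The sketch
below computes that increase directly; the names sse and ward_cost and
the example points are assumptions made for this illustration.

```python
def sse(cluster):
    """Sum of squared Euclidean distances of the points to their centroid."""
    n = len(cluster)
    dims = range(len(cluster[0]))
    centroid = [sum(p[d] for p in cluster) / n for d in dims]
    return sum((p[d] - centroid[d]) ** 2 for p in cluster for d in dims)


def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    return sse(a + b) - sse(a) - sse(b)


# Hypothetical clusters, each given as a list of 2-D points.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]
print(ward_cost(a, b))
```

For two clusters of sizes n_a and n_b this increase equals
n_a * n_b / (n_a + n_b) times the squared distance between the two
centroids, so the criterion looks only at centroids and cluster sizes,
not at the shape of the clusters.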

Average linkage will be computationally expensive, which may or may not
be a point to take into consideration depending on the size of your
data set, but it avoids some disadvantages of the other methods, depending
on what you are trying to achieve.
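
A sketch of the average-linkage distance, evaluated naively: every
member of one cluster is compared with every member of the other, so a
single evaluation touches n_a * n_b pairs. The function name
average_link and the data are hypothetical, and Euclidean distance is
assumed. Actual implementations may update distances incrementally
rather than recomputing them in this way.

```python
from math import dist  # Euclidean distance, Python 3.8+


def average_link(a, b):
    """Mean distance over all pairs (p, q) with p in a and q in b."""
    total = sum(dist(p, q) for p in a for q in b)
    return total / (len(a) * len(b))


# Hypothetical clusters of 1-D observations, written as 1-tuples.
a = [(0.0,), (1.0,)]
b = [(4.0,), (6.0,)]
print(average_link(a, b))
```

For these two clusters the closest pair is 3 apart and the most distant
pair is 6 apart; the average-linkage distance lies between the two, so
it is less sensitive to a single "bridge" observation or to a single
extreme member.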

Maybe the most important point to make about cluster analysis was made 
by Fowlkes et al. (1987, Variable selection in clustering and other contexts): 
"In the murky area of cluster analysis, where there is so little guiding
theory, informal graphical approaches which can be used in a highly
interactive manner are not only very useful but perhaps even essential
for getting the job done."

There is no silver bullet for detecting clusters. The important thing
is to look at your results in connection with your data. A useful 
technique is to use a graphical display of your data to visualize and
evaluate different approaches to detect clusters.

Kurt

-- 
| Kurt Watzka                             
| [EMAIL PROTECTED]


=================================================================
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
                  http://jse.stat.ncsu.edu/
=================================================================
