An attentive reader on the SenseClusters users list suggested the
following three papers as particularly good starting points for anyone
interested in automatically determining the number of clusters.

============================================================================
X-means: Extending K-means with Efficient Estimation of the Number of Clusters.
Pelleg and Moore, ICML-2000
http://www.cs.cmu.edu/~dpelleg/download/kmeans.pdf

With supporting commentary and code at:
http://www.cs.cmu.edu/~dpelleg/kmeans.html

Note that on his web page, Dan Pelleg mentions that the following similar
but independent method predates X-means...

An Efficient K-Means Clustering Algorithm, Alsabti, Ranka, and Singh
First Workshop on High-Performance Data Mining, 1998
ftp://ftp.cise.ufl.edu/pub/faculty/ranka/cluster.ps.gz

============================================================================
Document Clustering with Cluster Refinement and Model Selection Capabilities
Liu, Gong, Xu, and Zhu, SIGIR-2002
http://www.yow-now.com/xw/SIGIR02.pdf

Abstract: In this paper, we propose a document clustering method that
strives to achieve: (1) a high accuracy of document clustering, and (2)
the capability of estimating the number of clusters in the document corpus
(i.e. the model selection capability). To accurately cluster the given
document corpus, we employ a richer feature set to represent each
document, and use the Gaussian Mixture Model (GMM) together with the
Expectation-Maximization (EM) algorithm to conduct an initial document
clustering. From this initial result, we identify a set of discriminative
features for each cluster, and refine the initially obtained document
clusters by voting on the cluster label of each document using this
discriminative feature set. This self-refinement process of discriminative
feature identification and cluster label voting is iteratively applied
until the convergence of document clusters. On the other hand, the model
selection capability is achieved by introducing randomness in the cluster
initialization stage, and then discovering a value C for the number of
clusters N by which running the document clustering process for a fixed
number of times yields sufficiently similar results. Performance
evaluations exhibit clear superiority of the proposed method with its
improved document clustering and model selection accuracies. The
evaluations also demonstrate how each feature as well as the cluster
refinement process contribute to the document clustering accuracy.

[note the above also seems related to our interest in cluster labeling!]
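
The stability idea in the abstract above (prefer the number of clusters
for which repeated randomly initialized runs agree) can be sketched
generically. This is NOT Liu et al.'s actual GMM/EM method, just a toy
illustration of the stability criterion on 1-D data; the names
kmeans_1d, rand_index, and most_stable_k are all made up for the sketch.

```python
import random

def kmeans_1d(points, k, seed):
    """Plain 1-D k-means (Lloyd's algorithm) with randomly sampled initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(100):
        # assign each point to its nearest center
        labels = [min(range(k), key=lambda j: (p - centers[j]) ** 2) for p in points]
        # recompute each center as the mean of its cluster
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new.append(sum(members) / len(members) if members else centers[j])
        if new == centers:
            break
        centers = new
    return labels

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (together vs. apart)."""
    pairs, agree = 0, 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            pairs += 1
            if (a[i] == a[j]) == (b[i] == b[j]):
                agree += 1
    return agree / pairs

def most_stable_k(points, candidates, runs=5):
    """Pick the candidate k whose clusterings agree most across random restarts."""
    best_k, best_score = None, -1.0
    for k in candidates:
        labelings = [kmeans_1d(points, k, seed) for seed in range(runs)]
        scores = [rand_index(labelings[i], labelings[j])
                  for i in range(runs) for j in range(i + 1, runs)]
        score = sum(scores) / len(scores)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# two well-separated 1-D groups: k = 2 gives identical partitions on
# every restart, while k = 3 and k = 4 split the groups seed-dependently
data = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
print(most_stable_k(data, [2, 3, 4]))
```

The point of the sketch is only the selection criterion: an "unstable"
k produces different partitions under different random initializations,
so its average pairwise Rand index drops below that of the true k.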

============================================================================
Learning the k in k-means
Hamerly and Elkan
NIPS 2003
http://books.nips.cc/papers/files/nips16/NIPS2003_AA36.pdf

Abstract: When clustering a dataset, the right number k of clusters to
use is often not obvious, and choosing k automatically is a hard
algorithmic problem. In this paper we present a new algorithm for choosing
k that is based on a new statistical test for the hypothesis that a
subset of data follows a Gaussian distribution. The algorithm runs
k-means with increasing k until the test fails to reject the hypothesis
that the data assigned to each k-means center are Gaussian. We present
results from experiments on synthetic and real-world data showing that
the algorithm works well, and better than a recent method based on the
BIC penalty for model complexity.
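
The loop described in the abstract (run k-means with increasing k until
every cluster passes a Gaussianity test) might be sketched as below.
Note the hedges: this is a toy 1-D version, the excess-kurtosis check is
a crude stand-in for the paper's actual statistical test (they project
onto a principal axis and apply an Anderson-Darling-style test), and the
names kmeans_1d, looks_gaussian, and learn_k are invented for the sketch.

```python
import random

def kmeans_1d(points, k, seed=0):
    """Plain 1-D k-means (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(100):
        labels = [min(range(k), key=lambda j: (p - centers[j]) ** 2) for p in points]
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new.append(sum(members) / len(members) if members else centers[j])
        if new == centers:
            break
        centers = new
    return labels

def excess_kurtosis(xs):
    """Sample excess kurtosis: ~0 for Gaussian data, near -2 for a balanced bimodal mix."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 ** 2) - 3.0

def looks_gaussian(xs, threshold=-1.0):
    # Crude stand-in for a real normality test: a mixture of two
    # well-separated Gaussians is strongly platykurtic, while a single
    # Gaussian sample sits near 0, so the threshold separates the cases.
    return excess_kurtosis(xs) > threshold

def learn_k(points, max_k=10):
    """Increase k until every cluster passes the (stand-in) Gaussianity check."""
    for k in range(1, max_k + 1):
        labels = kmeans_1d(points, k)
        clusters = [[p for p, l in zip(points, labels) if l == j] for j in range(k)]
        if all(looks_gaussian(c) for c in clusters if c):
            return k
    return max_k

# two unit-variance Gaussian blobs, ten standard deviations apart:
# k = 1 fails the check (bimodal), k = 2 passes for both clusters
rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)] +
        [rng.gauss(10.0, 1.0) for _ in range(200)])
print(learn_k(data))
```

The structure mirrors the algorithm as summarized in the abstract;
swapping in a proper statistical test is what makes it the authors'
method rather than this rough approximation.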

============================================================================

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

