One of the invited talks at ACL this summer (in Ann Arbor) was by Michael Jordan, Professor of Statistics at UC Berkeley. He spoke on a technique known as Latent Dirichlet Allocation, and on hierarchical extensions of it that can be used to automatically determine the number of clusters found in a set of data.
Michael Jordan is quite prolific, but the following subset of papers seems to touch upon the issues he talked about at ACL:

http://www.cs.berkeley.edu/~jordan/ir.html

Of particular note are the following papers, which seem to be the source for much of the information in the talk. The first gives a comprehensive discussion of the method and its various applications, and it's rather complicated stuff in fact. :) But, at its core, they take a Bayesian view (this is akin to generative models, which is the term sometimes used at ACL these days) where a collection of data is viewed as a mixture model, and they then try to separate out how many mixture components (clusters) are *really* in the data.

Hierarchical Dirichlet processes. Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Technical Report 653, Department of Statistics, University of California, Berkeley, 2004.
http://www.cs.berkeley.edu/~jordan/papers/hdp.ps

They have applied this technique to identifying the number of topics in a set of documents, and that work is described here:

Hierarchical topic models and the nested Chinese restaurant process. D. M. Blei, T. Griffiths, M. I. Jordan, and J. Tenenbaum. In S. Thrun, L. Saul, and B. Schoelkopf (Eds.), Advances in Neural Information Processing Systems (NIPS) 16, 2004.
http://www.cs.berkeley.edu/~jordan/papers/lda-crp.pdf

They treat the problem of topic identification as one of model search, where they are looking for the "tree" of topics that best describes the data. This can be viewed in some respects as being like agglomerative clustering, where you construct a tree of clusters.

Underlying the ideas in this talk is the notion of a Chinese Restaurant Process. Imagine that you have a very large restaurant with an infinite number of tables, and customers come in and sit at a table. Each customer sits at an existing table with probability proportional to the number of customers already seated there, or starts a new table with probability proportional to a concentration parameter. After all the customers are seated, you are done. :) If you picture each customer as an item to be clustered, and each table as a cluster, then you get the idea, more or less. That's how they propose to solve the cluster stopping problem. There is a toy simulation of this just below.

So, this is interesting work, and well worth checking out, especially if you are of a statistical frame of mind.

BTW, if you are wondering about the Dirichlet distribution, it's the conjugate prior of the multinomial distribution, which is typically what we use to represent textual data.
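To make the restaurant metaphor concrete, here is a toy simulation. This is just my own illustrative sketch of the standard CRP seating rule, with a made-up concentration parameter alpha; it is not code from any of the papers above.

import random

def crp(num_customers, alpha, seed=0):
    """Simulate a Chinese Restaurant Process: each customer joins an
    existing table with probability proportional to its occupancy, or
    opens a new table with probability proportional to alpha."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for _ in range(num_customers):
        weights = tables + [alpha]  # last entry is the "new table" option
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)        # open a new table (a new cluster)
        else:
            tables[k] += 1
    return tables

print(crp(1000, alpha=1.0))  # table sizes; the length of this list is the number of clusters

The number of occupied tables grows slowly (roughly alpha * log n), so the number of clusters is not fixed in advance but instead adapts to the amount of data.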
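And to illustrate the conjugacy point: if word counts are multinomial with parameters theta, and theta has a Dirichlet prior, then the posterior is again a Dirichlet whose parameters are just the prior values plus the observed counts. A tiny sketch, with made-up numbers:

# Dirichlet-multinomial conjugacy: posterior = Dirichlet(prior + counts)
prior = [1.0, 1.0, 1.0]   # symmetric Dirichlet prior over 3 word types
counts = [50, 30, 20]     # hypothetical observed word counts
posterior = [a + c for a, c in zip(prior, counts)]  # [51.0, 31.0, 21.0]

# The posterior mean is just a smoothed relative frequency estimate:
total = sum(posterior)
print([p / total for p in posterior])

This is why a Dirichlet prior amounts to add-alpha smoothing of the relative frequencies, which is part of what makes it such a natural fit for text.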
Remarkably enough, my own PhD dissertation is somewhat related to some of these ideas; see Chapter 4, in particular the section on using Gibbs Sampling to estimate the parameters of a Naive Bayesian model used for clustering. I was not actually working on cluster stopping at that time, so we would fix the number of clusters and then estimate the parameters of a Naive Bayesian model in order to assign the instances to clusters. We used both the EM algorithm and Gibbs Sampling to estimate the parameters and do the clustering (in order to compare them). So, it's not quite the same, and it's much simpler in fact, but you will find the term "Dirichlet distribution" in my dissertation, which must surely count for something. :)

In any case, here it is:

http://www.d.umn.edu/~tpederse/Pubs/diss.ps.gz
Ted Pedersen, Learning Probabilistic Models of Word Sense Disambiguation. PhD Dissertation, Southern Methodist University, May 1998.

So one part of the dissertation was comparing how well first order clustering methods (like we have in SenseClusters) performed in some simple word sense discrimination experiments, as compared to unsupervised Naive Bayesian classifiers whose parameters were estimated using the EM algorithm and Gibbs Sampling. In general, we found that the first order methods worked better, which is perhaps why SenseClusters relies on more traditional methods of clustering rather than on probabilistic models.

Why did we use the Naive Bayesian classifier for unsupervised learning? Because we found it to be very good in the supervised case. Not always the best, but always good. If you are curious what the Gibbs sampling side of that looks like, there is a stripped-down sketch just below.
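Here is a heavily simplified sketch of collapsed Gibbs sampling for clustering with an unsupervised Naive Bayes (mixture of multinomials) model, with the number of clusters K fixed. To be clear, this is a generic textbook-style reconstruction I wrote for this message, not code from the dissertation, and the prior values (alpha, beta) are made up:

import math
import random

def gibbs_cluster(docs, K, V, alpha=1.0, beta=0.5, iters=100, seed=0):
    """Collapsed Gibbs sampling for a mixture of multinomials.
    docs: list of documents, each a dict mapping word id -> count.
    V: vocabulary size. Returns one cluster assignment per document."""
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in docs]    # initial random assignments
    nk = [0] * K                            # documents per cluster
    ckw = [[0] * V for _ in range(K)]       # word counts per cluster
    ck = [0] * K                            # total words per cluster
    for d, doc in enumerate(docs):
        nk[z[d]] += 1
        for w, c in doc.items():
            ckw[z[d]][w] += c
            ck[z[d]] += c
    for _ in range(iters):
        for d, doc in enumerate(docs):
            nd = sum(doc.values())
            # Remove document d from its current cluster.
            k = z[d]
            nk[k] -= 1
            ck[k] -= nd
            for w, c in doc.items():
                ckw[k][w] -= c
            # log P(z_d = j | all other assignments), using the
            # Dirichlet-multinomial predictive distribution.
            logp = []
            for j in range(K):
                lp = math.log(nk[j] + alpha)
                lp -= math.lgamma(ck[j] + nd + V * beta) - math.lgamma(ck[j] + V * beta)
                for w, c in doc.items():
                    lp += math.lgamma(ckw[j][w] + c + beta) - math.lgamma(ckw[j][w] + beta)
                logp.append(lp)
            m = max(logp)
            j = rng.choices(range(K), weights=[math.exp(lp - m) for lp in logp])[0]
            # Reinsert document d under its newly sampled cluster.
            z[d] = j
            nk[j] += 1
            ck[j] += nd
            for w, c in doc.items():
                ckw[j][w] += c
    return z

docs = [{0: 5, 1: 1}, {0: 4, 1: 2}, {2: 6}, {2: 5, 1: 1}]
print(gibbs_cluster(docs, K=2, V=3))

Each pass resamples every document's cluster given everyone else's, with the multinomial parameters integrated out; after enough iterations the assignments are (approximately) draws from the posterior over clusterings.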
So anyway, excuse my trip down memory lane. :)

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse