One of the invited talks at ACL this summer (in Ann Arbor) was by Michael Jordan, Professor of Statistics at UC Berkeley. He spoke on a technique known as Latent Dirichlet Allocation, and on hierarchical extensions of it that can be used to automatically determine the number of clusters found in a set of data.
Michael Jordan is quite prolific, but the following subset of papers seems to touch upon the issues he talked about at ACL:

http://www.cs.berkeley.edu/~jordan/ir.html

Of particular note are the following papers, which seem to be the source for much of the information in the talk. The first gives a comprehensive discussion of the method and its various applications, and it's rather complicated stuff in fact. :) But, at its core, they take a Bayesian view (this is akin to generative models, which is the term sometimes used at ACL these days) where a collection of data is viewed as a mixture model, and they then try to separate out how many mixture components (clusters) are *really* in the data.

Hierarchical Dirichlet processes. Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Technical Report 653, Department of Statistics, University of California, Berkeley, 2004.
http://www.cs.berkeley.edu/~jordan/papers/hdp.ps

They have applied this technique to identifying the number of topics in a set of documents, and that work is described here:

Hierarchical topic models and the nested Chinese restaurant process. D. M. Blei, T. Griffiths, M. I. Jordan, and J. Tenenbaum. In S. Thrun, L. Saul, and B. Schoelkopf (Eds.), Advances in Neural Information Processing Systems (NIPS) 16, 2004.
http://www.cs.berkeley.edu/~jordan/papers/lda-crp.pdf

They treat the problem of topic identification as one of model search, where they are looking for the "tree" of topics that best describes the data. This can be viewed in some respects as being like agglomerative clustering, where you construct a tree of clusters.

Underlying the ideas in this talk is the notion of a Chinese Restaurant Process. Imagine that you have a very large restaurant with an infinite number of tables, and customers come in and sit at a table. Each customer sits at an existing table with probability proportional to the number of customers already seated there, or starts a new table with probability proportional to a concentration parameter. After all the customers are seated, you are done. :) If you picture each customer as an item to be clustered, and each table as a cluster, then you get the idea, more or less. That's how they propose to solve the cluster stopping problem. There is a toy simulation of this just below.

So, this is interesting work, and well worth checking out, especially if you are of a statistical frame of mind.

BTW, if you are wondering about the Dirichlet distribution, it's the conjugate prior of the multinomial distribution, which is typically what we use to represent textual data.
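To make the restaurant metaphor concrete, here is a toy simulation. This is just my own illustrative sketch of the standard CRP seating rule, with a made-up concentration parameter alpha; it is not code from any of the papers above.

import random

def crp(num_customers, alpha, seed=0):
    """Simulate a Chinese Restaurant Process: each customer joins an
    existing table with probability proportional to its occupancy, or
    opens a new table with probability proportional to alpha."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for _ in range(num_customers):
        weights = tables + [alpha]  # last entry is the "new table" option
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)        # open a new table (a new cluster)
        else:
            tables[k] += 1
    return tables

print(crp(1000, alpha=1.0))  # table sizes; the length of this list is the number of clusters

The number of occupied tables grows slowly (roughly alpha * log n), so the number of clusters is not fixed in advance but instead adapts to the amount of data.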
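And to illustrate the conjugacy point: if word counts are multinomial with parameters theta, and theta has a Dirichlet prior, then the posterior is again a Dirichlet whose parameters are just the prior values plus the observed counts. A tiny sketch, with made-up numbers:

# Dirichlet-multinomial conjugacy: posterior = Dirichlet(prior + counts)
prior = [1.0, 1.0, 1.0]   # symmetric Dirichlet prior over 3 word types
counts = [50, 30, 20]     # hypothetical observed word counts
posterior = [a + c for a, c in zip(prior, counts)]  # [51.0, 31.0, 21.0]

# The posterior mean is just a smoothed relative frequency estimate:
total = sum(posterior)
print([p / total for p in posterior])

This is why a Dirichlet prior amounts to add-alpha smoothing of the relative frequencies, which is part of what makes it such a natural fit for text.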
Remarkably enough, my own PhD dissertation is somewhat related to some of these ideas; see Chapter 4, in particular the section on using Gibbs Sampling to estimate the parameters of a Naive Bayesian model used for clustering. I was not actually working on cluster stopping at that time, so we would fix the number of clusters and then estimate the parameters of a Naive Bayesian model in order to assign the instances to clusters. We used both the EM algorithm and Gibbs Sampling to estimate the parameters and do the clustering (in order to compare them). So, it's not quite the same, and it's much simpler in fact, but you will find the term "Dirichlet distribution" in my dissertation, which must surely count for something. :)

In any case, here it is:

http://www.d.umn.edu/~tpederse/Pubs/diss.ps.gz
Ted Pedersen, Learning Probabilistic Models of Word Sense Disambiguation. PhD Dissertation, Southern Methodist University, May 1998.

So one part of the dissertation was comparing how well first order clustering methods (like we have in SenseClusters) performed in some simple word sense discrimination experiments, as compared to unsupervised Naive Bayesian classifiers whose parameters were estimated using the EM algorithm and Gibbs Sampling. In general, we found that the first order methods worked better, which is perhaps why SenseClusters relies on more traditional methods of clustering rather than on probabilistic models.

Why did we use the Naive Bayesian classifier for unsupervised learning? Because we found it to be very good in the supervised case. Not always the best, but always good. If you are curious what the Gibbs sampling side of that looks like, there is a stripped-down sketch just below.
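Here is a heavily simplified sketch of collapsed Gibbs sampling for clustering with an unsupervised Naive Bayes (mixture of multinomials) model, with the number of clusters K fixed. To be clear, this is a generic textbook-style reconstruction I wrote for this message, not code from the dissertation, and the prior values (alpha, beta) are made up:

import math
import random

def gibbs_cluster(docs, K, V, alpha=1.0, beta=0.5, iters=100, seed=0):
    """Collapsed Gibbs sampling for a mixture of multinomials.
    docs: list of documents, each a dict mapping word id -> count.
    V: vocabulary size. Returns one cluster assignment per document."""
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in docs]    # initial random assignments
    nk = [0] * K                            # documents per cluster
    ckw = [[0] * V for _ in range(K)]       # word counts per cluster
    ck = [0] * K                            # total words per cluster
    for d, doc in enumerate(docs):
        nk[z[d]] += 1
        for w, c in doc.items():
            ckw[z[d]][w] += c
            ck[z[d]] += c
    for _ in range(iters):
        for d, doc in enumerate(docs):
            nd = sum(doc.values())
            # Remove document d from its current cluster.
            k = z[d]
            nk[k] -= 1
            ck[k] -= nd
            for w, c in doc.items():
                ckw[k][w] -= c
            # log P(z_d = j | all other assignments), using the
            # Dirichlet-multinomial predictive distribution.
            logp = []
            for j in range(K):
                lp = math.log(nk[j] + alpha)
                lp -= math.lgamma(ck[j] + nd + V * beta) - math.lgamma(ck[j] + V * beta)
                for w, c in doc.items():
                    lp += math.lgamma(ckw[j][w] + c + beta) - math.lgamma(ckw[j][w] + beta)
                logp.append(lp)
            m = max(logp)
            j = rng.choices(range(K), weights=[math.exp(lp - m) for lp in logp])[0]
            # Reinsert document d under its newly sampled cluster.
            z[d] = j
            nk[j] += 1
            ck[j] += nd
            for w, c in doc.items():
                ckw[j][w] += c
    return z

docs = [{0: 5, 1: 1}, {0: 4, 1: 2}, {2: 6}, {2: 5, 1: 1}]
print(gibbs_cluster(docs, K=2, V=3))

Each pass resamples every document's cluster given everyone else's, with the multinomial parameters integrated out; after enough iterations the assignments are (approximately) draws from the posterior over clusterings.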
So anyway, excuse my trip down memory lane. :)

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse