Hi, (I know this is an old topic -- but I am resuscitating it on purpose!)
I've come across this article (Lafferty & Blei 2009)
http://www.citeulike.org/user/maximzhao/article/5084329
which seems to build upon Ted's log-likelihood ratio. The goal is exactly
the original poster's question: how to characterize a topic cluster with
its terms.

Ted, I'd be interested in knowing your opinion on this article; most
importantly, how easily it can be implemented and what improvement it
brings over LLR.

I hope this can help people on the list who are busy with topic clustering!

Sebastien

2009/8/12 Shashikant Kore <[email protected]>

> I was referring to the condition where a phrase is identified as good
> by LLR and is also a prominent feature of the centroid. But, as you
> clarified, the LLR score alone is a good indicator for top labels.
>
> Thanks for the pointer to co-occurrence statistics. I will study some
> literature on that.
>
> --shashi
>
> On Wed, Aug 12, 2009 at 11:23 PM, Ted Dunning<[email protected]> wrote:
> > On Wed, Aug 12, 2009 at 6:12 AM, Shashikant Kore <[email protected]> wrote:
> >
> >> Is this a necessary & sufficient condition for a good cluster label?
> >
> > I am not entirely clear what "this" is. My assertion is that a high LLR
> > score is sufficient evidence to use the term or phrase. I generally also
> > limit the number of terms as well, taking only the highest scoring ones.
> > The "necessary and sufficient" phrase comes from a rigorous mathematical
> > background that doesn't entirely apply here, where we are talking about
> > heuristics like this.
> >
> >> On a different note, is there any way to identify relationships among
> >> the top labels of the clusters? For example, if I have a cluster related
> >> to automobiles, I may get the companies (GM, Ford, Toyota) along with
> >> their popular models (Corolla, Cadillac) as top labels. How can I
> >> figure out that Toyota and Corolla are strongly related?
> >
> > Look at the co-occurrence statistics of the terms themselves. Use that to
> > form a sparse graph. Then do spectral clustering or agglomerative
> > clustering on the graph.
> >
> > That will give you clusters of terms that will give you much of what you
> > seek. Of course, the fact that the terms are being used to describe the
> > same cluster means that you have a good chance of just replicating the
> > label sets for your clusters.
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
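
For anyone on the list who wants to experiment with the LLR labelling
discussed in the quoted thread, here is a minimal Python sketch. It is only
an illustration under stated assumptions, not Ted's code or the method of the
article: documents are assumed to be lists of tokens, counts are document
frequencies, and the names llr and top_labels are hypothetical. Because LLR is
two-sided, the sketch also assumes you want to drop terms that are
under-represented in the cluster.

```python
from collections import Counter
from math import log


def _xlogx(x):
    # x * log(x), with the usual convention that 0 * log(0) = 0.
    return 0.0 if x == 0 else x * log(x)


def _entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return _xlogx(sum(counts)) - sum(_xlogx(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: cluster docs containing the term
    k12: other docs containing the term
    k21: cluster docs without the term
    k22: other docs without the term
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def top_labels(cluster_docs, other_docs, k=10):
    """Return up to k terms with the highest LLR for the cluster.

    Assumption: each document is a list of tokens, and a term counts
    once per document (document frequency).
    """
    n_in, n_out = len(cluster_docs), len(other_docs)
    df_in = Counter(t for doc in cluster_docs for t in set(doc))
    df_out = Counter(t for doc in other_docs for t in set(doc))
    scored = []
    for term, k11 in df_in.items():
        k12 = df_out[term]
        # Keep only terms that are relatively more frequent inside the
        # cluster (cross-multiplied to avoid dividing by zero).
        if k11 * n_out <= k12 * n_in:
            continue
        scored.append((llr(k11, k12, n_in - k11, n_out - k12), term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]
```

Calling top_labels(docs_in_cluster, docs_outside_cluster, k=5) then gives the
five highest-scoring terms, which matches the advice in the thread to keep
only the top-scoring ones. The same llr function could be applied to pairs of
labels (co-occurrence counts instead of cluster membership) to weight the
edges of the sparse term graph before clustering it.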
