Resuscitating this again... So, I committed MAHOUT-163 (thanks, Shashi!), which implements Ted's log-likelihood ideas, and I've been trying it out and comparing it against what Carrot2 does for generating labels. One thing I think would make sense is to extend MAHOUT-163 with the option to return phrases instead of just terms. My first thought is to create an n-gram model of the same field I'm clustering on (as that would allow the existing code to work unmodified; see the sketch below), but I wanted to hear what others think. Is it worth the time?
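To make that concrete, here is a minimal sketch of the shingling step, assuming plain Java; the class name NGramExtractor is illustrative only and not part of Mahout. The idea is just to emit space-joined word n-grams alongside the unigrams, so the existing MAHOUT-163 scoring code sees phrases as ordinary terms:

import java.util.ArrayList;
import java.util.List;

public class NGramExtractor {

  // Emit all n-grams of length 1..maxN, joined with single spaces,
  // so phrases can be counted and LLR-scored like any other term.
  public static List<String> nGrams(List<String> tokens, int maxN) {
    List<String> grams = new ArrayList<String>();
    for (int n = 1; n <= maxN; n++) {
      for (int start = 0; start + n <= tokens.size(); start++) {
        StringBuilder sb = new StringBuilder();
        for (int j = start; j < start + n; j++) {
          if (j > start) {
            sb.append(' ');
          }
          sb.append(tokens.get(j));
        }
        grams.add(sb.toString());
      }
    }
    return grams;
  }
}

In practice the same effect could probably be had at analysis time with Lucene's ShingleFilter, which would keep the rest of the vectorization pipeline unchanged.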
I'm also interested in other approaches people have taken.

-Grant

On Sep 5, 2009, at 4:58 PM, Sebastien Bratieres wrote:
> Hi,
>
> (I know this is an old topic -- but I am resuscitating it on purpose!)
>
> I've come across this article (Lafferty & Blei 2009)
> http://www.citeulike.org/user/maximzhao/article/5084329 which seems to build
> upon Ted's log-likelihood ratio. The goal is exactly the original poster's
> question: how to characterize a topic cluster with its terms.
> Ted, I'd be interested in knowing your opinion on this article; most
> importantly, how easily it can be implemented and what improvement it brings
> over LLR.
>
> I hope this can help people on the list who are busy with topic clustering!
>
> Sebastien
>
>
> 2009/8/12 Shashikant Kore <[email protected]>
>
>> I was referring to the condition where a phrase is identified as good
>> by LLR and is also a prominent feature of the centroid. But, as you
>> clarified, only the LLR score is a good indicator for top labels.
>>
>> Thanks for the pointer to co-occurrence statistics. I will study some
>> literature on that.
>>
>> --shashi
>>
>> On Wed, Aug 12, 2009 at 11:23 PM, Ted Dunning <[email protected]> wrote:
>>> On Wed, Aug 12, 2009 at 6:12 AM, Shashikant Kore <[email protected]> wrote:
>>>
>>>> Is this a necessary & sufficient condition for a good cluster label?
>>>
>>> I am not entirely clear what "this" is. My assertion is that a high LLR
>>> score is sufficient evidence to use the term or phrase. I generally also
>>> limit the number of terms, taking only the highest-scoring ones. The
>>> phrase "necessary and sufficient" comes from a rigorous mathematical
>>> background that doesn't entirely apply here, where we are talking about
>>> heuristics like this.
>>>
>>>> On a different note, is there any way to identify relationships among
>>>> the top labels of the clusters? For example, if I have a cluster related
>>>> to automobiles, I may get the companies (GM, Ford, Toyota) along with
>>>> their popular models (Corolla, Cadillac, ) as top labels. How can I
>>>> figure out that Toyota and Corolla are strongly related?
>>>
>>> Look at the co-occurrence statistics of the terms themselves. Use that to
>>> form a sparse graph. Then do spectral clustering or agglomerative
>>> clustering on the graph.
>>>
>>> That will give you clusters of terms that will give you much of what you
>>> seek. Of course, the fact that the terms are being used to describe the
>>> same cluster means that you have a good chance of just replicating the
>>> label sets for your clusters.
>>>
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
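For anyone who wants to experiment with the scoring Ted describes above, here is a self-contained sketch of Dunning's log-likelihood ratio on a 2x2 contingency table. The entropy formulation below is one standard way to compute it, not necessarily line-for-line what MAHOUT-163 does:

public class LlrSketch {

  private static double xLogX(long x) {
    return x == 0L ? 0.0 : x * Math.log(x);
  }

  // N times the Shannon entropy of the counts, left un-normalized so
  // the three terms below combine without dividing by the total.
  private static double entropy(long... counts) {
    long sum = 0L;
    double sumXLogX = 0.0;
    for (long x : counts) {
      sumXLogX += xLogX(x);
      sum += x;
    }
    return xLogX(sum) - sumXLogX;
  }

  // k11: term occurrences in the cluster, k12: term occurrences elsewhere,
  // k21: other terms in the cluster, k22: other terms elsewhere.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Equivalent to 2 * sum_ij k_ij * log(k_ij * N / (row_i * col_j))
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }
}

A term (or, with the n-gram change proposed above, a phrase) whose in-cluster count is surprisingly high relative to the rest of the corpus gets a large score, and taking only the top-scoring few yields the labels, as Ted describes.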
