On Wed, Aug 12, 2009 at 6:12 AM, Shashikant Kore <[email protected]> wrote:
> Is this a necessary & sufficient condition for a good cluster label?

I am not entirely clear what "this" is. My assertion is that a high LLR score is sufficient evidence to use the term or phrase. I generally also limit the number of terms, taking only the highest-scoring ones. The phrase "necessary and sufficient" comes from a rigorous mathematical background that doesn't entirely apply here, where we are talking about heuristics like this.

> On a different note, is there any way to identify relationship among
> the top labels of the clusters? For example, if I have cluster related
> automobiles, I may get the companies (GM, Ford, Toyota) along with
> their popular models (Corolla, Cadillac) as top labels. How can I
> figure out Toyota and Corolla are strongly related?

Look at the co-occurrence statistics of the terms themselves. Use that to form a sparse graph. Then do spectral clustering or agglomerative clustering on the graph. That will give you clusters of terms, which should provide much of what you seek.

Of course, the fact that the terms are being used to describe the same cluster means that you have a good chance of just replicating the label sets for your clusters.

-- 
Ted Dunning, CTO
DeepDyve
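
The co-occurrence suggestion above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used in any particular library: the toy documents, the function names, the choice of Jaccard overlap as the edge weight, and the 0.5 cutoff are all hypothetical. For the clustering step it uses single-link agglomerative clustering cut at the edge threshold (the simplest agglomerative variant, equivalent to taking connected components of the thresholded graph); spectral clustering would need a linear algebra library and is not shown.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(docs, min_jaccard=0.5):
    """Build a sparse term graph. Edge weight is the Jaccard overlap of the
    document sets in which the two terms appear (an assumed similarity)."""
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms
    for doc in docs:
        terms = sorted(set(doc))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    edges = {}
    for (a, b), both in pair_df.items():
        jaccard = both / (term_df[a] + term_df[b] - both)
        if jaccard >= min_jaccard:  # drop weak edges to keep the graph sparse
            edges[(a, b)] = jaccard
    return term_df, edges


def single_link_clusters(terms, edges):
    """Single-link agglomerative clustering cut at the edge threshold:
    merge along surviving edges, strongest first, using union-find."""
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for (a, b), _ in sorted(edges.items(), key=lambda kv: -kv[1]):
        parent[find(a)] = find(b)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), set()).add(t)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data: the label terms found in each document.
docs = [
    {"Toyota", "Corolla"},
    {"Toyota", "Corolla"},
    {"Toyota", "Corolla", "GM"},
    {"GM", "Cadillac"},
    {"GM", "Cadillac"},
    {"Ford", "GM"},
]
term_df, edges = cooccurrence_graph(docs)
clusters = single_link_clusters(term_df, edges)
print(clusters)
```

On this toy input, Toyota and Corolla end up in one group and GM and Cadillac in another, because each pair shares most of its documents, while the weak Toyota/GM and Ford/GM links fall below the cutoff. On real data the threshold and the similarity measure would need tuning.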
