Hi,

(I know this is an old topic -- but I am resuscitating it on purpose!)

I've come across this article (Lafferty & Blei 2009)
http://www.citeulike.org/user/maximzhao/article/5084329 which seems to build
on Ted's log-likelihood ratio. Its goal is exactly the original poster's
question: how to characterize a topic cluster by its terms.
Ted, I'd be interested in your opinion on this article; most
importantly, how easily it can be implemented and what improvement it brings
over LLR.

I hope this helps people on the list who are busy with topic clustering!

Sebastien


2009/8/12 Shashikant Kore <[email protected]>

> I was referring to the condition where a phrase is identified as good
> by LLR and is also a prominent feature of the centroid.  But, as you
> clarified, the LLR score alone is a good indicator for top labels.
>
> Thanks for the pointer for co-occurrence statistics. I will study some
> literature on that.
>
> --shashi
>
> On Wed, Aug 12, 2009 at 11:23 PM, Ted Dunning<[email protected]>
> wrote:
> > On Wed, Aug 12, 2009 at 6:12 AM, Shashikant Kore <[email protected]
> >wrote:
> >
> >>
> >> Is this a necessary & sufficient  condition for a good cluster label?
> >
> >
> > I am not entirely clear what "this" is.  My assertion is that a high LLR
> > score is sufficient evidence to use the term or phrase.  I generally also
> > limit the number of terms, taking only the highest-scoring ones.  The
> > phrase "necessary and sufficient" comes from a rigorous mathematical
> > background that doesn't entirely apply here, where we are talking about
> > heuristics like this.
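For readers following along: the LLR score Ted refers to is Dunning's G² statistic computed over a 2x2 contingency table of term counts inside versus outside the cluster. A minimal sketch (the cell naming k11..k22 is my own convention, not from the thread):

```python
import math

def xlogx(x):
    # Convention: 0 * log(0) = 0
    return 0.0 if x == 0 else x * math.log(x)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: count of the candidate term in the cluster
    k12: count of all other terms in the cluster
    k21: count of the candidate term outside the cluster
    k22: count of all other terms outside the cluster
    """
    n = k11 + k12 + k21 + k22
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    return 2.0 * (xlogx(n) - rows - cols + cells)
```

A term whose relative frequency inside the cluster matches its frequency elsewhere scores near zero; the candidate labels are then just the few highest-scoring terms, as Ted describes.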
> >
> >
> >> On a different note, is there any way to identify relationships among
> >> the top labels of the clusters? For example, if I have a cluster related
> >> to automobiles, I may get the companies (GM, Ford, Toyota) along with
> >> their popular models (Corolla, Cadillac, ...) as top labels. How can I
> >> figure out that Toyota and Corolla are strongly related?
> >
> >
> > Look at the co-occurrence statistics of the terms themselves.  Use that
> > to form a sparse graph.  Then do spectral clustering or agglomerative
> > clustering on the graph.
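A minimal sketch of that pipeline, with connected components standing in for the spectral or agglomerative step Ted suggests (the toy corpus and the min_count threshold are illustrative assumptions, not from the thread):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(docs, min_count=2):
    """Build a sparse term graph: an edge links two terms that
    co-occur in at least min_count documents."""
    counts = Counter()
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    graph = {}
    for (a, b), c in counts.items():
        if c >= min_count:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph

def components(graph):
    """Group terms by connected component of the co-occurrence graph."""
    seen, out = set(), []
    for node in graph:
        if node in seen:
            continue
        comp, stack = set(), [node]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(graph[n] - comp)
        seen |= comp
        out.append(comp)
    return out
```

On a toy corpus where "toyota" repeatedly co-occurs with "corolla" and "camry", and "gm" with "cadillac", this yields two term clusters, grouping each maker with its models -- which is the relationship the question asks about.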
> >
> > That will give you clusters of terms covering much of what you seek.  Of
> > course, the fact that the terms are being used to describe the same
> > cluster means that you have a good chance of just replicating the label
> > sets for your clusters.
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>
