Whoa....

No.  It sounds like I have muddied things thoroughly.  What I was saying is
that there are times when tf.idf and LLR agree and times when they disagree.
In my experience, most of the disagreements are cases where tf.idf is
over-weighting coincidental occurrences or where both scores are producing
poor results.

If a phrase or term is marked as good by LLR and is a prominent feature of
the centroid, that is fine.
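
For reference, here is a minimal sketch of the LLR score for a 2x2 table of
counts, in Python.  The count layout is an assumption for illustration (k11 =
occurrences of the term inside the cluster, k12 = occurrences of the term
elsewhere, k21 = other terms inside the cluster, k22 = other terms elsewhere),
and the function names are hypothetical rather than taken from any particular
code base.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is taken to be 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts (N times the entropy).
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio for a 2x2 contingency table of counts.
    # Assumed layout: k11 = term in cluster, k12 = term elsewhere,
    # k21 = other terms in cluster, k22 = other terms elsewhere.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


# A term concentrated in the cluster scores high ...
concentrated = llr(100, 10, 1000, 100000)
# ... while a term that is equally common everywhere scores about zero,
# no matter how large its raw count in the cluster is.
spread_out = llr(100, 10000, 1000, 100000)
```

The second table is where a raw in-cluster count looks impressive but the term
is just as frequent outside the cluster, so the score cancels out.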

On Tue, Aug 11, 2009 at 10:54 AM, Shashikant Kore <[email protected]> wrote:

> On Tue, Aug 11, 2009 at 8:57 PM, Ted Dunning<[email protected]> wrote:
> > If you expand the LLR equation and look at which terms are big, you will
> > see k_11 * log(mumble) as an important term for many words.  Usually, this
> > is about the same as tf.idf since mumble is about the same as the term
> > frequency.  For a single document, tf.idf is a very close approximation
> > of LLR.  With many documents, the situation can change dramatically,
> > however, and you can get cancellation effects that eliminate documents
> > that would otherwise have high tf.idf.  These are generally the terms
> > that lead to over-fitting with methods like naive bayes and are often not
> > such great cluster descriptors.
> >
>
> Let me restate what I understood.
>
> If a phrase is identified as a prominent phrase by LLR and it also
> happens to be the top-weighted feature in the centroid vector, it is
> not a good candidate for a cluster label.
>
> Is this correct?




-- 
Ted Dunning, CTO
DeepDyve
