If you expand the LLR equation and look at which terms are big, you will see k_11 * log(mumble) as an important term for many words. Expanding the G^2 form, LLR = 2 * sum_ij k_ij * log(k_ij * N / (k_i. * k_.j)), so for the k_11 cell mumble = k_11 * N / (k_1. * k_.1). Usually this is about the same as tf.idf, since k_11 is essentially the term frequency and log(mumble) behaves much like an idf. For a single document, tf.idf is a very close approximation of LLR. With many documents, however, the situation can change dramatically: you can get cancellation effects that eliminate terms that would otherwise have high tf.idf. These are generally the terms that lead to over-fitting with methods like Naive Bayes, and they are often not such great cluster descriptors.
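As a quick sanity check, here is a rough Python sketch of that 2x2 computation (llr_2x2 and the counts below are made-up illustrations, not Mahout's implementation):

import math

def llr_2x2(k11, k12, k21, k22):
    # G^2 = 2 * sum_ij k_ij * log(k_ij * N / (row_i * col_j))
    N = k11 + k12 + k21 + k22
    r1, r2 = k11 + k12, k21 + k22
    c1, c2 = k11 + k21, k12 + k22
    def term(k, row, col):
        # each cell contributes k * log(mumble) for that cell's mumble
        return k * math.log(k * N / (row * col)) if k > 0 else 0.0
    return 2.0 * (term(k11, r1, c1) + term(k12, r1, c2) +
                  term(k21, r2, c1) + term(k22, r2, c2))

# k11 = term in cluster, k12 = other terms in cluster,
# k21 = term outside cluster, k22 = other terms outside cluster
print(llr_2x2(100, 9900, 200, 989800))

# the dominant cell alone, k11 * log(k11 * N / (k1. * k.1)),
# already looks a lot like tf * idf:
k11, N, r1, c1 = 100, 1000000, 10000, 300
print(k11 * math.log(k11 * N / (r1 * c1)))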
On Tue, Aug 11, 2009 at 4:29 AM, Shashikant Kore <[email protected]> wrote:

> A simple interpretation of this can be given by the fact that when a
> phrase is quite common in a small cluster but uncommon out of cluster,
> it is going to have a higher (TF-IDF) weight in the document vector.
> Such a phrase is identified as prominent by LLR as well.
>
> Is there any other reason for this to occur?

--
Ted Dunning, CTO
DeepDyve
