If you expand the LLR equation and look at which terms are big, you will see k_11 * log(mumble) as an important term for many words. Expanding the G^2 form, LLR = 2 * sum_ij k_ij * log(k_ij * N / (k_i. * k_.j)), so for the k_11 cell mumble = k_11 * N / (k_1. * k_.1). Usually this is about the same as tf.idf, since k_11 is essentially the term frequency and log(mumble) behaves much like an idf. For a single document, tf.idf is a very close approximation of LLR. With many documents, however, the situation can change dramatically: you can get cancellation effects that eliminate terms that would otherwise have high tf.idf. These are generally the terms that lead to over-fitting with methods like Naive Bayes, and they are often not such great cluster descriptors.
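As a quick sanity check, here is a rough Python sketch of that 2x2 computation (llr_2x2 and the counts below are made-up illustrations, not Mahout's implementation):

import math

def llr_2x2(k11, k12, k21, k22):
    # G^2 = 2 * sum_ij k_ij * log(k_ij * N / (row_i * col_j))
    N = k11 + k12 + k21 + k22
    r1, r2 = k11 + k12, k21 + k22
    c1, c2 = k11 + k21, k12 + k22
    def term(k, row, col):
        # each cell contributes k * log(mumble) for that cell's mumble
        return k * math.log(k * N / (row * col)) if k > 0 else 0.0
    return 2.0 * (term(k11, r1, c1) + term(k12, r1, c2) +
                  term(k21, r2, c1) + term(k22, r2, c2))

# k11 = term in cluster, k12 = other terms in cluster,
# k21 = term outside cluster, k22 = other terms outside cluster
print(llr_2x2(100, 9900, 200, 989800))

# the dominant cell alone, k11 * log(k11 * N / (k1. * k.1)),
# already looks a lot like tf * idf:
k11, N, r1, c1 = 100, 1000000, 10000, 300
print(k11 * math.log(k11 * N / (r1 * c1)))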
On Tue, Aug 11, 2009 at 4:29 AM, Shashikant Kore <[email protected]> wrote:

> A simple interpretation of this can be given by the fact that when a
> phrase is quite common in a small cluster but uncommon out of cluster,
> it is going to have a higher (TF-IDF) weight in the document vector.
> Such a phrase is identified as prominent by LLR as well.
>
> Is there any other reason for this to occur?

--
Ted Dunning, CTO
DeepDyve
