I have some R code handy. It should be easy to translate to Mahoutish Java.
entropy <- function(x) {-sum(x*log((x+(x==0))/sum(x)))}
The only tricky bit here is that sum will sum up a vector or a matrix and
(x==0) returns a matrix or vector of booleans that is the same shape as x.
Adding a boolean leaves every element where x != 0 alone and bumps every zero
up to one. That keeps the argument of the log finite (no log(0) = -Inf), and
whatever the log returns for those entries doesn't matter because we are
(elementwise) multiplying by x, which is zero there. Elements of x cannot be
< 0 by definition (they are counts). The point of all that is to get
0 log 0 = 0 as desired. If sum(x) = 1, then this function gives you Shannon
entropy.
and then:
llr <- function(x) { 2*(entropy(rowSums(x)) + entropy(colSums(x)) - entropy(x)) }
The functions rowSums and colSums do pretty much what you would expect.
Given a matrix, they add up the elements in each row or column (to get a
vector). I moved around some minus signs to make me happy, btw. All that
matters in the end is that entropy and llr both give positive values.
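
For the Java side, here is a rough sketch of a direct translation. This is
plain Java with made-up names (Llr, entropy, llr), not an existing Mahout
class, so treat it as a starting point rather than the final implementation:

public class Llr {
  // Unnormalized Shannon entropy of a vector of counts:
  // -sum(x_i * log(x_i / sum(x))), with the convention that 0 log 0 = 0.
  public static double entropy(double[] x) {
    double total = 0;
    for (double v : x) {
      total += v;
    }
    double h = 0;
    for (double v : x) {
      if (v > 0) {      // skip zeros so they contribute 0, like the R trick above
        h += v * Math.log(v / total);
      }
    }
    return -h;
  }

  // LLR for the 2x2 contingency table
  //   k11  k12
  //   k21  k22
  // llr = 2 * (entropy(row sums) + entropy(column sums) - entropy(all cells))
  public static double llr(double k11, double k12, double k21, double k22) {
    double rowEntropy = entropy(new double[] {k11 + k12, k21 + k22});
    double colEntropy = entropy(new double[] {k11 + k21, k12 + k22});
    double matrixEntropy = entropy(new double[] {k11, k12, k21, k22});
    return 2 * (rowEntropy + colEntropy - matrixEntropy);
  }
}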
Some test cases:
> entropy(c(1,1))
[1] 1.386294
> llr(matrix(c(1,0,0,1), nrow=2))
[1] 2.772589
> llr(matrix(c(10,0,0,10), nrow=2))
[1] 27.72589
> llr(matrix(c(5,1995,0,100000), nrow=2))
[1] 39.33052
> llr(matrix(c(1000,1995,1000,100000), nrow=2))
[1] 4730.737
> llr(matrix(c(1000,1000,1000,100000), nrow=2))
[1] 5734.343
> llr(matrix(c(1000,1000,1000,99000), nrow=2))
[1] 5714.932
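
If the translation above is right, a quick check against those R numbers might
look like this (expected values in the comments are the R results above; the R
matrices are filled column by column, but llr is unchanged by transposing the
table, so the counts can be passed straight through):

public class LlrCheck {
  public static void main(String[] args) {
    System.out.println(Llr.entropy(new double[] {1, 1}));  // expect about 1.386294
    System.out.println(Llr.llr(1, 0, 0, 1));               // expect about 2.772589
    System.out.println(Llr.llr(10, 0, 0, 10));             // expect about 27.72589
    System.out.println(Llr.llr(5, 1995, 0, 100000));       // expect about 39.33052
    System.out.println(Llr.llr(1000, 1995, 1000, 100000)); // expect about 4730.737
    System.out.println(Llr.llr(1000, 1000, 1000, 100000)); // expect about 5734.343
    System.out.println(Llr.llr(1000, 1000, 1000, 99000));  // expect about 5714.932
  }
}
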
On Mon, Aug 10, 2009 at 11:11 AM, Shashikant Kore <[email protected]> wrote:
> Ted,
>
> Thank you for the elaborate explanation.
>
> I think I just hit the classic "this is left as an exercise to the
> reader." One last query (I hope).
>
> On your blog you have defined LLR as follows.
>
> >
> > LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
> > where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log
> > (k_ij / sum(k))
> >
>
> I am unable to follow this. Do you have any code to explain this? Or an
> elaboration of the following example would be equally great.
>
> > Also suppose that the corpus has 100,000 documents in it; we have (k11,
> > k12, k21, k22, llr) as
> >
> > 5, 1995, 100, 97900, 2.96
> >
>
> Thanks,
>
> --shashi
>
> > On Mon, Aug 10, 2009 at 10:32 PM, Ted Dunning <[email protected]> wrote:
> > On Mon, Aug 10, 2009 at 6:51 AM, Shashikant Kore <[email protected]> wrote:
> >
> >> I have a little difficulty in understanding LLR for cluster labels.
> >>
> >
> > Sorry about that. I will try to be more clear.
> >
> >
> >> For a phrase, if
> >> - in-cluster doc frequency is inDF
> >> - out-of-cluster doc frequency is outDF
> >> - size of the cluster is clusterSize
> >> - size of the corpus is corpusSize
> >>
> >
> > Good.
> >
> >
> >> how do I calculate the LLR?
> >>
> >
> > Assuming that the corpus is a superset of the cluster, form the table
> > using:
> >
> > k11 = inDF
> > k12 = clusterSize - inDF
> > k21 = outDF
> > k22 = corpusSize - clusterSize - outDF
> >
> > If the cluster is not a subset of the corpus, then k22 = corpusSize - outDF.
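
A minimal sketch of that table construction in Java, assuming the cluster is a
subset of the corpus and reusing the Llr.llr sketch above (the method name is
made up):

static double labelScore(long inDF, long outDF, long clusterSize, long corpusSize) {
  double k11 = inDF;                              // in cluster, has the phrase
  double k12 = clusterSize - inDF;                // in cluster, missing the phrase
  double k21 = outDF;                             // out of cluster, has the phrase
  double k22 = corpusSize - clusterSize - outDF;  // out of cluster, missing the phrase
  return Llr.llr(k11, k12, k21, k22);
}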
> >
> >
> >> I have difficulty in mapping these numbers to Event A & Event B that
> >> you talked about on your blog.
> >>
> >
> > Event A is in-cluster, Event B is out-of-cluster.
> >
> >
> >> From the basic numbers, I could come up with inCluster percentage. But
> >> that doesn't help much. For example, let's say my cluster size is
> >> 2000 documents and corpus size is 1000. A phrase which occurs in the
> >> cluster in 5 documents and doesn't appear outside cluster has
> >> inCluster percentage of 100. Another phrase which occurs 1000 times in
> >> the cluster and 1000 times outside cluster. This phrase has inCluster
> >> percentage of 50. Intuitively, this is a better candidate for a label
> >> than the previous one. But I am unable to figure out how these numbers
> >> need to be normalized.
> >>
> >
> > First, the corpus size should normally be much larger than your cluster
> > size. With document categorization, the ratio is enormous; with
> > clustering it should still be at least one order of magnitude larger.
> >
> > So let's take your example and add a case where in-cluster = 5 and
> > out-cluster = 5, another where in-cluster = 5, out-cluster = 100, and
> > another where in-cluster = 1000 and out-cluster = 45,000.
> >
> > Also suppose that the corpus has 100,000 documents in it; we have (k11,
> > k12, k21, k22, llr) as
> >
> > 5, 1995, 0, 98000, 39.33
> > 5, 1995, 5, 97995, 25.47
> > 5, 1995, 100, 97900, 2.96
> > 1000, 1000, 1000, 97000, 5714.93
> > 1000, 1000, 45000, 48000, 2.04
> >
> > According to llr, your original case of 5 in and 0 out is definitely
> > worthy of mention, and the case with 5 in and 5 out is somewhat less
> > interesting. The case with 5 in and 100 out is not very interesting, nor
> > is the case with 1000 in and 45000 out. Your case with 1000 in and 1000
> > out is the most exciting of them all.
> >
> > The most useful way to think of this is as the percentage of in-cluster
> > documents that have the feature (term) versus the percentage out, keeping
> > in mind that both percentages are uncertain since we have only a sample
> > of all possible documents. Where these percentages are very different and
> > that difference is unlikely to be due to accidental variation, LLR will
> > be large.
> >
> >
> > I don't know if I mentioned this on the blog, but it is often nice to
> > rescale these scores by taking the square root and adding a sign
> > according to whether k11/(k11+k12) > k21/(k21+k22). This gives you a
> > number that has the same scale as a normal distribution, so lots more
> > people will have good intuitions about what is large and what is not.
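
A small sketch of that rescaling, again reusing the Llr.llr sketch above (the
method name is made up):

static double signedRootLlr(double k11, double k12, double k21, double k22) {
  double raw = Llr.llr(k11, k12, k21, k22);
  // Positive when the feature is over-represented in-cluster (first row),
  // negative when it is under-represented.
  double sign = (k11 / (k11 + k12) > k21 / (k21 + k22)) ? 1.0 : -1.0;
  return sign * Math.sqrt(raw);
}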
> >
>
--
Ted Dunning, CTO
DeepDyve