I don't have my code here to verify the result. Can you show the calculation here, I mean the values of the logs etc.? That may give a better idea.

On Tue, Jan 12, 2010 at 6:19 PM, Shashikant Kore <[email protected]> wrote:

> Hi,
>
> I am looking at LLR scores for two terms in a cluster which seem
> non-intuitive to me.
>
> The corpus size is 706,120 and size of the cluster is 21964.
>
> Term1 appears in 904 docs in the cluster and 1144 docs outside the
> cluster.
> Term2 appears in 36 docs in the cluster and 60280 docs outside the
> cluster.
>
> As I can see Term1 is rarer outside the cluster, but common in the
> cluster (relatively speaking.) But, when I calculate LLR scores,
> Term1's score (3569) is lower than that of Term2 (3622). This looks
> counter-intuitive to me. Is it the case that LLR score is higher if
> term is common outside the cluster and rare inside? Can this be
> "fixed"?
>
> The k11, k12, k21, k22 values for Term1 and Term2 are as follows if you
> wish to calculate.
>
> Term1
> k11 904
> k12 21060
> k21 1144
> k22 683012
>
> Term2
> k11 36
> k12 21928
> k21 60280
> k22 623876
>
> Thanks,
>
> --shashi
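
For reference, here is a minimal Python sketch of the calculation using the k11..k22 values quoted above. It assumes the score is the standard G^2 log-likelihood ratio for a 2x2 table, 2 * sum(observed * ln(observed / expected)); that assumption, and the helper names `llr` and `expected_k11`, are mine and not taken from the code used in the original post. It also prints the k11 count expected under independence, so the observed in-cluster count can be compared against it.

```python
import math


def expected_k11(k11, k12, k21, k22):
    """In-cluster count expected if term and cluster were independent."""
    n = k11 + k12 + k21 + k22
    return (k11 + k12) * (k11 + k21) / n


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)   # in cluster, outside cluster
    cols = (k11 + k21, k12 + k22)   # docs with term, docs without term
    observed = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n   # expected count under independence
            if o > 0:                   # a zero cell contributes nothing
                total += o * math.log(o / e)
    return 2.0 * total


# Values copied from the quoted message.
terms = {
    "Term1": (904, 21060, 1144, 683012),
    "Term2": (36, 21928, 60280, 623876),
}

for name, k in terms.items():
    print(f"{name}: k11={k[0]} "
          f"expected k11={expected_k11(*k):.1f} "
          f"LLR={llr(*k):.1f}")
```

With these inputs the sketch lands on roughly the reported scores (about 3569 for Term1 and 3622 for Term2). The expected k11 is roughly 64 for Term1, against 904 observed, and roughly 1,876 for Term2, against 36 observed. In this form the statistic measures how far the table is from independence and does not itself carry the direction of the deviation, so comparing observed k11 with expected k11 is one way to tell over-represented terms from under-represented ones.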
