Hi,

The counts for the two events are:
                    Event A      Everything but A
Event B             k11 = 7      k12 = 8
Everything but B    k21 = 13     k22 = 300,000

According to the code, I get:

rowEntropy = entropy(7,8) + entropy(13, 300,000) = 222
colEntropy = entropy(7,13) + entropy(8, 300,000) = 152
matrixEntropy = entropy(7, 8, 13, 300,000) = 458

thus,

LLR=2.0*(458-222-152) = 168
similarityScore = 1 - 1/(1+168) = 0.994
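
To make the arithmetic above easy to reproduce, here is a minimal Python
sketch of the calculation as I understand it. It is not the library code
itself: the helper name entropy, the variable names, and the use of base-2
logarithms are my assumptions, chosen because they appear to reproduce the
rounded figures above.

```python
import math


def entropy(*counts):
    """Unnormalized entropy in bits: -sum(k * log2(k / total)).

    Assumption: base-2 logs; zero counts contribute nothing to the sum.
    """
    total = sum(counts)
    return -sum(k * math.log2(k / total) for k in counts if k > 0)


# Counts from the 2x2 table above.
k11, k12, k21, k22 = 7, 8, 13, 300_000

row_entropy = entropy(k11, k12) + entropy(k21, k22)
col_entropy = entropy(k11, k21) + entropy(k12, k22)
matrix_entropy = entropy(k11, k12, k21, k22)

# Same combination as in the hand calculation above.
llr = 2.0 * (matrix_entropy - row_entropy - col_entropy)
similarity_score = 1.0 - 1.0 / (1.0 + llr)

print(round(row_entropy), round(col_entropy), round(matrix_entropy))
print(round(llr), round(similarity_score, 3))
```

Because the score is 1 - 1/(1 + LLR), any LLR above roughly 100 already
maps to a score above 0.99, which is why nearly all of my item pairs end up
bunched together near 1.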

My problem is that the similarity scores I get for all items are this
high, which makes it hard to identify the ones that are really similar.

As you can see, the counts for events A and B are quite small, while k22
is very large. This pattern is quite common in my dataset.

So my question is: what kind of adjustment could I make to bring the
similarity scores down to a more reasonable range?

Please shed some light on this. Thanks in advance!
