Good point.

By the way, why use counts instead of probabilities? Is it for an easy and
efficient implementation?
Also, do you think the similarity score computed from counts might differ
much from one computed from probabilities?
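
To make the counts-versus-probabilities question concrete, here is a minimal Python sketch. It assumes the entropy-based LLR formula quoted further down in this thread and is not the library code itself; the helper names `entropy` and `llr` are my own. Because that entropy is computed on raw counts, dividing every cell by the total N should simply divide the LLR by N.

```python
import math

def entropy(*cells):
    """Unnormalized entropy: -sum(x * ln(x / total)), skipping empty cells."""
    total = sum(cells)
    return -sum(x * math.log(x / total) for x in cells if x > 0)

def llr(k11, k12, k21, k22):
    """LLR as written in this thread: 2 * (matrix - row - column entropy)."""
    row = entropy(k11, k12) + entropy(k21, k22)
    col = entropy(k11, k21) + entropy(k12, k22)
    matrix = entropy(k11, k12, k21, k22)
    return 2.0 * (matrix - row - col)

counts = (7, 8, 13, 300000)                 # counts from the original question
n = sum(counts)
proportions = tuple(k / n for k in counts)  # same table as joint probabilities

llr_counts = llr(*counts)
llr_props = llr(*proportions)
print(llr_counts, llr_props, llr_counts / llr_props)  # ratio is approximately n
```

With counts the score grows with the amount of evidence, while with proportions the same table shape always gives the same value regardless of sample size. If N is the same for every pair, the two versions differ only by a constant factor, so the ordering of pairs would not change.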

Thank you very much for your prompt reply.


On Wed, Apr 10, 2013 at 5:50 PM, Sean Owen <sro...@gmail.com> wrote:

> These events do sound 'similar'. They occur together about half the
> time either one of them occurs. You might have many pairs that end up
> being similar for the same reason, and this is not surprising. They're
> all "really" similar.
>
> The mapping here from LLR's range of [0,inf) to [0,1] is pretty
> arbitrary, but it is an increasing function of LLR. So the ordering
> you get is exactly the ordering LLR dictates. Yes, you are going to get
> a number of values near 1 at the top, but does it matter?
>
> LLR = 0 and similarity = 0 when the events appear perfectly
> independent. For example, if A and B each occur with probability 10%,
> independently, then out of 100 observations you might have k11 = 1,
> k12 = 9, k21 = 9, k22 = 81. The matrix (joint probability) has no more
> info than the marginal probabilities, so the matrix entropy == row
> entropy + col entropy and LLR = 0.
>
>
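
The independence example above can be checked numerically. This is a rough Python sketch, assuming the entropy formula from the message quoted below and the 1 - 1/(1 + LLR) mapping used there; the function names are illustrative and not taken from the library.

```python
import math

def entropy(*cells):
    """Unnormalized entropy of raw counts, in nats; empty cells contribute 0."""
    total = sum(cells)
    return -sum(x * math.log(x / total) for x in cells if x > 0)

def llr(k11, k12, k21, k22):
    """2 * (matrix entropy - row entropy - column entropy), clamped at 0."""
    row = entropy(k11, k12) + entropy(k21, k22)
    col = entropy(k11, k21) + entropy(k12, k22)
    matrix = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (matrix - row - col))

def similarity(score):
    """Maps [0, inf) onto [0, 1); increasing in the LLR score."""
    return 1.0 - 1.0 / (1.0 + score)

# A and B each at 10%, independent, over 100 observations.
independent = llr(1, 9, 9, 81)
print(independent, similarity(independent))  # both should be (very close to) 0

# The mapping preserves order, so the ranking is the ranking by LLR.
scores = [similarity(x) for x in (0.0, 1.0, 10.0, 168.0)]
print(scores)
```

Since the mapping is increasing, sorting pairs by similarity or by raw LLR gives the same ranking; only the spacing near 1 gets compressed.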
> On Wed, Apr 10, 2013 at 10:15 AM, Phoenix Bai <baizh...@gmail.com> wrote:
> > Hi,
> >
> > the counts for two events are:
> >                     Event A    Everything but A
> > Event B             k11=7      k12=8
> > Everything but B    k21=13     k22=300,000
> > according to the code, I will get:
> >
> > rowEntropy = entropy(7, 8) + entropy(13, 300000) = 222
> > colEntropy = entropy(7, 13) + entropy(8, 300000) = 152
> > matrixEntropy = entropy(7, 8, 13, 300000) = 458
> >
> > thus,
> >
> > LLR=2.0*(458-222-152) = 168
> > similarityScore = 1 - 1/(1+168) = 0.994
> >
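
For anyone who wants to retrace the arithmetic above, here is a small Python sketch of the same steps. It is a reconstruction from the formulas in this message, not the library code, and the `base` parameter is an assumption added for illustration: the figures quoted above seem to match base-2 logarithms, whereas natural logarithms give smaller entropies and a smaller LLR.

```python
import math

def entropy(*cells, base=2.0):
    """Unnormalized entropy of raw counts in the given log base."""
    total = sum(cells)
    return -sum(x * math.log(x / total, base) for x in cells if x > 0)

def llr_steps(k11, k12, k21, k22, base=2.0):
    """Returns (rowEntropy, colEntropy, matrixEntropy, LLR, similarityScore)."""
    row = entropy(k11, k12, base=base) + entropy(k21, k22, base=base)
    col = entropy(k11, k21, base=base) + entropy(k12, k22, base=base)
    matrix = entropy(k11, k12, k21, k22, base=base)
    llr = 2.0 * (matrix - row - col)
    return row, col, matrix, llr, 1.0 - 1.0 / (1.0 + llr)

# Counts from the table above: k11=7, k12=8, k21=13, k22=300,000.
row, col, matrix, llr, sim = llr_steps(7, 8, 13, 300000)
print(round(row), round(col), round(matrix), round(llr), round(sim, 3))

# Same table with natural logs: smaller than the base-2 LLR by a factor of ln(2).
llr_nats = llr_steps(7, 8, 13, 300000, base=math.e)[3]
print(llr_nats)
```

Either way the score stays far from 0, so changing the log base alone would not pull the similarity away from 1.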
> > So, my problem is:
> > the similarity scores I get for all the items are this high, which
> > makes it hard to identify the truly similar ones.
> >
> > As you can see, the counts for events A and B are quite small, while
> > the count for k22 is quite high. This pattern is quite common in
> > my dataset.
> >
> > So, my question is:
> > what kind of adjustment could I make to bring the similarity score down
> > to a more reasonable range?
> >
> > Please shed some light. Thanks in advance!
>
