I chose against porting all the similarity measures to the DSL version of
the cooccurrence analysis for two reasons. First, adding the measures in a
generalizable way makes the code very hard to read. Second, in practice, I
have never seen anything give better results than LLR. As Ted pointed
out, much of the motivation for the other similarity measures comes from
wanting to predict ratings, which people never do in practice. I think we
should restrict ourselves to approaches that work with implicit, count-like
data.
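
For concreteness, here is a minimal, self-contained Scala sketch of the
G^2 / LLR statistic Ted describes below, computed on a 2x2 table of implicit
cooccurrence counts. This is an illustration only, under my own naming
(LLRSketch, k11..k22, entropy), and is not meant to mirror the actual
Mahout code:

// Minimal sketch of the G^2 / log-likelihood ratio statistic for a
// 2x2 cooccurrence table. Names are illustrative, not Mahout's API.
object LLRSketch {

  // "Unnormalized entropy" helper: H(k) = (sum k_i) * ln(sum k_i) - sum k_i * ln(k_i)
  private def xLogX(x: Long): Double =
    if (x == 0L) 0.0 else x * math.log(x.toDouble)

  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // k11 = users who interacted with both items A and B
  // k12 = users with B but not A
  // k21 = users with A but not B
  // k22 = users with neither
  // Higher scores mean the cooccurrence is less likely to be chance alone.
  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val colEntropy = entropy(k11 + k21, k12 + k22)
    val matEntropy = entropy(k11, k12, k21, k22)
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy)) // clamp round-off
  }

  def main(args: Array[String]): Unit = {
    // Items that cooccur far more often than chance: large score
    println(logLikelihoodRatio(100, 10, 15, 10000))
    // Counts close to what independence predicts: score near zero
    println(logLikelihoodRatio(10, 100, 1000, 10000))
  }
}

Note the score moves in the "similarity" direction discussed below: large
values flag cooccurrences that are unlikely under independence, while values
near zero look like chance.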

-s
On 06.08.2014 16:58, "Ted Dunning" <[email protected]> wrote:

> On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> >
> > > I suppose in that context LLR is considered a distance (higher scores
> > > mean more `distant` items, co-occurring by chance only)?
> > >
> >
> > Self-correction on this one -- having given a quick look at the LLR paper
> > again, it looks like it is actually a similarity (higher scores meaning
> > more stable co-occurrences, i.e. it moves in the opposite direction of
> > the p-value if it had been a classic test).
> >
>
> LLR is a classic test.  It is essentially Pearson's chi^2 test without the
> normal approximation.  See my papers[1][2] introducing the test into
> computational linguistics (which ultimately brought it into all kinds of
> fields including recommendations) and also references for the G^2 test[3].
>
> [1] http://www.aclweb.org/anthology/J93-1003
> [2] http://arxiv.org/abs/1207.1847
> [3] http://en.wikipedia.org/wiki/G-test
>
