OK, so here's a recap of what I have gathered (it also illustrates the source of my initial confusion).
(1) The original test assumes that there are two Bernoulli random variables, A and B, whose draws are conducted at the same time. Hence, the logic goes, the LLR statistic will asymptotically approximate a chi-squared distribution with df = 1. Another point here is that for discrete events, the information measures baked into the statistic work better _to compare_ things than statistics that lean on the normality assumptions behind the chi-squared approximation, at least for the few datasets studied; the hope is that this extrapolates to other datasets. Note that I have found no accuracy study showing that we can use the chi-squared distribution to derive p-values and make confidence-quantification statements such as "we have 95% confidence that the behavior of the subject was not a coincidence". That is perhaps due in part to the approximation, and in part to the sampling assumptions of the problems I discuss below. (The statistic itself is sketched right after this item and (2).)

(2) The "Surprise and coincidence" paper evaluates what I will call, for now, the "bigram problem", which is a significantly different setting from the CF setting of the co-occurrence paper. In the bigram problem, as far as I gathered from the numbers, we collect N bigrams. N can be thought of as the number of edges in a directed labeled graph connecting token-representing nodes, with labels representing counts. Obviously, this graph can then be represented as a sparse co-occurrence matrix. Under this setting, "A&B" is a matrix element, "withA" and "withB" are the corresponding elements of colSums(), and the total trial count (N) is the elementwise sum of the matrix (its elementwise 1-norm). This problem formulation for the co-occurrence matrix, as we will see, is significantly different from the CF co-occurrence construction; the bookkeeping is sketched below. Another thing is that with the bigram problem we encounter a stretch of the sampling assumptions. In order to comply with the assumptions made in (1), and keeping in mind that 1 bigram = 1 draw, we have to assume we do a Bernoulli draw for every term with every bigram. But the actual data tells us that exactly two variables (A and B) come up true at every draw; i.e., we really do one hypergeometric draw (or multinomial, if you are willing to consider self-co-occurrences) with k = 2, and treat the rest of the term set as having been drawn as 0s. Then, again, we can demonstrate that the statistic still works better to _compare_ things (or to take top-n), since the assumption stretch applies equally to all pairs; but somehow it still feels like we cannot reliably demonstrate the accuracy of confidence measurements for bigrams, and, again, it would seem no accuracy study of the confidence measure on a concrete dataset was ever attempted.
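To make (1) concrete, here is a minimal Scala sketch of the statistic, using the entropy form of the G^2 test for a 2x2 contingency table (I believe this is essentially what Mahout's LogLikelihood.logLikelihoodRatio computes, but treat the sketch as illustrative rather than authoritative):

  // Unnormalized Shannon entropy helpers for the G^2 (LLR) statistic.
  def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)

  def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // LLR for a 2x2 contingency table; asymptotically chi-squared, df = 1,
  // under the null hypothesis of independence of A and B.
  def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double =
    2.0 * (entropy(k11 + k12, k21 + k22)    // H(row sums)
      + entropy(k11 + k21, k12 + k22)       // H(column sums)
      - entropy(k11, k12, k21, k22))        // H(individual cells)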
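And here is the bigram bookkeeping from (2), i.e., how the 2x2 table is assembled from the sparse co-occurrence matrix; the argument names (coocAB, withA, withB, n) are mine, chosen to match the prose above:

  // 2x2 table for the bigram setting: coocAB = matrix element for (A, B),
  // withA/withB = the corresponding colSums() elements, n = elementwise
  // sum of the whole matrix (total bigram draws).
  def bigramLLR(coocAB: Long, withA: Long, withB: Long, n: Long): Double = {
    val k11 = coocAB                      // A and B together
    val k12 = withA - coocAB              // A without B
    val k21 = withB - coocAB              // B without A
    val k22 = n - withA - withB + coocAB  // neither A nor B
    llr(k11, k12, k21, k22)               // llr() from the sketch above
  }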
(3) Sebastian's paper describes another kind of problem, which I call "CF co-occurrence" to contrast with the "bigram" case. Here we test each item independently, using an individual user's choice as a trial. Hence, the number of trials (N) = the number of users (per item tested), and withA and withB are taken as elements of colSums() of the user/item matrix (not of the co-occurrence matrix). That makes sense; see the CF sketch below, after (4). The relaxation of assumptions here is as follows. We obviously want to measure unbiased trial behavior for items, since when we present them to a user as a recommendation, we will rely solely on the item's true value, given a fairly guaranteed portion of user attention. However, by the way of sampling and construction of the trials dataset, we experience what they call _biased sampling_, for the reasons mentioned above (the UI does not guarantee equal portions of user attention per item, so a significant portion of the assumed negative trials simply never occurred). In the extreme case, when we have just started the system, zero exploration has happened, and our degenerate experiments represent extremely biased trial data. As we move toward a more explored landscape, the bias (hopefully) disappears. However, it is not quite clear how much time that would take, and whether that time would be shorter than the stock time of the item in many cases. So, in most realistic cases, the trial-sampling bias will persist. Among other things, this also significantly weakens the accuracy of any confidence statement along the lines of "we are 95% confident that the co-occurrence of A and B, if any, was not due to chance".

(4) Now, about my problem. My problem is more like the "bigram" problem than the "CF" problem. I construct a co-occurrence matrix from a labeled graph and then evaluate the chances of non-independence of the co-occurrences. As such, I never have the A'A operation (and no use of users as an equivalent of trials). Comparing scores and inferring the top n is of course still interesting for me. However, perhaps more importantly, I was aiming to evaluate confidence in more or less chi-squared p-value terms and make statements along the lines of "given the bias-compensated nature of the trials, we can express 95% confidence of non-coincidence of events A and B". I do compare things as well, but I also expect a great number of non-confident results (due to the way the data is collected), and I want to be able to throw away a great deal of the low-confidence co-occurrences, maybe even 99% of them (the way the experiments are conducted, we can parametrize them to manipulate true co-occurrence recall/precision, but we need to be able to tell false co-occurrences apart more or less confidently in order to make a good judgement about the experiment parameters; a sketch of such filtering follows below). Changing the parameters of an experiment obviously changes the distributions of the trials it produces, but under this particular problem that is OK for various reasons, as it also translates into the distributions about which the prediction is made.
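For contrast, the CF bookkeeping from (3), as I understand it (again, the argument names are mine; llr() is from the first sketch). One trial per user, so:

  // 2x2 table for the CF setting: usersWithA/usersWithB come from
  // colSums() of the binarized user/item matrix, usersWithBoth from the
  // co-occurrence (A'A) matrix, numUsers = row count of the user/item
  // matrix.
  def cfLLR(usersWithBoth: Long, usersWithA: Long, usersWithB: Long,
            numUsers: Long): Double = {
    val k11 = usersWithBoth
    val k12 = usersWithA - usersWithBoth
    val k21 = usersWithB - usersWithBoth
    // k22 is the (not A & not B) cell; under biased sampling many of
    // these assumed negative trials never actually happened.
    val k22 = numUsers - usersWithA - usersWithB + usersWithBoth
    llr(k11, k12, k21, k22)
  }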
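Finally, the kind of confidence filtering I am after in (4), assuming (and this is exactly the questionable part) that the chi-squared df = 1 approximation holds for bias-compensated trials; this sketch uses commons-math3 for the chi-squared CDF:

  import org.apache.commons.math3.distribution.ChiSquaredDistribution

  val chiSq = new ChiSquaredDistribution(1.0)  // df = 1

  // p-value for the null "A and B co-occur by chance only"
  def pValue(llrScore: Double): Double =
    1.0 - chiSq.cumulativeProbability(llrScore)

  // Keep a co-occurrence only at >= 95% confidence of non-coincidence;
  // equivalently llrScore >= 3.841, the 0.95 quantile for df = 1.
  def keep(llrScore: Double, alpha: Double = 0.05): Boolean =
    pValue(llrScore) < alpha

Dropping even 99% of the co-occurrences is then just a matter of the cutoff; the open question, per (3), is whether the p-value itself can be trusted given the sampling bias.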
On Mon, Aug 11, 2014 at 5:22 PM, Ted Dunning <[email protected]> wrote:

> Opportunity in this case does not refer to "Saw something and decided to
> buy/click/view/listen". Instead, it refers to the entire process,
> including the UI and whatever discovery mechanisms exist.
>
> You are correct that it is not just measuring the human component.
>
> On Mon, Aug 11, 2014 at 4:56 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > this logic does assume though that every user had an opportunity to
> > eyeball the (A, B) pair and chose (vastly) not to buy them, whereas it
> > may have overwhelmingly happened due to lack of exploration. thus,
> > (not A & not B) figures are probably grossly overestimated
> >
> > On Mon, Aug 11, 2014 at 4:53 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > > No, the question was why the total number of trials sent to LLR is
> > > considered to be m, where the user/item matrix M is m x n.
> > >
> > > ok, i got it. i made some incorrect assumptions about previous code
> > > steps, hence my inference derailed.
> > >
> > > On Mon, Aug 11, 2014 at 4:07 PM, Ted Dunning <[email protected]> wrote:
> > >
> > > > If you binarize the original occurrence matrix then it seems to me
> > > > that the values in the cooccurrence matrix *are* user counts.
> > > >
> > > > Perhaps I misunderstand your original question.
> > > >
> > > > On Mon, Aug 11, 2014 at 3:44 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >
> > > > > sorry, rather, the total occurrences of a pair should be
> > > > > sum(a_i) + sum(a_j) - a_ij (not the 1-norm, of course)
> > > > >
> > > > > On Mon, Aug 11, 2014 at 3:11 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > >
> > > > > > Why does the co-occurrence code take the number of users as the
> > > > > > total interactions? Shouldn't that be the 1-norm of the
> > > > > > co-occurrence matrix?
> > > > > >
> > > > > > On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > >
> > > > > > > So, compared to the original paper [1], similarity is now
> > > > > > > hardcoded and always LLR? Do we have any plans to parameterize
> > > > > > > that further? Is there any reason to parameterize it?
> > > > > > >
> > > > > > > Also, reading the paper, i am wondering a bit -- similarity
> > > > > > > and distance are functions that usually move in different
> > > > > > > directions (i.e. cosine similarity and angular distance), but
> > > > > > > in the paper distance scores are also considered similarities?
> > > > > > > How's that?
> > > > > > >
> > > > > > > I suppose in that context LLR is considered a distance (higher
> > > > > > > scores mean more `distant` items, co-occurring by chance
> > > > > > > only)?
> > > > > > >
> > > > > > > [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf
> > > > > > >
> > > > > > > -d
