OK, so here's a recap of what I have gathered (it also illustrates the source of my initial confusion).
(1) The original test assumes that there are two Bernoulli random variables, A and B, whose draws are conducted at the same time. Hence, the logic goes, the LLR statistic will asymptotically approximate a chi-squared distribution with df = 1. Another point here is that for discrete events, the information measures baked into the statistic work better _to compare_ things than statistics that lean on the normality assumptions behind the chi-squared approximation, at least for the few datasets studied; the hope is that this extrapolates to other datasets. Note that I have found no accuracy study showing that we can use the chi-squared distribution to derive p-values and make confidence-quantification statements such as "we have 95% confidence that the behavior of the subject was not a coincidence". That is perhaps due in part to the approximation, and in part to the sampling assumptions of the problems I discuss below. (The statistic itself is sketched right after this item and (2).)

(2) The "Surprise and coincidence" paper evaluates what I will call, for now, the "bigram problem", which is a significantly different setting from the CF setting of the co-occurrence paper. In the bigram problem, as far as I gathered from the numbers, we collect N bigrams. N can be thought of as the number of edges in a directed labeled graph connecting token-representing nodes, with labels representing counts. Obviously, this graph can then be represented as a sparse co-occurrence matrix. Under this setting, "A&B" is a matrix element, "withA" and "withB" are the corresponding elements of colSums(), and the total trial count (N) is the elementwise sum of the matrix (its elementwise 1-norm). This problem formulation for the co-occurrence matrix, as we will see, is significantly different from the CF co-occurrence construction; the bookkeeping is sketched below. Another thing is that with the bigram problem we encounter a stretch of the sampling assumptions. In order to comply with the assumptions made in (1), and keeping in mind that 1 bigram = 1 draw, we have to assume we do a Bernoulli draw for every term with every bigram. But the actual data tells us that exactly two variables (A and B) come up true at every draw; i.e., we really do one hypergeometric draw (or multinomial, if you are willing to consider self-co-occurrences) with k = 2, and treat the rest of the term set as having been drawn as 0s. Then, again, we can demonstrate that the statistic still works better to _compare_ things (or to take top-n), since the assumption stretch applies equally to all pairs; but somehow it still feels like we cannot reliably demonstrate the accuracy of confidence measurements for bigrams, and, again, it would seem no accuracy study of the confidence measure on a concrete dataset was ever attempted.
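To make (1) concrete, here is a minimal Scala sketch of the statistic, using the entropy form of the G^2 test for a 2x2 contingency table (I believe this is essentially what Mahout's LogLikelihood.logLikelihoodRatio computes, but treat the sketch as illustrative rather than authoritative):

  // Unnormalized Shannon entropy helpers for the G^2 (LLR) statistic.
  def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)

  def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // LLR for a 2x2 contingency table; asymptotically chi-squared, df = 1,
  // under the null hypothesis of independence of A and B.
  def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double =
    2.0 * (entropy(k11 + k12, k21 + k22)    // H(row sums)
      + entropy(k11 + k21, k12 + k22)       // H(column sums)
      - entropy(k11, k12, k21, k22))        // H(individual cells)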
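And here is the bigram bookkeeping from (2), i.e., how the 2x2 table is assembled from the sparse co-occurrence matrix; the argument names (coocAB, withA, withB, n) are mine, chosen to match the prose above:

  // 2x2 table for the bigram setting: coocAB = matrix element for (A, B),
  // withA/withB = the corresponding colSums() elements, n = elementwise
  // sum of the whole matrix (total bigram draws).
  def bigramLLR(coocAB: Long, withA: Long, withB: Long, n: Long): Double = {
    val k11 = coocAB                      // A and B together
    val k12 = withA - coocAB              // A without B
    val k21 = withB - coocAB              // B without A
    val k22 = n - withA - withB + coocAB  // neither A nor B
    llr(k11, k12, k21, k22)               // llr() from the sketch above
  }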
(3) Sebastian's paper describes another kind of problem, which I call "CF co-occurrence" to contrast with the "bigram" case. Here we test each item independently, using an individual user's choice as a trial. Hence, the number of trials (N) = the number of users (per item tested), and withA and withB are taken as elements of colSums() of the user/item matrix (not of the co-occurrence matrix). That makes sense; see the CF sketch below, after (4). The relaxation of assumptions here is as follows. We obviously want to measure unbiased trial behavior for items, since when we present them to a user as a recommendation, we will rely solely on the item's true value, given a fairly guaranteed portion of user attention. However, by the way of sampling and construction of the trials dataset, we experience what they call _biased sampling_, for the reasons mentioned above (the UI does not guarantee equal portions of user attention per item, so a significant portion of the assumed negative trials simply never occurred). In the extreme case, when we have just started the system, zero exploration has happened, and our degenerate experiments represent extremely biased trial data. As we move toward a more explored landscape, the bias (hopefully) disappears. However, it is not quite clear how much time that would take, and whether that time would be shorter than the stock time of the item in many cases. So, in most realistic cases, the trial-sampling bias will persist. Among other things, this also significantly weakens the accuracy of any confidence statement along the lines of "we are 95% confident that the co-occurrence of A and B, if any, was not due to chance".

(4) Now, about my problem. My problem is more like the "bigram" problem than the "CF" problem. I construct a co-occurrence matrix from a labeled graph and then evaluate the chances of non-independence of the co-occurrences. As such, I never have the A'A operation (and no use of users as an equivalent of trials). Comparing scores and inferring the top n is of course still interesting for me. However, perhaps more importantly, I was aiming to evaluate confidence in more or less chi-squared p-value terms and make statements along the lines of "given the bias-compensated nature of the trials, we can express 95% confidence of non-coincidence of events A and B". I do compare things as well, but I also expect a great number of non-confident results (due to the way the data is collected), and I want to be able to throw away a great deal of the low-confidence co-occurrences, maybe even 99% of them (the way the experiments are conducted, we can parametrize them to manipulate true co-occurrence recall/precision, but we need to be able to tell false co-occurrences apart more or less confidently in order to make a good judgement about the experiment parameters; a sketch of such filtering follows below). Changing the parameters of an experiment obviously changes the distributions of the trials it produces, but under this particular problem that is OK for various reasons, as it also translates into the distributions about which the prediction is made.
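For contrast, the CF bookkeeping from (3), as I understand it (again, the argument names are mine; llr() is from the first sketch). One trial per user, so:

  // 2x2 table for the CF setting: usersWithA/usersWithB come from
  // colSums() of the binarized user/item matrix, usersWithBoth from the
  // co-occurrence (A'A) matrix, numUsers = row count of the user/item
  // matrix.
  def cfLLR(usersWithBoth: Long, usersWithA: Long, usersWithB: Long,
            numUsers: Long): Double = {
    val k11 = usersWithBoth
    val k12 = usersWithA - usersWithBoth
    val k21 = usersWithB - usersWithBoth
    // k22 is the (not A & not B) cell; under biased sampling many of
    // these assumed negative trials never actually happened.
    val k22 = numUsers - usersWithA - usersWithB + usersWithBoth
    llr(k11, k12, k21, k22)
  }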
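Finally, the kind of confidence filtering I am after in (4), assuming (and this is exactly the questionable part) that the chi-squared df = 1 approximation holds for bias-compensated trials; this sketch uses commons-math3 for the chi-squared CDF:

  import org.apache.commons.math3.distribution.ChiSquaredDistribution

  val chiSq = new ChiSquaredDistribution(1.0)  // df = 1

  // p-value for the null "A and B co-occur by chance only"
  def pValue(llrScore: Double): Double =
    1.0 - chiSq.cumulativeProbability(llrScore)

  // Keep a co-occurrence only at >= 95% confidence of non-coincidence;
  // equivalently llrScore >= 3.841, the 0.95 quantile for df = 1.
  def keep(llrScore: Double, alpha: Double = 0.05): Boolean =
    pValue(llrScore) < alpha

Dropping even 99% of the co-occurrences is then just a matter of the cutoff; the open question, per (3), is whether the p-value itself can be trusted given the sampling bias.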
On Mon, Aug 11, 2014 at 5:22 PM, Ted Dunning <[email protected]> wrote:

> Opportunity in this case does not refer to "Saw something and decided to
> buy/click/view/listen". Instead, it refers to the entire process,
> including the UI and whatever discovery mechanisms exist.
>
> You are correct that it is not just measuring the human component.
>
> On Mon, Aug 11, 2014 at 4:56 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > this logic does assume though that every user had an opportunity to
> > eyeball the (A, B) pair and chose (vastly) not to buy them, whereas it
> > may have overwhelmingly happened due to lack of exploration. thus,
> > (not A & not B) figures are probably grossly overestimated
> >
> > On Mon, Aug 11, 2014 at 4:53 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > > No, the question was why the total number of trials sent to LLR is
> > > considered to be m, where the user/item matrix M is m x n.
> > >
> > > ok, i got it. i made some incorrect assumptions about previous code
> > > steps, hence my inference derailed.
> > >
> > > On Mon, Aug 11, 2014 at 4:07 PM, Ted Dunning <[email protected]> wrote:
> > >
> > > > If you binarize the original occurrence matrix then it seems to me
> > > > that the values in the cooccurrence matrix *are* user counts.
> > > >
> > > > Perhaps I misunderstand your original question.
> > > >
> > > > On Mon, Aug 11, 2014 at 3:44 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >
> > > > > sorry, rather, the total occurrences of a pair should be
> > > > > sum(a_i) + sum(a_j) - a_ij (not the 1-norm, of course)
> > > > >
> > > > > On Mon, Aug 11, 2014 at 3:11 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > >
> > > > > > Why does the co-occurrence code take the number of users as the
> > > > > > total interactions? Shouldn't that be the 1-norm of the
> > > > > > co-occurrence matrix?
> > > > > >
> > > > > > On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > >
> > > > > > > So, compared to the original paper [1], similarity is now
> > > > > > > hardcoded and always LLR? Do we have any plans to parameterize
> > > > > > > that further? Is there any reason to parameterize it?
> > > > > > >
> > > > > > > Also, reading the paper, i am wondering a bit -- similarity
> > > > > > > and distance are functions that usually move in different
> > > > > > > directions (i.e. cosine similarity and angular distance), but
> > > > > > > in the paper distance scores are also considered similarities?
> > > > > > > How's that?
> > > > > > >
> > > > > > > I suppose in that context LLR is considered a distance (higher
> > > > > > > scores mean more `distant` items, co-occurring by chance
> > > > > > > only)?
> > > > > > >
> > > > > > > [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf
> > > > > > >
> > > > > > > -d
