Very close. You are conceptually exactly correct. If A contains binary visit or view data, then A'A contains counts that must be reduced to binary values or weights using some statistical procedure. I prefer LLR and binary results.
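To make the binary case concrete, here is a minimal sketch, not a definitive implementation. It builds A'A from a small made-up binary history matrix and scores each item pair with the log-likelihood ratio (G^2) statistic on its 2x2 contingency table. The toy data, the function names (`xlogx`, `entropy`, `llr`, `llr_scores`) and the cutoff `threshold` are all assumptions chosen for illustration.

```python
import numpy as np

def xlogx(x):
    # x * log(x), with the usual convention that 0 * log(0) = 0.
    return 0.0 if x == 0 else x * np.log(x)

def entropy(*counts):
    # Unnormalized entropy of a set of counts: N log N - sum(k log k).
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def llr(k11, k12, k21, k22):
    # G^2 log-likelihood ratio for a 2x2 contingency table.
    # k11: users with both items, k12: item i only,
    # k21: item j only, k22: neither item.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

def llr_scores(A):
    # A is a binary user x item matrix; C = A'A holds cooccurrence counts.
    A = np.asarray(A)
    n_users, n_items = A.shape
    C = A.T @ A
    counts = np.diag(C)  # number of users who touched each item
    S = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                continue
            k11 = C[i, j]
            k12 = counts[i] - k11
            k21 = counts[j] - k11
            k22 = n_users - counts[i] - counts[j] + k11
            S[i, j] = llr(k11, k12, k21, k22)
    return C, S

# Hypothetical toy data: 4 users x 3 items.
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 1]])
C, S = llr_scores(A)

# Hypothetical cutoff; a real one would be tuned on actual data.
threshold = 1.0
binary_similarity = S > threshold
```

Here `C` holds the raw cooccurrence counts and `binary_similarity` is the reduced item x item indicator matrix. With this few users the scores are not meaningful; the sketch only shows the shape of the computation.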
If A contains counts weighted by inverse user frequency, then your dot product is roughly usable as a similarity score. This is especially true if rows of A are normalized somehow to account for over-active users.

On Wed, Sep 9, 2009 at 3:42 PM, Gökhan Çapan <[email protected]> wrote:

> > A is the user x item history matrix. Each row is a user history.
> >
> > A' is the transposed user x item matrix which is of the shape item x user.
> >
> > A' A is the user-level item cooccurrence matrix and has the shape item x item.
>
> Then (A' A)ij is a similarity weight between ith and jth items.
> if Aij is the "rating of ith user for jth item", the highest value of
> "ith row of A' A" is the most similar item for "ith item".
> if the values in A are binary, then (A' A)ij is number of users who have
> rated/clicked/viewed both item i and item j.

--
Ted Dunning, CTO
DeepDyve
