"Also, if your original matrix is A, then it is usually a shorter path to
results to analyze the word (item) cooccurrence matrix A'A.  The methods
below work either way."

The cooccurrence definitions I'm finding (e.g., on Wikipedia) only use the
summation-based one. Are there any that involve inverting the matrix instead?
Or, as I suspect, are the two exactly the same and I just don't know enough
linear algebra to see it?
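
To make the question concrete, here is the kind of check I have in mind (a
rough, untested sketch on a toy 0/1 user-by-item matrix):

library(Matrix)

# toy 0/1 user-by-item matrix (rows = users, columns = items)
A <- sparseMatrix(i = c(1, 1, 2, 2, 3), j = c(1, 2, 1, 3, 2), x = 1)

# summation-based cooccurrence: for items j and k, count users that have both
n <- ncol(A)
C1 <- matrix(0, n, n)
for (j in 1:n) for (k in 1:n) C1[j, k] <- sum(A[, j] * A[, k])

# matrix-product form of what looks like the same thing
C2 <- as.matrix(t(A) %*% A)

all(C1 == C2)  # should be TRUE -- no inverse anywhere, just bookkeeping for the sums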

On Mon, Aug 29, 2011 at 3:28 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Jeff,
>
> I think that this is a much simpler exposition:
> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
>
> It makes the connection with entropy clear and allows a very simple
> implementation for more than 2x2 situations.
>
> More comments in-line:
>
> On Mon, Aug 29, 2011 at 1:34 PM, Jeff Hansen <dsche...@gmail.com> wrote:
>
> > ...
> > If we have an item/feature matrix (document/words, user/movies) then we
> can
> > consider each row or column to be a sample from larger reference
> population
> > of the full matrix.
>
>
> Well, sort of.  If you have a document (user) and a word (item), then you
> have a joint probability that any given interaction will be between this
> document and word.  We pretend in this case that each interaction is
> independent of every other which is patently not true, but very helpful.
>
> This joint probability can be approximated more or less accurately as the
> product of document (user) and word (item) popularities.  In matrix terms,
> this is the same as approximating the joint probability as a rank-1 matrix
> that is the outer product of user and item probability distributions, each
> of which is a vector.  The point of what we are trying to do is to find a
> good and economical approximation of the joint probability that captures
> important deviations from this rank-1 approximation.  There are two
> problems
> that come up in trying to find this approximation.  The first is that we
> see
> counts rather than probabilities and the second is that the matrix is big.
>  Sometimes, it is very big.
>
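> (A minimal sketch of that rank-1 baseline, assuming A is an ordinary count
> matrix -- untested:)
>
> # empirical joint probability from a count matrix A (rows = docs/users, cols = words/items)
> P <- A / sum(A)
>
> # rank-1 approximation: outer product of the row and column marginals
> P1 <- outer(rowSums(P), colSums(P))
>
> # the structure worth modeling is the deviation from this baseline
> D <- P - P1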
>
> > In that case the log likelihood significance for any
> > given observation within the sample will be based on the observation
> itself
> > (1), the count of observations for the column (documents with this
> > word/users that watched this movie), the count of observations for the
> row
> > (words in this document, movies this user has watched) and the total
> number
> > of any observation within the reference population.  Given those numbers,
> > G2
> > or log likelihood is just a calculation of a, b, c and d (values
> described
> > above respectively)
> >
>
> As the blog post I mentioned above makes clearer than the original paper, you
> need
> to reduce the counts before taking them as entries in the contingency
> table.
>  If you are looking at the typical problem of vocabulary in documents, then
> the discrepancy won't be huge, but if you have any very popular items, this
> could be a problem.
>
> Given the log likelihood for each individual observation, is the best way
> to
> > apply it simply to remove observations that don't have a high enough
> level
> > of significance?
>
>
> Roughly.  It is a bit misleading to talk about "significance" here since we
> aren't trying to do a classic frequentist hypothesis test.  Instead, what
> we
> are trying to do is prioritize which joint terms we need to use to get a
> usefully better approximation of the underlying joint distribution.
>
>
> > Or is there a better way to normalize the values rather
> > than removing less significant ones.
>
>
> There are a lot of ways of approaching this.  Most don't do any better than
> the simple expedient of just keeping the highest scoring k words for each
> document (or items for each user).  Having k be the same for all users is
> not a bad thing at all from a number of angles.  I often use k = 50 to 300.
>
> Also, in this case I feel like it makes more sense to treat Users like
> > Documents and movies like observations of words within a document, so
> > compare each User and their observations to
> > the reference population rather than each Movie and the users that
> reviewed
> > it.
>
>
> I think that you are on the right track here, but your second sentence made
> a distinction I didn't understand.  User is to item as Document is to Word
> is a fine metaphor for recording your observations. I strongly prefer
> binary
> observations, so I usually either count all ratings as 1 or all ratings
> above a threshold as 1 with everything else being a zero.
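>
> (In R that binarization is a one-liner -- sketch, with "ratings" a
> user-by-item matrix and 4 an arbitrary threshold:)
>
> B_all   <- 1 * (ratings > 0)    # every rating counts as 1
> B_liked <- 1 * (ratings >= 4)   # or only ratings at or above a threshold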
>
> Whether you subsequently try to find users similar to each other or items
> similar to each other doesn't much matter.  All are reasonable things to
> try.  Whether they are useful is up to you and your customer to decide.
>
> Also, if your original matrix is A, then it is usually a shorter path to
> results to analyze the word (item) cooccurrence matrix A'A.  The methods
> below work either way.
>
>
> > I realize this is a Mahout user list, so I feel somewhat guilty
> constantly
> > pasting in R code, but here's how I ended up implementing it.
>
>
> I do it.  Use the right tools for the right task.  R is great for
> prototyping.
>
>
> > #rough equation that calculates log likelihood given a contingency table
> of
> > counts
> > llr <- function(a,b,c,d){
> >    r=(a+b)/(c+d)
> >    e1=c*r
> >    e2=d*r
> >    g2=2*(a*log(a/e1)+b*log(b/e2))
> >    return(g2)
> > }
> >
>
> So this code implies that you are arranging the elements of the contingency
> table this way:
>
>  a    b  | a+b
>  c    d  | c+d
> ---------+--------
> a+c  b+d | a+b+c+d
>
> But in your later code you pass in data that uses this convention
>
>  a   b-a   | b
>  c   d-b-c | d-b
> -----------+-----
> a+c  d-a-c | d
>
> I don't like this form because it makes code more complex overall and
> breaks
> important symmetries.
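>
> For reference, a version that takes all four cells in the symmetric
> convention above, written in the entropy form along the lines of the blog
> post (untested sketch):
>
> # G^2 / LLR for the 2x2 table   k11 k12
> #                               k21 k22
> llr4 <- function(k11, k12, k21, k22) {
>   xlx <- function(x) ifelse(x > 0, x * log(x), 0)   # x*log(x) with 0*log(0) = 0
>   N <- k11 + k12 + k21 + k22
>   2 * (xlx(k11) + xlx(k12) + xlx(k21) + xlx(k22) + xlx(N)
>        - xlx(k11 + k12) - xlx(k21 + k22)      # row totals
>        - xlx(k11 + k21) - xlx(k12 + k22))     # column totals
> }
>
> It vectorizes, so it can be applied directly to the triples below; with the
> counts defined there (b = column sums, c = row sums, d = grand total), the
> four cells for one entry should work out to a, cx - a, bx - a and
> d - bx - cx + a.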
>
> > #get counts to calculate log likelihood for function above
> > a <- 1
> > b <- apply(A,2,sum)
>
> > c <- apply(A,1,sum)
> >
>
> you can also use rowSums and colSums
>
> > d <- sum(A)
> >
> > #export sparse matrix A to a dataframe for calculating llr
> > triples <- summary(A)
> > #create b and c value vectors that sync up with the dataframe columns
> > bx <- b[triples$j]
> > cx <- c[triples$i]
> > #calculate the log likelihood for each entry in the dataframe
> > ll <- llr(a,bx,cx,d)
> >
>
> This probably needs to be something more like
>
> llr(a, bx-a, d-cx, d-bx-cx+a)
>
> (I think)
>
> Also, summary produces different values for different kinds of matrices.
>  As
> you point out, it is very handy for sparse matrices, but it is nice to have
> code that works on sparse or dense matrices.
>
>
> > #create a logical operator vector that indicates whether each entry in
> > dataframe is significant
> > significant <- ll>3
> >
>
> I think that a much higher cutoff is better.  Commonly a cutoff of 10-30 is
> useful.  Also, you may be better off if you can take the k-highest.
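>
> For example (an untested sketch reusing the triples and ll objects from the
> code above):
>
> # option 1: a stiffer cutoff
> keep <- ll > 20
> # option 2: instead, keep at most the k highest-scoring items per row
> k <- 50
> keep <- ave(-ll, triples$i, FUN = rank) <= k   # rank 1 = largest ll within the row
>
> B <- sparseMatrix(i = triples$i[keep], j = triples$j[keep], x = 1,
>                   dims = dim(A))               # keep the original shape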
>
>
> > #build a new sparse matrix that keeps only the entries with a significant
> > enough log likelihood
> > B <- sparseMatrix(i=triples$i[significant],j=triples$j[significant],x=1)
> >
>
>
>
> >
> > By the way, I always used to struggle with matrix multiplication (I could
> > never quite remember which side was rows and which side was columns without
> > little mnemonics, but then again I've always struggled with remembering my
> > left from my right).  I recently realized it makes much more sense if you
> > picture it as a cube in 3-space -- if you match up the lower left hand corner
> > of the right matrix with the upper right hand corner of the left matrix, put
> > the product in between them so you get a backwards capital L shape, and then
> > fold back the right and left matrices, you get three faces of a cube (or
> > rectangular box).  Then you just project inwards from the original matrices,
> > multiply wherever the values cross, and sum up those products as you project
> > them toward the face that is the resulting matrix.  Once you visualize it
> > this way, slicing and dicing the operation into blockwise formulas becomes
> > trivial and obvious.  If only somebody had explained it this way I would
> > have found matrices much less intimidating -- has anybody ever seen it
> > explained this way?
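> >
> > (A quick numerical check of the blockwise claim -- untested:)
> >
> > X <- matrix(1:6, nrow = 2)                  # 2 x 3
> > Y <- matrix(1:12, nrow = 3)                 # 3 x 4
> > X1 <- X[, 1:2]; X2 <- X[, 3, drop = FALSE]  # split X by columns...
> > Y1 <- Y[1:2, ]; Y2 <- Y[3, , drop = FALSE]  # ...and Y by rows at the same spot
> > all.equal(X %*% Y, X1 %*% Y1 + X2 %*% Y2)   # the block products just add up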
> >
> >
> >
> >
> > On Fri, Aug 26, 2011 at 10:21 PM, Lance Norskog <goks...@gmail.com>
> wrote:
> >
> > >
> > >
> >
> http://www.nytimes.com/interactive/2010/01/10/nyregion/20100110-netflix-map.html
> > >
> > > Do not fear demographics. Yes, some people rent movies with all-black
> > > casts,
> > > and other people rent movies with all-white casts. And the Walmarts in
> > the
> > > SF East Bay have pallets full of Tyler Perry videos, while most of the
> > SF
> > > Peninsula don't know his name.
> > >
> > > And sometimes a bunch of Army Rangers have a great time watching
> > > "Confessions of a Shopaholic" in a tent, 120o farenheit, in Qatar.
> > >
> > > On Fri, Aug 26, 2011 at 8:29 AM, Jeff Hansen <dsche...@gmail.com>
> wrote:
> > >
> > > > Thanks for the math Ted -- that was very helpful.
> > > >
> > > > I've been using sparseMatrix() from library(Matrix) -- largely based
> on
> > > your
> > > > response to somebody else's email.  I've been playing with smaller
> > > matrices
> > > > mainly for my own learning purposes -- it's much easier to read
> through
> > > 200
> > > > movies (most of which I've heard of) and get a gut feel, than 10,000
> > > > movies.
> > > >  It also means I don't have to shut down my session quite as often
> > (I've
> > > > been using rstudio) when I run something over the wrong
> > dimension(column
> > > vs
> > > > row).  I was running into some limitations as well, but I think some
> of
> > > > those may have had more to do with my own typos and misunderstanding
> of
> > > > that
> > > > language.
> > > >
> > > > There's a second reason I've been avoiding the tail -- while working
> on
> > > > this
> > > > as a possible project to propose, I had a bit of a "duh" moment.
> > >  Sometimes
> > > > it's important to realize the real world constraints.  Picture a
> > company
> > > > with very physical locations and very limited shelf space (say a
> > > > convenience
> > > > store or even a vending machine...)  Netflix and Amazon can make a
> lot
> > of
> > > > money in the long tail because they have centralized and or digital
> > > > inventories that service a large customer base.  Imagine a situation
> > > where
> > > > you have a small physical distributed inventory with an equally
> > > distributed
> > > > customer base.  In that situation it doesn't pay to chase after the
> > tail
> > > --
> > > > you just need to reduce your costs and target the head where you get
> > more
> > > > volume with fewer items.  So you really end up with a three-tier
> market
> > > > segmentation -- one strategy works best for the head, another for the
> > > body,
> > > > and a third for the tail.
> > > >
> > > > As far as clusters go -- I really wasn't finding any clusters at the
> > > edges
> > > > of the data, but that could have more to do with not including the
> tail
> > > > (and
> > > > not normalizing appropriately for popularity).
> > > >
> > > > Incidentally, there was one amusingly cautionary anecdote I'd share
> --
> > I
> > > > don't remember the exact spread, but at one point I had two extremes
> > that
> > > > included something like "the johnson family vacation", "my baby's
> > daddy",
> > > > and "barbershop 2" on the one side, and then titles like "mean
> girls",
> > > "50
> > > > first dates" and "along came polly" on the other side.  You may have
> to
> > > > look
> > > > up those titles to see what I'm talking about, but I'd say the
> > > distinction
> > > > will be pretty clear when you do -- and if you were simply automating
> > all
> > > > of
> > > > this and not reviewing it with common sense, you could end up
> offending
> > > > some
> > > > of your users...  All I'm saying is there may be trends in the real
> > world
> > > > that some people aren't comfortable having pointed out.
> > > >
> > > > On Fri, Aug 26, 2011 at 4:08 AM, Lance Norskog <goks...@gmail.com>
> > > wrote:
> > > >
> > > > > This is a little meditation on user v.s. item matrix density. The
> > > > > heavy users and heavy items can be subsampled, once they are
> > > > > identified. Hadoop's built-in sort does give a very simple
> > > > > "map-increase" way to do this sort.
> > > > >
> > > > >
> > http://ultrawhizbang.blogspot.com/2011/08/sorted-recommender-data.html
> > > > >
> > > > > On Thu, Aug 25, 2011 at 5:57 PM, Ted Dunning <
> ted.dunn...@gmail.com>
> > > > > wrote:
> > > > > > In matrix terms the binary user x item matrix maps a set of items
> > to
> > > > > users
> > > > > > (A h = users who interacted with items in h).  Similarly A' maps
> > > users
> > > > to
> > > > > > items.  Thus A' (A h) is the classic "users who x'ed this also
> x'ed
> > > > that"
> > > > > > sort of operation.  This can be rearranged to be (A'A) h.
> > > > > >
> > > > > > This is where the singular values come in.  If
> > > > > >
> > > > > >     A \approx U S V'
> > > > > >
> > > > > > and we know that U'U = V'V = I from the definition of the SVD.
> >  This
> > > > > means
> > > > > > that
> > > > > >
> > > > > >    A' A \approx (U S V')' (U S V') = V S U' U S V' = V S^2 V'
> > > > > >
> > > > > > But V' transforms an item vector into the reduced dimensional
> space
> > > so
> > > > we
> > > > > > can view this as transforming the item vector into the reduced
> > > > > dimensional
> > > > > > space, scaling element-wise using S^2 and then transforming back
> to
> > > > item
> > > > > > space.
> > > > > >
> > > > > > As you noted, the first right singular vector corresponds to
> > > > popularity.
> > > > >  It
> > > > > > is often not appropriate to just recommend popular things (since
> > > users
> > > > > have
> > > > > > other means of discovering them) so you might want to drop that
> > > > dimension
> > > > > > from the analysis.  This can be done in the SVD results or it can
> > be
> > > > done
> > > > > > using something like a log-likelihood ratio test so that we only
> > > model
> > > > > > anomalously large values in A'A.
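> > > > > >
> > > > > > (A numerical sketch of the decomposition above and of dropping that
> > > > > > popularity dimension -- untested, dense toy matrix:)
> > > > > >
> > > > > > A <- matrix(rbinom(200 * 50, 1, 0.1), nrow = 200)  # toy 0/1 user-by-item matrix
> > > > > > s <- svd(A)
> > > > > >
> > > > > > # A'A = V S^2 V' (exact here because no singular values were dropped)
> > > > > > all.equal(t(A) %*% A, s$v %*% diag(s$d^2) %*% t(s$v))
> > > > > >
> > > > > > # dropping the "popularity" dimension: keep components 2..k instead of 1..k
> > > > > > keep <- 2:10
> > > > > > A_hat <- s$u[, keep] %*% diag(s$d[keep]) %*% t(s$v[, keep])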
> > > > > >
> > > > > > Btw, if you continue to use R for experimentation, you should be
> > able
> > > > to
> > > > > > dramatically increase the size of the analysis you do by using
> > sparse
> > > > > > matrices and a random projection algorithm.  I was able to
> compute
> > > > SVD's
> > > > > of
> > > > > > matrices with millions of rows and columns and a few million
> > non-zero
> > > > > > elements in a few seconds.  Holler if you would like details
> about
> > > how
> > > > to
> > > > > do
> > > > > > this.
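> > > > > >
> > > > > > (Roughly, a random-projection SVD looks like the following -- an
> > > > > > untested sketch, and not necessarily the exact recipe meant here:)
> > > > > >
> > > > > > library(Matrix)
> > > > > >
> > > > > > rsvd <- function(A, k, p = 10) {
> > > > > >   # project onto a small random subspace, orthonormalize, then SVD the small matrix
> > > > > >   Omega <- matrix(rnorm(ncol(A) * (k + p)), ncol = k + p)
> > > > > >   Q <- qr.Q(qr(as.matrix(A %*% Omega)))   # basis for (approximately) the range of A
> > > > > >   B <- as.matrix(t(Q) %*% A)              # small (k+p) x ncol(A) matrix
> > > > > >   s <- svd(B, nu = k, nv = k)
> > > > > >   list(u = Q %*% s$u, d = s$d[1:k], v = s$v)
> > > > > > }
> > > > > >
> > > > > > When the spectrum decays reasonably fast this gets close to the top k
> > > > > > singular triplets at a small fraction of the cost of a full SVD.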
> > > > > >
> > > > > > On Thu, Aug 25, 2011 at 4:07 PM, Jeff Hansen <dsche...@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > >> One thing I found interesting (but not particularly surprising)
> is
> > > > that
> > > > > the
> > > > > >> biggest singular value/vector was pretty much tied directly to
> > > volume.
> > > > > >>  That
> > > > > >> makes sense because the best predictor of whether a given field's
> > > value
> > > > > was
> > > > > >> 1
> > > > > >> was whether it belonged to a row with lots of 1s or a column
> with
> > > lots
> > > > > of
> > > > > >> 1s
> > > > > >> (I haven't quite figured out the best method to normalize the
> > values
> > > > > with
> > > > > >> yet).  When I plot the values of the largest singular vectors
> > > against
> > > > > the
> > > > > >> sum of values in the corresponding row or column, the
> correlation
> > is
> > > > > very
> > > > > >> linear.  That's not the case for any of the other singular
> vectors
> > > > > (which
> > > > > >> actually makes me wonder if just removing that first singular
> > vector
> > > > > from
> > > > > >> the prediction might not be one of the best ways to normalize
> the
> > > > data).
> > > > > >>
> > > > > >> I understand what you're saying about each singular vector
> > > > corresponding
> > > > > to
> > > > > >> a feature though.  Each left singular vector represents some
> > > abstract
> > > > > >> aspect
> > > > > >> of a movie and each right singular vector represents users
> > leanings
> > > or
> > > > > >> inclinations with regards to that aspect of the movie.  The
> > singular
> > > > > value
> > > > > >> itself just seems to indicate how good a predictor the
> combination
> > > of
> > > > > one
> > > > > >> user's inclination toward that aspect of a movie is for coming up
> > > with
> > > > > the
> > > > > >> actual value.  The issue I mentioned above is that popularity of
> a
> > > > movie
> > > > > as
> > > > > >> well as how often a user watches movies tend to be the best
> > > predictors
> > > > > of
> > > > > >> whether a user has seen or will see a movie.
> > > > > >>
> > > > > >> I had been picturing this with the idea of one k dimensional
> space
> > > --
> > > > > one
> > > > > >> where a user's location corresponds to their ideal prototypical
> > > movies
> > > > > >> location. Not that there would be a movie right there, but there
> > > would
> > > > > be
> > > > > >> ones nearby, and the nearer they were, the more enjoyable they
> > would
> > > > be.
> > > > > >>  That's a naive model, but that doesn't mean it wouldn't work
> well
> > > > > enough.
> > > > > >>  My problem is I don't quite get how to map the user space over
> to
> > > the
> > > > > item
> > > > > >> space.
> > > > > >>
> > > > > >> I think that may be what Lance is trying to describe in his last
> > > > > response,
> > > > > >> but I'm falling short on reconstructing the math from his
> > > description.
> > > > > >>
> > > > > >> I get the following
> > > > > >>
> > > > > >> U S V' = A
> > > > > >> U S = A V
> > > > > >> U = A V S^-1
> > > > > >>
> > > > > >> I think that last line is what Lance was describing.  Of course
> my
> > > > > problem
> > > > > >> was bootstrapping in a user for whom I don't know A or V.
> > > > > >>
> > > > > >> I also think I may have missed a big step of the puzzle.  For
> some
> > > > > reason I
> > > > > >> thought that just by loosening the rank, you could recompose the
> > > > Matrix
> > > > > A
> > > > > >> from the truncated SVD values/vectors and use the recomposed
> > values
> > > > > >> themselves as the recommendation.  I thought one of the ideas
> was
> > > that
> > > > > the
> > > > > >> recomposed matrix had less "noise" and could be a better
> > > > representation
> > > > > of
> > > > > >> the underlying nature of the matrix than the original matrix
> > itself.
> > > > >  But
> > > > > >> that may have just been wishful thinking...
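> > > > > >>
> > > > > >> (Both pieces sketch out simply in R -- untested, with k an arbitrary rank:)
> > > > > >>
> > > > > >> s <- svd(A); k <- 10
> > > > > >> Uk <- s$u[, 1:k]; dk <- s$d[1:k]; Vk <- s$v[, 1:k]
> > > > > >>
> > > > > >> # recomposed, lower-noise version of A from the truncated SVD
> > > > > >> Ak <- Uk %*% diag(dk) %*% t(Vk)
> > > > > >>
> > > > > >> # fold-in of one new row, i.e. U = A V S^-1 applied to a_new
> > > > > >> a_new <- A[1, ]                        # stand-in for a new user's row
> > > > > >> u_new <- as.numeric(a_new %*% Vk) / dk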
> > > > > >>
> > > > > >> On Thu, Aug 25, 2011 at 4:50 PM, Sean Owen <sro...@gmail.com>
> > > wrote:
> > > > > >>
> > > > > >> > The 200x10 matrix is indeed a matrix of 10 singular vectors,
> > which
> > > > are
> > > > > >> > eigenvectors of AA'. It's the columns, not rows, that are
> > > > > >> > eigenvectors.
> > > > > >> >
> > > > > >> > The rows do mean something. I think it's fair to interpret the
> > 10
> > > > > >> > singular values / vectors as corresponding to some underlying
> > > > features
> > > > > >> > of tastes. The rows say how much each user expresses those 10
> > > > tastes.
> > > > > >> > The matrix of right singular vectors on the other side tells
> you
> > > the
> > > > > >> > same thing about items. The diagonal matrix of singular values
> > in
> > > > the
> > > > > >> > middle also comes into play -- it's like a set of multipliers
> > that
> > > > say
> > > > > >> > how important each feature is. (This is why we cut out the
> > > singular
> > > > > >> > vectors / values that have the smallest singular values; it's
> > like
> > > > > >> > removing the least-important features.) So really you'd have
> to
> > > > stick
> > > > > >> > those values somewhere; Ted says it's conventional to put
> "half"
> > > of
> > > > > >> > each (their square roots) with each side if anything.
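> > > > > >> >
> > > > > >> > (In code that convention is just the following -- sketch:)
> > > > > >> >
> > > > > >> > s <- svd(A)
> > > > > >> > user_factors <- s$u %*% diag(sqrt(s$d))   # "half" of S on each side
> > > > > >> > item_factors <- s$v %*% diag(sqrt(s$d))
> > > > > >> > # user_factors %*% t(item_factors) gives A back (approximately, once truncated)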
> > > > > >> >
> > > > > >> > I don't have as good a grasp on an intuition for the columns
> as
> > > > > >> > eigenvectors. They're also a set of basis vectors, and I had
> > > > > >> > understood them as like the "real" bases of the reduced
> feature
> > > > space
> > > > > >> > expressed in user-item space. But I'd have to go back and
> think
> > > that
> > > > > >> > intuition through again since I can't really justify it from
> > > scratch
> > > > > >> > again in my head just now.
> > > > > >> >
> > > > > >> >
> > > > > >> > On Thu, Aug 25, 2011 at 10:21 PM, Jeff Hansen <
> > dsche...@gmail.com
> > > >
> > > > > >> wrote:
> > > > > >> > > Well, I think my problem may have had more to do with what I
> > was
> > > > > >> calling
> > > > > >> > the
> > > > > >> > > eigenvector...  I was referring to the rows rather than the
> > > > columns
> > > > > of
> > > > > >> U
> > > > > >> > and
> > > > > >> > > V.  While the columns may be characteristic of the overall
> > > matrix,
> > > > > the
> > > > > >> > rows
> > > > > >> > > are characteristic of the user or item (in that they are a
> > rank
> > > > > reduced
> > > > > >> > > representation of that person or thing). I guess you could
> say
> > I
> > > > > just
> > > > > >> had
> > > > > >> > to
> > > > > >> > > tilt my head to the side and change my perspective 90
> degrees
> > =)
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Lance Norskog
> > > > > goks...@gmail.com
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Lance Norskog
> > > goks...@gmail.com
> > >
> >
>



-- 
Lance Norskog
goks...@gmail.com
