"If you have a document (user) and a word (item), then you
have a joint probability that any given interaction will be between this
document and word.  We pretend in this case that each interaction is
independent of every other which is patently not true, but very helpful."

So if you subsample randomly to trim the data sizes, is it worth
deliberately breaking correlations? Let's say that in a user/item dataset,
almost all of the user/rating rows fall in clusters of two or three according
to the timestamp. Instead of just taking a random subsample, is it worth
removing one rating from each cluster? That would bring the data closer to
the independence assumption.
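
Here is roughly what I have in mind, as a base-R sketch.  I'm assuming a data
frame called ratings with user, item, rating and timestamp (seconds) columns,
and I'm arbitrarily calling ratings from the same user that are less than an
hour apart a cluster:

# order by user, then time, and mark where a new cluster starts
ratings <- ratings[order(ratings$user, ratings$timestamp), ]
new.cluster <- c(TRUE, diff(ratings$timestamp) > 3600 |
                       diff(as.integer(factor(ratings$user))) != 0)
cluster.id <- cumsum(new.cluster)

# drop the first rating of every cluster that has more than one member
keep <- !(new.cluster & duplicated(cluster.id, fromLast = TRUE))
thinned <- ratings[keep, ]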


On Mon, Aug 29, 2011 at 3:28 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Jeff,
>
> I think that this is a much simpler exposition:
> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
>
> It makes the connection with entropy clear and allows a very simple
> implementation for more than 2x2 situations.
>
> More comments in-line:
>
> On Mon, Aug 29, 2011 at 1:34 PM, Jeff Hansen <dsche...@gmail.com> wrote:
>
> > ...
> > If we have a item/feature matrix (document/words, user/movies) then we
> can
> > consider each row or column to be a sample from larger reference
> population
> > of the full matrix.
>
>
> Well, sort of.  If you have a document (user) and a word (item), then you
> have a joint probability that any given interaction will be between this
> document and word.  We pretend in this case that each interaction is
> independent of every other which is patently not true, but very helpful.
>
> This joint probability can be approximated more or less accurately as the
> product of document (user) and word (item) popularities.  In matrix terms,
> this is the same as approximating the joint probability as a rank-1 matrix
> that is the outer product of user and item probability distributions, each
> of which is a vector.  The point of what we are trying to do is to find a
> good and economical approximation of the joint probability that captures
> important deviations from this rank-1 approximation.  There are two
> problems
> that come up in trying to find this approximation.  The first is that we
> see
> counts rather than probabilities and the second is that the matrix is big.
>  Sometimes, it is very big.
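
For small matrices you can look at that rank-1 approximation directly in R.
A sketch, assuming A holds the raw counts (the outer product comes back
dense, so this is only for poking at toy-sized examples):

P <- A / sum(A)               # empirical joint probability
p.row <- rowSums(P)           # document (user) marginal
p.col <- colSums(P)           # word (item) marginal
P1 <- outer(p.row, p.col)     # the rank-1 approximation
residual <- P - P1            # the deviations we actually care about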
>
>
> > In that case the log likelihood significance for any
> > given observation within the sample will be based on the observation
> itself
> > (1), the count of observations for the column (documents with this
> > word/users that watched this movie), the count of observations for the
> row
> > (words in this document, movies this user has watched) and the total
> number
> > of any observation within the reference population.  Given those numbers,
> > G2
> > or log likelihood is just a calculation of a, b, c and d (values
> described
> > above respectively)
> >
>
> As the blog I mentioned above makes clearer than the original paper, you need
> to reduce the counts before taking them as entries in the contingency table.
>  If you are looking at the typical problem of vocabulary in documents, then
> the discrepancy won't be huge, but if you have any very popular items, this
> could be a problem.
>
> Given the log likelihood for each individual observation, is the best way
> to
> > apply it simply to remove observations that don't have a high enough
> level
> > of significance?
>
>
> Roughly.  It is a bit misleading to talk about "significance" here since we
> aren't trying to do a classic frequentist hypothesis test.  Instead, what
> we
> are trying to do is prioritize which joint terms we need to use to get a
> usefully better approximation of the underlying joint distribution.
>
>
> > Or is there a better way to normalize the values rather
> > than removing less significant ones.
>
>
> There are a lot of ways of approaching this.  Most don't do any better than
> the simple expedient of just keeping the highest scoring k words for each
> document (or items for each user).  Having k be the same for all users is
> not a bad thing at all from a number of angles.  I often use k = 50 to 300.
>
> Also, in this case I feel like it makes more sense to treat Users like
> > Documents and movies like observations of words within a document, so
> > compare each User and their observations to
> > the reference population rather than each Movie and the users that
> reviewed
> > it.
>
>
> I think that you are on the right track here, but your second sentence made
> a distinction I didn't understand.  User is to item as Document is to Word
> is a fine metaphor for recording your observations. I strongly prefer
> binary
> observations, so I usually either count all ratings as 1 or all ratings
> above a threshold as 1 with everything else being a zero.
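
In R that binarization is a one-liner either way (a sketch; M here stands for
the raw user x item rating matrix and 4 is just an example threshold):

B <- 1 * (M > 0)      # every interaction counts as 1
B <- 1 * (M >= 4)     # or: only ratings of 4 and up count as 1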
>
> Whether you subsequently try to find users similar to each other or items
> similar to each other doesn't much matter.  All are reasonable things to
> try.  Whether they are useful is up to you and your customer to decide.
>
> Also, if your original matrix is A, then it is usually a shorter path to
> results to analyze the word (item) cooccurrence matrix A'A.  The methods
> below work either way.
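
With the Matrix package that cooccurrence product stays sparse and is cheap
to compute (a sketch, with A the binary user x item matrix):

library(Matrix)
C <- crossprod(A)     # t(A) %*% A; entry (i,j) counts users who hit both items
diag(C) <- 0          # self-cooccurrence usually isn't interesting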
>
>
> > I realize this is a Mahout user list, so I feel somewhat guilty
> constantly
> > pasting in R code, but here's how I ended up implementing it.
>
>
> I do it.  Use the right tools for the right task.  R is great for
> prototyping.
>
>
> > #rough equation that calculates log likelihood given a contingency table
> > #of counts
> > llr <- function(a,b,c,d){
> >    r=(a+b)/(c+d)
> >    e1=c*r
> >    e2=d*r
> >    g2=2*(a*log(a/e1)+b*log(b/e2))
> >    return(g2)
> > }
> >
>
> So this code implies that you are arranging the elements of the contingency
> table this way:
>
>  a    b  | a+b
>  c    d  | c+d
> ---------+--------
> a+c  b+d | a+b+c+d
>
> But in your later code you pass in data that uses this convention
>
>  a   b-a   | b
>  c   d-b-c | d-b
> -----------+-----
> a+c  d-a-c | d
>
> I don't like this form because it makes code more complex overall and
> breaks
> important symmetries.
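
For reference, here is the entropy form of the full 2x2 test from that blog
post as I read it (a sketch; k11, k12, k21, k22 are the four cells laid out
in the first convention above):

# Shannon entropy of a vector (or matrix) of raw counts
H <- function(k) {
  p <- k[k > 0] / sum(k)
  -sum(p * log(p))
}

# G^2 = 2 * N * (H(row sums) + H(column sums) - H(cells))
llr2x2 <- function(k11, k12, k21, k22) {
  k <- matrix(c(k11, k12, k21, k22), nrow = 2, byrow = TRUE)
  2 * sum(k) * (H(rowSums(k)) + H(colSums(k)) - H(k))
}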
>
> > #get counts to calculate log likelihood for the function above
> > a <- 1
> > b <- apply(A,2,sum)
> > c <- apply(A,1,sum)
>
> you can also use rowSums and colSums
>
> > d <- sum(A)
> >
> > #export sparse matrix A to a dataframe for calculating llr
> > triples <- summary(A)
> > #create b and c value vectors that sync up with the dataframe columns
> > bx <- b[triples$j]
> > cx <- c[triples$i]
> > #calculate the log likelihood for each entry in the dataframe
> > ll <- llr(a,bx,cx,d)
> >
>
> This probably needs to be something more like
>
> llr(a, bx-a, d-cx, d-bx-cx+a)
>
> (I think)
>
> Also, summary produces different values for different kinds of matrices.  As
> you point out, it is very handy for sparse matrices, but it is nice to have
> code that works on sparse or dense matrices.
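
One way to get (i, j, x) triples that works for either kind of matrix (a
sketch; the helper name is just made up):

nonzero.triples <- function(M) {
  if (is(M, "sparseMatrix")) {
    as.data.frame(summary(M))               # columns i, j, x
  } else {
    idx <- which(M != 0, arr.ind = TRUE)
    data.frame(i = idx[, 1], j = idx[, 2], x = M[idx])
  }
}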
>
>
> > #create a logical vector that indicates whether each entry in the
> > #dataframe is significant
> > significant <- ll > 3
> >
>
> I think that a much higher cutoff is better.  Commonly a cutoff of 10-30 is
> useful.  Also, you may be better off if you can take the k-highest.
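
A sketch of that k-highest variant, reusing the triples and ll from the
quoted code above (keeps at most k entries per row instead of one global
cutoff; k is just a knob to tune):

k <- 100
ord <- order(triples$i, -ll)          # within each row, highest score first
rank.in.row <- ave(ll[ord], triples$i[ord], FUN = seq_along)
keep <- logical(length(ll))
keep[ord] <- rank.in.row <= k
B.topk <- sparseMatrix(i = triples$i[keep], j = triples$j[keep], x = 1,
                       dims = dim(A))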
>
>
> > #build a new sparse matrix that keeps only the entries with a
> > #significant enough log likelihood
> > B <- sparseMatrix(i=triples$i[significant],j=triples$j[significant],x=1)
> >
>
>
>
> >
> > By the way, I always used to struggle with matrix multiplication (I
> > could never quite remember which side was rows and which side was
> > columns without little mnemonics, but then again I've always struggled
> > with remembering my left from my right).  I recently realized it makes
> > much more sense if you picture it as a cube in 3-space -- if you match
> > up the lower left-hand corner of the right matrix with the upper
> > right-hand corner of the left matrix and put the product in between
> > them so you get a backwards capital L shape, then fold back the right
> > and left matrices, you get three faces of a cube (or rectangular box).
> > Then you just project inwards from the original matrices, multiply
> > wherever the values cross, and sum up those products as you project
> > them toward the face that is the resulting matrix.  Once you visualize
> > it this way, it makes slicing and dicing the operation into blockwise
> > formulas trivial and obvious.  If only somebody had explained it this
> > way I would have found matrices much less intimidating -- has anybody
> > ever seen it explained this way?
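
The blockwise claim, at least, is easy to convince yourself of numerically.
A throwaway base-R sketch with random matrices (the names X and Y are just
made up for the example):

X <- matrix(rnorm(4 * 6), 4, 6)
Y <- matrix(rnorm(6 * 2), 6, 2)

# carve each factor into conformable blocks
X11 <- X[1:2, 1:3]; X12 <- X[1:2, 4:6]
X21 <- X[3:4, 1:3]; X22 <- X[3:4, 4:6]
Y1  <- Y[1:3, ];    Y2  <- Y[4:6, ]

# multiplying block by block gives the same answer as the whole product
blockwise <- rbind(X11 %*% Y1 + X12 %*% Y2,
                   X21 %*% Y1 + X22 %*% Y2)
all.equal(blockwise, X %*% Y)    # TRUE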
> >
> >
> >
> >
> > On Fri, Aug 26, 2011 at 10:21 PM, Lance Norskog <goks...@gmail.com>
> wrote:
> >
> > >
> > >
> >
> http://www.nytimes.com/interactive/2010/01/10/nyregion/20100110-netflix-map.html
> > >
> > > Do not fear demographics. Yes, some people rent movies with all-black
> > > casts, and other people rent movies with all-white casts. And the
> > > Walmarts in the SF East Bay have pallets full of Tyler Perry videos,
> > > while most of the SF Peninsula doesn't know his name.
> > >
> > > And sometimes a bunch of Army Rangers have a great time watching
> > > "Confessions of a Shopaholic" in a tent, at 120 degrees Fahrenheit, in
> > > Qatar.
> > >
> > > On Fri, Aug 26, 2011 at 8:29 AM, Jeff Hansen <dsche...@gmail.com>
> wrote:
> > >
> > > > Thanks for the math Ted -- that was very helpful.
> > > >
> > > > I've been using sparseMatrix() from library(Matrix) -- largely based
> > > > on your response to somebody else's email.  I've been playing with smaller
> > > matrices
> > > > mainly for my own learning purposes -- it's much easier to read
> through
> > > 200
> > > > movies (most of which I've heard of) and get a gut feel, than 10,000
> > > > movies.
> > > >  It also means I don't have to shut down my session quite as often
> > (I've
> > > > been using RStudio) when I run something over the wrong dimension
> > > > (column vs row).  I was running into some limitations as well, but I think some
> of
> > > > those may have had more to do with my own typos and misunderstanding
> of
> > > > that
> > > > language.
> > > >
> > > > There's a second reason I've been avoiding the tail -- while working
> on
> > > > this
> > > > as a possible project to propose, I had a bit of a "duh" moment.
> > >  Sometimes
> > > > it's important to realize the real world constraints.  Picture a
> > company
> > > > with very physical locations and very limited shelf space (say a
> > > > convenience
> > > > store or even a vending machine...)  Netflix and Amazon can make a
> lot
> > of
> > > > money in the long tail because they have centralized and or digital
> > > > inventories that service a large customer base.  Imagine a situation
> > > where
> > > > you have a small physical distributed inventory with an equally
> > > distributed
> > > > customer base.  In that situation it doesn't pay to chase after the
> > tail
> > > --
> > > > you just need to reduce your costs and target the head where you get
> > > > more volume with fewer items.  So you really end up with a three-tier
> > > > market
> > > > segmentation -- one strategy works best for the head, another for the
> > > body,
> > > > and a third for the tail.
> > > >
> > > > As far as clusters go -- I really wasn't finding any clusters at the
> > > edges
> > > > of the data, but that could have more to do with not including the
> tail
> > > > (and
> > > > not normalizing appropriately for popularity).
> > > >
> > > > Incidentally, there was one amusingly cautionary anecdote I'd share
> --
> > I
> > > > don't remember the exact spread, but at one point I had two extremes
> > that
> > > > included something like "the johnson family vacation", "my baby's
> > daddy",
> > > > and "barbershop 2" on the one side, and then titles like "mean
> girls",
> > > "50
> > > > first dates" and "along came polly" on the other side.  You may have
> to
> > > > look
> > > > up those titles to see what I'm talking about, but I'd say the
> > > distinction
> > > > will be pretty clear when you do -- and if you were simply automating
> > all
> > > > of
> > > > this and not reviewing it with common sense, you could end up
> offending
> > > > some
> > > > of your users...  All I'm saying is there may be trends in the real
> > world
> > > > that some people aren't comfortable having pointed out.
> > > >
> > > > On Fri, Aug 26, 2011 at 4:08 AM, Lance Norskog <goks...@gmail.com>
> > > wrote:
> > > >
> > > > > This is a little meditation on user vs. item matrix density. The
> > > > > heavy users and heavy items can be subsampled, once they are
> > > > > identified. Hadoop's built-in sort does give a very simple
> > > > > "map-increase" way to do this sort.
> > > > >
> > > > >
> > http://ultrawhizbang.blogspot.com/2011/08/sorted-recommender-data.html
> > > > >
> > > > > On Thu, Aug 25, 2011 at 5:57 PM, Ted Dunning <
> ted.dunn...@gmail.com>
> > > > > wrote:
> > > > > > In matrix terms the binary user x item matrix maps a set of items
> > to
> > > > > users
> > > > > > (A h = users who interacted with items in h).  Similarly A' maps
> > > users
> > > > to
> > > > > > items.  Thus A' (A h) is the classic "users who x'ed this also
> x'ed
> > > > that"
> > > > > > sort of operation.  This can be rearranged to be (A'A) h.
> > > > > >
> > > > > > This is where the singular values come in.  If
> > > > > >
> > > > > >     A \approx U S V'
> > > > > >
> > > > > > then, since we know that U'U = V'V = I from the definition of the
> > > > > > SVD, it follows that
> > > > > >
> > > > > >    A' A \approx (U S V')' (U S V') = V S U' U S V' = V S^2 V'
> > > > > >
> > > > > > But V' transforms an item vector into the reduced dimensional
> space
> > > so
> > > > we
> > > > > > can view this as transforming the item vector into the reduced
> > > > > dimensional
> > > > > > space, scaling element-wise using S^2 and then transforming back
> to
> > > > item
> > > > > > space.
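
That identity is easy to sanity-check numerically with a small random matrix
(a quick sketch using base R's svd):

X <- matrix(rnorm(20 * 8), 20, 8)
s <- svd(X)
all.equal(t(X) %*% X, s$v %*% diag(s$d^2) %*% t(s$v))   # TRUE up to rounding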
> > > > > >
> > > > > > As you noted, the first right singular vector corresponds to
> > > > popularity.
> > > > >  It
> > > > > > is often not appropriate to just recommend popular things (since
> > > users
> > > > > have
> > > > > > other means of discovering them) so you might want to drop that
> > > > dimension
> > > > > > from the analysis.  This can be done in the SVD results or it can
> > be
> > > > done
> > > > > > using something like a log-likelihood ratio test so that we only
> > > model
> > > > > > anomalously large values in A'A.
> > > > > >
> > > > > > Btw, if you continue to use R for experimentation, you should be
> > able
> > > > to
> > > > > > dramatically increase the size of the analysis you do by using
> > sparse
> > > > > > matrices and a random projection algorithm.  I was able to
> compute
> > > > SVD's
> > > > > of
> > > > > > matrices with millions of rows and columns and a few million
> > non-zero
> > > > > > elements in a few seconds.  Holler if you would like details
> about
> > > how
> > > > to
> > > > > do
> > > > > > this.
> > > > > >
> > > > > > On Thu, Aug 25, 2011 at 4:07 PM, Jeff Hansen <dsche...@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > >> One thing I found interesting (but not particularly surprising)
> is
> > > > that
> > > > > the
> > > > > >> biggest singular value/vector was pretty much tied directly to
> > > volume.
> > > > > >>  That
> > > > > >> makes sense because the best predictor of whether a given field's
> > > value
> > > > > was
> > > > > >> 1
> > > > > >> was whether it belonged to a row with lots of 1s or a column
> with
> > > lots
> > > > > of
> > > > > >> 1s
> > > > > >> (I haven't quite figured out the best method to normalize the
> > values
> > > > > with
> > > > > >> yet).  When I plot the values of the largest singular vectors
> > > against
> > > > > the
> > > > > >> sum of values in the corresponding row or column, the
> correlation
> > is
> > > > > very
> > > > > >> linear.  That's not the case for any of the other singular
> vectors
> > > > > (which
> > > > > >> actually makes me wonder if just removing that first singular
> > vector
> > > > > from
> > > > > >> the prediction might not be one of the best ways to normalize
> the
> > > > data).
> > > > > >>
> > > > > >> I understand what you're saying about each singular vector
> > > > corresponding
> > > > > to
> > > > > >> a feature though.  Each left singular vector represents some
> > > abstract
> > > > > >> aspect
> > > > > >> of a movie and each right singular vector represents users'
> > leanings
> > > or
> > > > > >> inclinations with regards to that aspect of the movie.  The
> > singular
> > > > > value
> > > > > >> itself just seems to indicate how good a predictor the
> combination
> > > of
> > > > > one
> > > > > >> user's inclination toward that aspect of a movie is for coming up
> > > with
> > > > > the
> > > > > >> actual value.  The issue I mentioned above is that popularity of
> a
> > > > movie
> > > > > as
> > > > > >> well as how often a user watches movies tend to be the best
> > > predictors
> > > > > of
> > > > > >> whether a user has seen or will see a movie.
> > > > > >>
> > > > > >> I had been picturing this with the idea of one k dimensional
> space
> > > --
> > > > > one
> > > > > >> where a user's location corresponds to their ideal prototypical
> > > > > >> movie's location. Not that there would be a movie right there, but there
> > > would
> > > > > be
> > > > > >> ones nearby, and the nearer they were, the more enjoyable they
> > would
> > > > be.
> > > > > >>  That's a naive model, but that doesn't mean it wouldn't work
> well
> > > > > enough.
> > > > > >>  My problem is I don't quite get how to map the user space over
> to
> > > the
> > > > > item
> > > > > >> space.
> > > > > >>
> > > > > >> I think that may be what Lance is trying to describe in his last
> > > > > response,
> > > > > >> but I'm falling short on reconstructing the math from his
> > > description.
> > > > > >>
> > > > > >> I get the following
> > > > > >>
> > > > > >> U S V* = A
> > > > > >> U S = A V
> > > > > >> U = A V 1/S
> > > > > >>
> > > > > >> I think that last line is what Lance was describing.  Of course
> my
> > > > > problem
> > > > > >> was bootstrapping in a user for whom I don't know A or V.
> > > > > >>
> > > > > >> I also think I may have missed a big step of the puzzle.  For
> some
> > > > > reason I
> > > > > >> thought that just by loosening the rank, you could recompose the
> > > > Matrix
> > > > > A
> > > > > >> from the truncated SVD values/vectors and use the recomposed
> > values
> > > > > >> themselves as the recommendation.  I thought one of the ideas
> was
> > > that
> > > > > the
> > > > > >> recomposed matrix had less "noise" and could be a better
> > > > representation
> > > > > of
> > > > > >> the underlying nature of the matrix than the original matrix
> > itself.
> > > > >  But
> > > > > >> that may have just been wishful thinking...
> > > > > >>
> > > > > >> On Thu, Aug 25, 2011 at 4:50 PM, Sean Owen <sro...@gmail.com>
> > > wrote:
> > > > > >>
> > > > > >> > The 200x10 matrix is indeed a matrix of 10 singular vectors,
> > which
> > > > are
> > > > > >> > eigenvectors of AA'. It's the columns, not rows, that are
> > > > > >> > eigenvectors.
> > > > > >> >
> > > > > >> > The rows do mean something. I think it's fair to interpret the
> > 10
> > > > > >> > singular values / vectors as corresponding to some underlying
> > > > features
> > > > > >> > of tastes. The rows say how much each user expresses those 10
> > > > tastes.
> > > > > >> > The matrix of right singular vectors on the other side tells
> you
> > > the
> > > > > >> > same thing about items. The diagonal matrix of singular values
> > in
> > > > the
> > > > > >> > middle also comes into play -- it's like a set of multipliers
> > that
> > > > say
> > > > > >> > how important each feature is. (This is why we cut out the
> > > singular
> > > > > >> > vectors / values that have the smallest singular values; it's
> > like
> > > > > >> > removing the least-important features.) So really you'd have
> to
> > > > stick
> > > > > >> > those values somewhere; Ted says it's conventional to put
> "half"
> > > of
> > > > > >> > each (their square roots) with each side if anything.
> > > > > >> >
> > > > > >> > I don't have as good a grasp on an intuition for the columns
> as
> > > > > >> > eigenvectors. They're also a set of basis vectors, and I had
> > > > > >> > understood them as like the "real" bases of the reduced
> feature
> > > > space
> > > > > >> > expressed in user-item space. But I'd have to go back and
> think
> > > that
> > > > > >> > intuition through again since I can't really justify it from
> > > scratch
> > > > > >> > again in my head just now.
> > > > > >> >
> > > > > >> >
> > > > > >> > On Thu, Aug 25, 2011 at 10:21 PM, Jeff Hansen <
> > dsche...@gmail.com
> > > >
> > > > > >> wrote:
> > > > > >> > > Well, I think my problem may have had more to do with what I
> > was
> > > > > >> calling
> > > > > >> > the
> > > > > >> > > eigenvector...  I was referring to the rows rather than the
> > > > columns
> > > > > of
> > > > > >> U
> > > > > >> > and
> > > > > >> > > V.  While the columns may be characteristic of the overall
> > > matrix,
> > > > > the
> > > > > >> > rows
> > > > > >> > > are characteristic of the user or item (in that they are a
> > rank
> > > > > reduced
> > > > > >> > > representation of that person or thing). I guess you could
> say
> > I
> > > > > just
> > > > > >> had
> > > > > >> > to
> > > > > >> > > tilt my head to the side and change my perspective 90
> degrees
> > =)
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Lance Norskog
> > > > > goks...@gmail.com
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Lance Norskog
> > > goks...@gmail.com
> > >
> >
>



-- 
Lance Norskog
goks...@gmail.com
