On Wed, Feb 24, 2010 at 8:12 PM, Steven Buss <[email protected]> wrote:

> > You've done SVD on the 400K x 1.6M input matrix already, and found
> > the results unsatisfactory?  What was it you were using the singular
> > vectors for, and how did you do the svd computation?
>
> We tried this approach first due to the stuff happening with the
> Netflix prize. I first tried using SVDLIBC, but it was taking an
> extremely long time and didn't give me any indication of how far it
> had come in computing the SVD, so I gave up on it and tried a gradient
> descent approach as described at
> http://sifter.org/~simon/journal/20061211.html by Simon Funk.


Ok, cool.  Brandyn Webb (aka Simon Funk) is actually the co-author of
the paper I used for the GHA code for SVD currently in Mahout.  It's
very fast for a non-parallel algorithm, can certainly scale pretty well
to the sizes you're talking about, and should let you just plug in
your kernel as well (I think he describes doing exactly that in the
article you linked to).
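
For reference, here's a minimal sketch of the Funk-style incremental
SVD (plain Python/NumPy, *not* the Mahout GHA code; the learning rate,
regularization, and epoch count are just illustrative defaults):

import numpy as np

def funk_svd(ratings, num_users, num_items, rank=64,
             lr=0.001, reg=0.02, epochs=20):
    """ratings: list of (user, item, value) triples, i.e. the sparse votes."""
    U = np.full((num_users, rank), 0.1)   # user feature matrix
    V = np.full((num_items, rank), 0.1)   # item feature matrix
    for f in range(rank):                 # train one feature at a time
        for _ in range(epochs):
            for u, i, r in ratings:
                err = r - U[u].dot(V[i])  # prediction error on this vote
                uf, vf = U[u, f], V[i, f]
                U[u, f] += lr * (err * vf - reg * uf)
                V[i, f] += lr * (err * uf - reg * vf)
    return U, V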


> I used
> the code at http://www.timelydevelopment.com/demos/NetflixPrize.aspx
> to find the first 64 or so singular values and vectors. I honestly
> cannot remember how I evaluated the results beyond looking at the
> recommendations and not getting what I expected. (This was nearly 16
> months ago, so maybe it's time to try SVD again using what I've
> learned since then).


It is pretty important to have a good metric on your results, because
little tweaks (how you normalize your input vectors, whether you
mean-center them, how you deal with the missing data, etc.) can have
significant effects on the results.
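
As a concrete (purely hypothetical) illustration of what I mean by a
metric: even a tiny held-out RMSE harness makes those tweaks comparable,
something like

import numpy as np

def rmse(predict, heldout):
    """heldout: (user, item, rating) triples withheld from training;
    predict(user, item) is whatever model variant you're evaluating."""
    errs = [(r - predict(u, i)) ** 2 for u, i, r in heldout]
    return float(np.sqrt(np.mean(errs)))

# e.g. score a mean-centered variant and a raw-counts variant against
# the same held-out split before eyeballing any recommendations.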


> One issue we have with the SVD is that it doesn't
> have any statistical meaning, we'd prefer to have results with
> probability and confidence.
>

Yeah, SVD can give pretty crappy results if you have a lot of
very small counts, and moving to something like Latent Dirichlet
Allocation (also available in Mahout!) might do better.


> 470,640 users
> 1,606,789 articles
> 13,281,941 votes (0.00175% nonzero)
> 43% of users voted on 3 or fewer articles (one vote per month)
> 23% of users voted on more than 10 articles (87% of the data)
> 0.05% of users voted on more than 100 articles (71% of votes)
>

Interesting data set - far more "items" than users, and *really*
sparse.  SVD could definitely give you super crappy results for
this data set, if my intuition is right.


> > H is the *distance* using your kernel function, or *similarity*?  If it's
> > sparse, I imagine you mean it's really H(i,j) = k(r_i, r_j) where k is
> the
> > generalized inner product / kernel, right?
>
> Right, similarity, sorry. And yes, H(i,j) = k(r_i, r_j). We're
> currently using a jaccard kernel, but we haven't gotten far enough
> into trying out this method to try other kernel functions.
>

Ok cool.  For your data set, I bet Jaccard (Tanimoto) would work
pretty well - certainly better than Euclidean or cosine.  But it
doesn't do anything for your super sparse situation.  I'm not sure
what would, off the top of my head.
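
For concreteness, the Tanimoto/Jaccard kernel on binary vote vectors
is just the following (trivial Python sketch, treating each user row
as the set of articles they voted on):

def jaccard(votes_i, votes_j):
    """votes_i, votes_j: sets of article ids voted on by users i and j.
    k(r_i, r_j) = |intersection| / |union|, in [0, 1]."""
    if not votes_i or not votes_j:
        return 0.0
    inter = len(votes_i & votes_j)
    return inter / float(len(votes_i) + len(votes_j) - inter)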


> > What I mean is this: when computing the SVD of your original matrix
> > via Lanczos, you compute (A*A^t) * v by taking the rows r_i and
> > summing up all of the: ( v.dot(r_i) ) * r_i  (a similar thing happens
> > with GHA).  If your kernel can be represented as
> >
> >  k(r_i, r_j) = f(r_i).dot( f(r_j) )
> >
> > where f is some map from R^(A.numCols) to R^(kernelDimension)
> > Then you can just apply the kernel to the rows as you iterate over
> > them and compute the SVD of the kernelized form of A.
>
> If I understand this correctly, this is essentially what we're doing,
> just combined into the same step. Ultimately this approach will end up
> computing the eigenvectors/values for the H matrix, but it only takes
> the data matrix as input instead of the kernel matrix, right? It
> doesn't seem to me like we're saving any cycles here, but that's not
> to say that I'm opposed to the idea.
>

Yes, you are essentially computing the eigenvectors/values of H, so
the question is whether it does save cycles.  To jump ahead to
your questioning of my scaling analysis at the end:

Compute H:

  O(numUsers^2 * kernelCost(userSparsity))

where properly, the cost of applying the kernel is a function of
userSparsity = number of nonzero elements in a given row.

Then inevitably H will be a bit more dense than your original
matrix (but not terribly so - it seems that you may have a
significant proportion of rows of (H - Identity) with *no*
nonzero entries, which makes it very hard to do recommendations
for those users).

But regardless, if you then run Lanczos eigendecomposition on
H, it'll be nice and fast, scaling something like

O(k * numUsers * numSimilarUsersPerUser ).

If instead you compute kernelized Lanczos all in one fell
swoop, you'll be making k passes over the data, and computing
the kernel each time, yes, but the scaling is just that of Lanczos:

O(k * numUsers * kernelCost(userSparsity) )

(as you mention, I forgot the kernel cost here, yes)

This is tremendously better than computing the full kernel matrix,
by a factor of numUsers / k = 5000 or so.  Unless I'm missing
a factor somewhere.
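
Putting rough numbers on that (a back-of-the-envelope sketch using your
user count above, assuming k = 100 and treating kernelCost as the same
constant on both sides):

num_users = 470_640
k = 100                                  # eigen/Lanczos vectors you actually want

full_kernel_matrix = num_users ** 2      # ~2.2e11 kernel evaluations
kernelized_lanczos = k * num_users       # ~4.7e7  kernel evaluations

print(full_kernel_matrix // kernelized_lanczos)   # ~4706, the factor above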



> I'm not intimately familiar with how the Lanczos algorithm works, but
> I am pretty sure that it will do several passes, meaning your
> suggestion will either repeatedly calculate k(r_i, r_j), or have to
> store the previous results. If it's the former, then it doesn't seem
> like a great approach since we're wasting cycles, and if it's the
>

It's the former, but the difference is between computing k(r_i, r_j)
for all i and j, and computing k(r_i, V) for all i, for a set of only
k = hundreds of dense eigenvectors (well, Lanczos vectors, used to
produce eigenvectors).  Way way way less computation of the kernel.
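
In sketch form, the product at the heart of each Lanczos pass looks
like this (assuming your kernel factors as k(r_i, r_j) = f(r_i).dot(f(r_j))
as above; f and the row iterator here are placeholders, not Mahout APIs):

import numpy as np

def kernelized_matvec(rows, f, v):
    """The Lanczos matrix-vector product described earlier:
    sum_i ( v.dot(f(r_i)) ) * f(r_i), computed in one pass over the
    data without ever materializing the full kernel matrix H.
    rows: iterable over the sparse user rows r_i.
    f: feature map with k(r_i, r_j) = f(r_i).dot(f(r_j)).
    v: the current dense Lanczos vector (length = dim of f's output)."""
    result = np.zeros(len(v))
    for r_i in rows:
        fr = f(r_i)                 # kernelized row, computed on the fly
        result += fr.dot(v) * fr    # ( v.dot(f(r_i)) ) * f(r_i)
    return result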


> > The other problem with
> > this approach is that you really need to make sure you're doing
> > a hashing trick on the kernelized rows (a la stochastic
> > decomposition), or else you're probably going to run into problems
> > of size on your singular vector set (their dimensionality will be too
> > big).  Stochastic decomposition is really the thing that will make
> > this work much better (and I've got planned for the 0.4 release).
>
> At the risk of sounding woefully ignorant, I don't know what this
> means. I will read about it and hopefully have something meaningful to
> say tomorrow :)
>

Check out: http://arxiv.org/abs/0909.4061
It's a great review, and it allows really interesting kernels (even the
infinite-dimensional ones) to be treated in the same way as I'm
describing above, without incurring the O(numUsers^2) cost of
computing the full kernel matrix.

It's probabilistic in the sense that it is not guaranteed to work, but
the bounds on the chance of failure are so good that you don't really
need to worry about it (or rather, you worry about it just enough to be
sure you've chosen your parameters such that it won't fail unless it's
also the day you win the lottery :) ).
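
The core construction from that paper, very roughly (a sketch of the
basic fixed-rank randomized SVD, not the adaptive or power-iteration
variants it also covers; the oversampling p = 10 is just the usual
illustrative choice):

import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Approximate rank-k SVD of A via random projection (Halko et al.).
    A: (m x n) array; k: target rank; p: small oversampling parameter
    that controls how unlikely a bad approximation is."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))    # random test matrix
    Y = A @ Omega                              # sample the range of A
    Q, _ = np.linalg.qr(Y)                     # orthonormal basis for that range
    B = Q.T @ A                                # small (k+p) x n problem
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]      # approximate U, sigma, V^T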


> Like I mentioned earlier, we're currently trying out the jaccard
> kernel. My research partner is currently reading through Kernel
> Methods for Pattern Analysis, which will hopefully have some good
> suggestions. We plan to try some graph kernels as well (but there are
> no specifics yet).
>

I'd love to hear your thoughts on what kinds of kernels do well, because
we don't have much kernel work in Mahout yet, and knowing what
people like / find useful can help us know what to add to the libraries
here (plus I just find it interesting).


> It's an easy enough algorithm to MapReduce the calculation of H for
> most kernels. The output of this is about 50GB from a ~150MB input
> matrix. With thresholding, we can shrink this down a bit, and it might
> even be the case that we don't need to find H for all users. We might
> be able to pick out some subset of users. We're doing this now with a
> 10% random sample, and can run kPCA on a single machine.
>

Oh it's certainly parallelizable, but it's still O(numUsers^2).  Well,
actually no, you're right.  If you properly MapReduce this matrix
product sparsely, you can quickly tell that at least for things like
Euclidean inner products, cosine, and Jaccard/Tanimoto, you don't even
need to do any work for a significant chunk of the user-user pairs.
The question then becomes whether, for any given user, you end up
computing fewer than k = reducedRank kernel inner products (k being the
rank of your dimensional reduction).  If so, then your MapReduce will
indeed be more efficient than doing the one-pass kernelized Lanczos,
because the kernel computation is typically very, very fast for each
user-user pair (probably safe to consider it O(1) when nonzero), while
for user-denseEigen pairs it'll scale as the number of nonzero entries
in f(r_i).
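
A rough sketch of that sparse pairwise computation (plain single-machine
Python standing in for the map/reduce phases; only user pairs that
actually share at least one article are ever touched):

from collections import defaultdict
from itertools import combinations

def sparse_jaccard_pairs(user_votes):
    """user_votes: dict of user -> set of article ids.
    Returns {(i, j): jaccard} only for co-voting user pairs."""
    # "map" phase: invert to article -> set of voters
    voters = defaultdict(set)
    for u, arts in user_votes.items():
        for a in arts:
            voters[a].add(u)
    # "reduce" phase: count shared articles per co-voting pair
    shared = defaultdict(int)
    for us in voters.values():
        for i, j in combinations(sorted(us), 2):
            shared[(i, j)] += 1
    return {(i, j): c / float(len(user_votes[i]) + len(user_votes[j]) - c)
            for (i, j), c in shared.items()}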

And you know, if you've got some clever MapReduce code to do the
full sparse kernelized H computation, we could certainly use it here
in Mahout!

  -jake
