> You've done SVD on the 400K x 1.6M input matrix already, and found
> the results unsatisfactory? What was it you were using the singular
> vectors for, and how did you do the svd computation?

We tried this approach first because of all the attention it was getting
from the Netflix Prize. I first tried SVDLIBC, but it was taking an
extremely long time and gave no indication of how far along it was, so I
gave up on it and tried the gradient descent approach described by Simon
Funk at http://sifter.org/~simon/journal/20061211.html. I used the code at
http://www.timelydevelopment.com/demos/NetflixPrize.aspx to find the first
64 or so singular values and vectors. I honestly cannot remember how I
evaluated the results beyond looking at the recommendations and not getting
what I expected. (This was nearly 16 months ago, so maybe it's time to try
SVD again using what I've learned since then.)
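For reference, here's a bare-bones Python sketch of the Funk-style
factorization (one latent factor trained at a time by gradient descent on
the observed votes). The hyperparameters are made up and this is not the
timelydevelopment.com code we actually ran, just the shape of the idea:

    import numpy as np

    def funk_svd(triples, n_users, n_items, n_factors=64,
                 lr=0.001, reg=0.02, epochs=30):
        """triples: list of (user_index, item_index, rating)."""
        U = np.full((n_users, n_factors), 0.1)   # user feature matrix
        V = np.full((n_items, n_factors), 0.1)   # item feature matrix
        for f in range(n_factors):               # train one factor at a time
            for _ in range(epochs):
                for u, i, r in triples:
                    err = r - U[u, :f + 1] @ V[i, :f + 1]   # residual so far
                    uf, vf = U[u, f], V[i, f]
                    U[u, f] += lr * (err * vf - reg * uf)
                    V[i, f] += lr * (err * uf - reg * vf)
        return U, V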
One issue we have with the SVD is that it doesn't have any statistical
interpretation; we'd prefer results that come with probabilities and
confidences. I'm not trying to argue against the SVD, which works well in
plenty of fields, but we have found that it doesn't produce very good
results on our data.

We built our dataset by using the Digg API to fetch three months of Digg
activity. Here are some details:

  470,640 users
  1,606,789 articles
  13,281,941 votes (0.00175% nonzero)
  43% of users voted on 3 or fewer articles (one vote per month)
  23% of users voted on more than 10 articles (87% of the data)
  0.05% of users voted on more than 100 articles (71% of votes)

I'd love to hear the experiences of people working with similar datasets.

> H is the *distance* using your kernel function, or *similarity*? If it's
> sparse, I imagine you mean it's really H(i,j) = k(r_i, r_j) where k is the
> generalized inner product / kernel, right?

Right, similarity, sorry. And yes, H(i,j) = k(r_i, r_j). We're currently
using a Jaccard kernel, but we haven't gotten far enough into this method
to try other kernel functions.

> What I mean is this: when computing the SVD of your original matrix
> via Lanczos, you compute (A*A^t) * v by taking the rows r_i and
> summing up all of the: ( v.dot(r_i) ) * r_i (a similar thing happens
> with GHA). If your kernel can be represented as
>
> k(r_i, r_j) = f(r_i).dot( f(r_j) )
>
> where f is some map from R^(A.numCols) to R^(kernelDimension)
> Then you can just apply the kernel to the rows as you iterate over
> them and compute the SVD of the kernelized form of A.

If I understand this correctly, this is essentially what we're doing, just
combined into a single step. Ultimately this approach still ends up
computing the eigenvectors/values of the H matrix, it just takes the data
matrix as input instead of the kernel matrix, right? It doesn't seem to me
like we're saving any cycles here, though that's not to say I'm opposed to
the idea. I'm not intimately familiar with how the Lanczos algorithm works,
but I'm pretty sure it makes several passes, which means your suggestion
would either repeatedly calculate k(r_i, r_j) or have to store the previous
results. If it's the former, it doesn't seem like a great approach since
we're wasting cycles, and if it's the latter, there's no real difference
between passing in the kernel matrix and finding it on the fly.
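To check my understanding, here's a toy Python/scipy sketch of what I think
you're describing: map the rows with f, then pull the top eigenpairs of
H = B B^t without ever materializing H. I used the cosine kernel as a
stand-in, since it factors as k(r_i, r_j) = f(r_i).dot(f(r_j)) with
f(r) = r / ||r||; our Jaccard kernel doesn't obviously factor that way, and
the sizes below are made up:

    # Toy sketch, not our real pipeline: eigendecompose H = B B^T implicitly,
    # where B holds the kernel-mapped rows f(r_i). Cosine stands in for
    # Jaccard; sizes are illustrative only.
    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import LinearOperator, eigsh

    n_users, n_articles, rank = 2000, 5000, 20
    A = sp.random(n_users, n_articles, density=0.001, format="csr")

    # f(r) = r / ||r||, so B @ B.T is the cosine-similarity matrix H
    norms = np.sqrt(np.asarray(A.multiply(A).sum(axis=1))).ravel()
    norms[norms == 0] = 1.0
    B = sp.diags(1.0 / norms) @ A

    # Each matvec H*v = B @ (B.T @ v) re-reads every row of B; this is the
    # repeated per-iteration work I was asking about.
    def matvec(v):
        return B @ (B.T @ v)

    H = LinearOperator((n_users, n_users), matvec=matvec, dtype=np.float64)
    eigenvalues, eigenvectors = eigsh(H, k=rank, which="LM")

If that's the right picture, then every iteration re-reads the kernelized
rows (the repeated work I mentioned), but nothing the size of H ever gets
built, which I can see being the point.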
> The other problem with
> this approach is that you really need to make sure you're doing
> a hashing trick on the kernelized rows (a la stochastic
> decomposition), or else you're probably going to run into problems
> of size on your singular vector set (their dimensionality will be too
> big). Stochastic decomposition is really the thing that will make
> this work much better (and I've got planned for the 0.4 release).

At the risk of sounding woefully ignorant, I don't know what this means. I
will read about it and hopefully have something meaningful to say
tomorrow :)

> What kinds of kernels are you planning to use?

Like I mentioned earlier, we're currently trying out the Jaccard kernel. My
research partner is reading through Kernel Methods for Pattern Analysis,
which will hopefully yield some good suggestions. We plan to try some graph
kernels as well, but there are no specifics yet.

> How do you
> plan on dealing with the userSparsity * numUsers^2 * kernelCost
> computational cost of producing H? That's even higher than the
> desiredRank * numUsers * userSparsity cost of Lanczos.

It's easy enough to MapReduce the calculation of H for most kernels; the
output is about 50GB from a ~150MB input matrix. With thresholding we can
shrink this down a bit, and it may even turn out that we don't need to
compute H for all users; we might be able to pick out some subset. We're
doing this now with a 10% random sample, and can run kPCA on a single
machine. Also, doesn't the Lanczos cost you quote ignore the cost of
evaluating the kernel function?
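For concreteness, here's a single-machine Python sketch of roughly what our
H job does with the Jaccard kernel. The real thing is a MapReduce over the
article -> voters lists; the names below are just illustrative:

    # Sketch: build an inverted index (article -> users who voted on it),
    # accumulate pairwise co-vote counts, then Jaccard = |intersect| / |union|.
    from collections import defaultdict
    from itertools import combinations

    def jaccard_h(votes):
        """votes: iterable of (user, article) pairs, one per digg."""
        article_voters = defaultdict(set)    # inverted index: article -> voters
        votes_per_user = defaultdict(int)
        for user, article in votes:
            article_voters[article].add(user)
            votes_per_user[user] += 1

        co_votes = defaultdict(int)           # |r_i intersect r_j| per user pair
        for voters in article_voters.values():
            for u, v in combinations(sorted(voters), 2):
                co_votes[(u, v)] += 1

        h = {}                                # sparse upper triangle of H
        for (u, v), inter in co_votes.items():
            union = votes_per_user[u] + votes_per_user[v] - inter
            h[(u, v)] = inter / union         # Jaccard similarity
        return h

The numUsers^2-ish blow-up you point out lives in the pairs loop: a single
article with a few thousand diggs contributes millions of user pairs, which
is a big part of where the ~50GB comes from, and thresholding or sampling
users is how we plan to keep it tractable.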
Steven Buss
[email protected]
http://www.stevenbuss.com/

On Wed, Feb 24, 2010 at 5:30 PM, Jake Mannix <[email protected]> wrote:
> Hi Steven,
>
> Glad to see you here in the land of more-than-140chars!
>
> On Wed, Feb 24, 2010 at 1:53 PM, Steven Buss <[email protected]> wrote:
>>
>> My research partner and I have a dataset that consists of 400,000
>> users and 1.6 million articles, with about 22 million nonzeros. We are
>> trying to use this data to make recommendations to users. We have
>> tried using the SVD and PLSI, both with unsatisfactory results, and
>> are now attempting kPCA.
>
> You've done SVD on the 400K x 1.6M input matrix already, and found
> the results unsatisfactory? What was it you were using the singular
> vectors for, and how did you do the svd computation?
>
>> We have a 400,000 by 400,000 sparse symmetric positive-definite
>> matrix, H, that we need the top couple hundred eigenvectors/values
>> for. Jake has told me that I can use MAHOUT-180 unchanged, but it will
>> be doing redundant work and the output eigenvalues are the squares of
>> the ones we actually want. This sounds like a good approach, but it
>> would be great if mahout had an optimized eigendecomposition for
>> symmetric matrices. Jake suggested I submit a JIRA ticket regarding
>> this, which I plan to do.
>
> Depending on how much RAM you have on your box, "couple hundred"
> eigenvectors on 400k dimensions is at least "a couple" GB, and since
> Lanczos tends to get less eigenvectors than you expect (so you need
> to ask for more than you think you want), "a couple" could turn into
> maybe 10GB. Doable, but hitting the boundaries of scalability.
>
> I'm in the process of filing another JIRA ticket concerning removing
> the need to keep the full Lanczos basis in RAM (because it's not really
> necessary, although it will be slower without the RAM available), to
> make this more scalable (the current implementation scales to "infinite"
> row size, but not infinite column size).
>
>> H is the pairwise distance in feature space (calculated using a kernel
>> function) between each pair of users (or some subset of users). After
>> I mentioned this to Jake, he asked me "why aren't you just doing it
>> all in one go? Kernelize on the rows, and do SVD on that? Why do the
>> M*M^t intermediate step?" Unfortunately, I'm not sure what you're
>> asking, Jake, can you clarify?
>
> H is the *distance* using your kernel function, or *similarity*? If it's
> sparse, I imagine you mean it's really H(i,j) = k(r_i, r_j) where k is the
> generalized inner product / kernel, right?
>
> What I mean is this: when computing the SVD of your original matrix
> via Lanczos, you compute (A*A^t) * v by taking the rows r_i and
> summing up all of the: ( v.dot(r_i) ) * r_i (a similar thing happens
> with GHA). If your kernel can be represented as
>
> k(r_i, r_j) = f(r_i).dot( f(r_j) )
>
> where f is some map from R^(A.numCols) to R^(kernelDimension)
> Then you can just apply the kernel to the rows as you iterate over
> them and compute the SVD of the kernelized form of A.
>
> Of course, if you're not doing a kernel representable in that way,
> then you're not getting far down this road. The other problem with
> this approach is that you really need to make sure you're doing
> a hashing trick on the kernelized rows (a la stochastic
> decomposition), or else you're probably going to run into problems
> of size on your singular vector set (their dimensionality will be too
> big). Stochastic decomposition is really the thing that will make
> this work much better (and I've got planned for the 0.4 release).
>
> What kinds of kernels are you planning to use? How do you
> plan on dealing with the userSparsity * numUsers^2 * kernelCost
> computational cost of producing H? That's even higher than the
> desiredRank * numUsers * userSparsity cost of Lanczos.
>
> -jake
