Hi Jake,

I have some related questions about the usage of the eigenvectors and eigenvalues generated by Lanczos; they're more or less on-topic, so I thought it'd be okay to post them here, but I can start a new thread if you'd prefer. I've been going through some of the mails on the dev list about projecting a matrix onto an SVD basis generated by Lanczos, in order to reduce the dimensionality of the matrix columns. The reduced matrix is then passed to KMeans for clustering.

(see "Re: Using SVD with Canopy/KMeans" thread on the dev list)

On 03/09/10 04:28, Jeff Eastman wrote:

"...A P is the projection of the data set onto the eigenvector basis:

A = original data matrix ([15x39])
P = eigenvector column matrix  ([39x7])
D = eigenvalue diagonal matrix

Then:
A P = P D => A = P D P' "

where the projection of A is achieved with: A times P (see TestClusterDumper.testKmeansSVD())
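As a sanity check on my reading, here's a minimal sketch (plain Python rather than Mahout, and the toy matrices are made up) of what that projection step amounts to:

```python
# Sketch of Jeff's projection step X = A . P, where the columns of P
# are eigenvectors. Toy dimensions; the real case would be e.g.
# A [15x39] times P [39x7] as in the quote above.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
P = [[1.0, 0.0],   # toy eigenvector basis (here just the identity,
     [0.0, 1.0]]   # so the "projection" leaves A unchanged)

X = matmul(A, P)   # each row of A re-expressed in the eigenvector basis
```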

On 12/09/10 04:21, Jake Mannix wrote:

"To adapt this to do SVD, you just substitute A = C' * C, for any
(non-square) matrix C. You get a set of eigenvectors of C' * C, which in
turn are the right-singular vectors of C (what is called V, if C = U D V')."

This seems fine so far, in that the P in A times P is the V from the SVD definition in Jake's mail.
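To convince myself of the A = C' * C substitution, I tried a case where the SVD is known by inspection (a toy diagonal C, nothing Mahout-specific): the eigenvalues of C'C come out as the *squares* of C's singular values, which matches what Jake says below about S being the square roots of the Lanczos eigenvalues.

```python
# For C = diag(3, 2) the SVD is trivial (U = V = identity, singular
# values 3 and 2), so we can check directly that the eigenvalues of
# A = C' * C are the squares of the singular values of C.

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]

C = [[3.0, 0.0],
     [0.0, 2.0]]

A = matmul(transpose(C), C)   # C' * C
# A is diagonal here, so its eigenvalues are its diagonal entries:
eigenvalues = [A[i][i] for i in range(len(A))]   # 9 = 3^2, 4 = 2^2
```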

However, from an older thread ("distributed svd" on the user list):

On 24/07/10 00:09, Jake Mannix wrote:

"What you want to do here is actually take the A matrix, represented by a
DistributedRowMatrix, and the V_k matrix (you actually get basically V_k's
transposed from the SVD job), represented by that same class, and then just
compute A . V . S^-1, where S is the diagonal matrix of singular values
(note that these are the sqrts of the eigenvalues you get out of Lanczos)."


From Jeff's mail above, and the code in TestClusterDumper, it seems the second step, the multiplication by S^-1, is neither performed nor required; i.e. the only step needed to project the original matrix A is:

Reduced matrix X = A . V (or A . P using Jeff's notation)

and the reduced matrix X can then be passed to KMeans for clustering. I wanted to confirm whether this is correct, and whether the diagonal matrix S (derived from the Lanczos-generated eigenvalues) can safely be ignored when projecting the original matrix. Is this why Lanczos only persists the eigenvectors and discards the eigenvalues (DistributedLanczosSolver.serializeOutput())?
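For concreteness, here's a toy sketch (plain Python, invented numbers) of how the two candidate projections differ. Since S is diagonal, the extra S^-1 factor just divides each column of A.V by the corresponding singular value; it rescales the reduced dimensions but doesn't mix them, which presumably matters for a distance-based algorithm like KMeans:

```python
# The two candidate projections: X = A.V versus X = A.V.S^-1.
# S is diagonal, so S^-1 just divides column j of A.V by sigma_j.

def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]

A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
V = [[1.0, 0.0],     # toy right-singular vectors (here: identity)
     [0.0, 1.0]]
sigmas = [2.0, 0.5]  # toy singular values (sqrts of Lanczos eigenvalues)

AV = matmul(A, V)                              # X = A . V
AVS = [[x / s for x, s in zip(row, sigmas)]    # X = A . V . S^-1
       for row in AV]                          # per-column rescaling only
```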

Thanks,

Derek

On 22/11/10 22:01, Jake Mannix wrote:
Hi Pedro,

I wrote the Lanczos stuff in Mahout, but I've been swamped with my new job
the past few months, and haven't been able to interact on the list much, but
I've got a minute or two to try and answer your questions.

First: the LanczosSolver does indeed leave you with eigenvalues in the
opposite order than you expect (ascending order, instead of descending).
It should be noted that in general, the Lanczos method *does not* give you
the "largest N eigenvalues": if you ask for a rank-N decomposition via
Lanczos, you'll get the smallest eigenvalue for sure, then somewhat less
than N/2 of the other small eigenvalues (with the last few being right in
the middle of the spectrum), then somewhat less than N/2 of the largest
eigenvalues (with the first few being in the middle of the spectrum),
ending with the largest eigenvalue.

If you really want the top N eigenvalues, then you should ask for somewhere
between 3N and 4N rank decomposition, and take the last N of these. You
won't be guaranteed to get *exactly all* of the top N eigenvalues,
especially if you have a very rapidly decreasing eigenvalue spectrum (as is
usually the case with real data), and you reach the "plateau" of middling
values before N. In this case, if there are K<N "large" eigenvalues, you'll
get all K of these, and then N-K's worth of a sampling of the middling
eigenvalues.
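In pseudo-Python, the "take the last N" step is just a slice off the end of the ascending list (all the numbers below are invented for illustration):

```python
# Ask Lanczos for rank ~3N, then keep the last N eigenvalues: they come
# back in ascending order, so the last N are the largest ones found.

N = 3
# hypothetical eigenvalues from a rank-3N Lanczos run: ascending, with
# a "plateau" of middling values between the small and large extremes
lanczos_eigenvalues = [0.1, 0.5, 1.0, 1.1, 1.2, 1.3, 1.4, 9.0, 25.0]

top_n = lanczos_eigenvalues[-N:]   # the N largest that were found
```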

One other thing you need to remember about the LanczosSolver: if your input
matrix is symmetric, pass in the boolean isSymmetric=true to the solve()
method, or else you'll get wrong values.

-jake

On Mon, Nov 22, 2010 at 1:07 PM, PEDRO MANUEL JIMENEZ RODRIGUEZ <
[email protected]> wrote:


Hi Ted,

I can't give you an exact amount but more or less it could be around 10^5
non-zero elements per row.

Could you please let me know why the Lanczos algorithm does not always
return the values in decreasing order?

Thanks.

Pedro.

----------------------------------------
From: [email protected]
Date: Fri, 19 Nov 2010 13:34:19 -0800
Subject: Re: Lanczos Algorithm
To: [email protected]

How many non-zero elements?

On Fri, Nov 19, 2010 at 12:34 PM, PEDRO MANUEL JIMENEZ RODRIGUEZ <
[email protected]> wrote:



I was talking about 10^9 rows and 10^9 columns

----------------------------------------
From: [email protected]
Date: Fri, 19 Nov 2010 12:07:16 -0800
Subject: Re: Lanczos Algorithm
To: [email protected]

On Fri, Nov 19, 2010 at 11:17 AM, PEDRO MANUEL JIMENEZ RODRIGUEZ <
[email protected]> wrote:

In this project I would have to work with matrices of 10^9, which have
very sparse data.


I think you mean 10^9 rows and 10^9 columns, with much fewer than 10^18
non-zero elements.

Is that correct?

Can you say how many non-zero elements?

