Hi Jake,

I have some related questions about the usage of the eigenvectors and eigenvalues generated by Lanczos; they're more or less on-topic, so I thought it'd be okay to post them here, but I can start a new thread if you'd prefer. I've been going through some of the mails on the dev list about projecting a matrix onto an SVD basis generated by Lanczos, in order to reduce the dimensionality of the matrix columns. The reduced matrix is then passed to KMeans for clustering.

(see "Re: Using SVD with Canopy/KMeans" thread on the dev list)

On 03/09/10 04:28, Jeff Eastman wrote:

"...A P is the projection of the data set onto the eigenvector basis:

A = original data matrix ([15x39])
P = eigenvector column matrix  ([39x7])
D = eigenvalue diagonal matrix

Then:
A P = P D => A = P D P' "

where the projection of A is achieved with: A times P (see TestClusterDumper.testKmeansSVD())
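As a sanity check on my reading, here's a minimal sketch (plain Python rather than Mahout, and the toy matrices are made up) of what that projection step amounts to:

```python
# Sketch of Jeff's projection step X = A . P, where the columns of P
# are eigenvectors. Toy dimensions; the real case would be e.g.
# A [15x39] times P [39x7] as in the quote above.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
P = [[1.0, 0.0],   # toy eigenvector basis (here just the identity,
     [0.0, 1.0]]   # so the "projection" leaves A unchanged)

X = matmul(A, P)   # each row of A re-expressed in the eigenvector basis
```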

On 12/09/10 04:21, Jake Mannix wrote:

"To adapt this to do SVD, you just substitute A = C' * C, for any
(non-square) matrix C. You get a set of eigenvectors of C' * C, which in
turn are the right-singular vectors of C (what is called V, if C = U D V')."

This seems fine so far, in that the P in A times P is the V from the SVD definition in Jake's mail.
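To convince myself of the A = C' * C substitution, I tried a case where the SVD is known by inspection (a toy diagonal C, nothing Mahout-specific): the eigenvalues of C'C come out as the *squares* of C's singular values, which matches what Jake says below about S being the square roots of the Lanczos eigenvalues.

```python
# For C = diag(3, 2) the SVD is trivial (U = V = identity, singular
# values 3 and 2), so we can check directly that the eigenvalues of
# A = C' * C are the squares of the singular values of C.

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]

C = [[3.0, 0.0],
     [0.0, 2.0]]

A = matmul(transpose(C), C)   # C' * C
# A is diagonal here, so its eigenvalues are its diagonal entries:
eigenvalues = [A[i][i] for i in range(len(A))]   # 9 = 3^2, 4 = 2^2
```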

However, from an older thread ("distributed svd" on the user list):

On 24/07/10 00:09, Jake Mannix wrote:

"What you want to do here is actually take the A matrix, represented by a
DistributedRowMatrix, and the V_k matrix (you actually get basically V_k's
transposed from the SVD job), represented by that same class, and then just
compute A . V . S^-1, where S is the diagonal matrix of singular values
(note that these are the sqrts of the eigenvalues you get out of Lanczos)."


From Jeff's mail above, and the code in TestClusterDumper, it seems the second step, the multiplication by S^-1, is neither performed nor required; i.e. the only step needed to project the original matrix A is:

Reduced matrix X = A . V (or A . P using Jeff's notation)

and the reduced matrix X can then be passed to KMeans for clustering. I wanted to confirm whether this is correct, and whether the diagonal matrix S (derived from the Lanczos-generated eigenvalues) can safely be ignored when projecting the original matrix. Is this why Lanczos only persists the eigenvectors and discards the eigenvalues (DistributedLanczosSolver.serializeOutput())?
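For concreteness, here's a toy sketch (plain Python, invented numbers) of how the two candidate projections differ. Since S is diagonal, the extra S^-1 factor just divides each column of A.V by the corresponding singular value; it rescales the reduced dimensions but doesn't mix them, which presumably matters for a distance-based algorithm like KMeans:

```python
# The two candidate projections: X = A.V versus X = A.V.S^-1.
# S is diagonal, so S^-1 just divides column j of A.V by sigma_j.

def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]

A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
V = [[1.0, 0.0],     # toy right-singular vectors (here: identity)
     [0.0, 1.0]]
sigmas = [2.0, 0.5]  # toy singular values (sqrts of Lanczos eigenvalues)

AV = matmul(A, V)                              # X = A . V
AVS = [[x / s for x, s in zip(row, sigmas)]    # X = A . V . S^-1
       for row in AV]                          # per-column rescaling only
```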

Thanks,

Derek

On 22/11/10 22:01, Jake Mannix wrote:
Hi Pedro,

I wrote the Lanczos stuff in Mahout, but I've been swamped with my new job
the past few months, and haven't been able to interact on the list much, but
I've got a minute or two to try and answer your questions.

First: the LanczosSolver does indeed leave you with eigenvalues in the
opposite order than you expect (ascending order, instead of descending).
It should be noted that in general, the Lanczos method *does not* give you
the "largest N eigenvalues": if you ask for a rank-N decomposition via
Lanczos, you'll get the smallest eigenvalue for sure, then somewhat less
than N/2 of the other small eigenvalues (with the last few being right in
the middle of the spectrum), then somewhat less than N/2 of the largest
eigenvalues (with the first few being in the middle of the spectrum),
ending with the largest eigenvalue.

If you really want the top N eigenvalues, then you should ask for somewhere
between 3N and 4N rank decomposition, and take the last N of these. You
won't be guaranteed to get *exactly all* of the top N eigenvalues,
especially if you have a very rapidly decreasing eigenvalue spectrum (as is
usually the case with real data), and you reach the "plateau" of middling
values before N. In this case, if there are K<N "large" eigenvalues, you'll
get all K of these, and then N-K's worth of a sampling of the middling
eigenvalues.
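In pseudo-Python, the "take the last N" step is just a slice off the end of the ascending list (all the numbers below are invented for illustration):

```python
# Ask Lanczos for rank ~3N, then keep the last N eigenvalues: they come
# back in ascending order, so the last N are the largest ones found.

N = 3
# hypothetical eigenvalues from a rank-3N Lanczos run: ascending, with
# a "plateau" of middling values between the small and large extremes
lanczos_eigenvalues = [0.1, 0.5, 1.0, 1.1, 1.2, 1.3, 1.4, 9.0, 25.0]

top_n = lanczos_eigenvalues[-N:]   # the N largest that were found
```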

One other thing you need to remember about the LanczosSolver: if your input
matrix is symmetric, pass in the boolean isSymmetric=true to the solve()
method, or else you'll get wrong values.

-jake

On Mon, Nov 22, 2010 at 1:07 PM, PEDRO MANUEL JIMENEZ RODRIGUEZ <
[email protected]> wrote:


Hi Ted,

I can't give you an exact amount but more or less it could be around 10^5
non-zero elements per row.

Could you please let me know why the Lanczos algorithm does not always
return the values in decreasing order?

Thanks.

Pedro.

----------------------------------------
From: [email protected]
Date: Fri, 19 Nov 2010 13:34:19 -0800
Subject: Re: Lanczos Algorithm
To: [email protected]

How many non-zero elements?

On Fri, Nov 19, 2010 at 12:34 PM, PEDRO MANUEL JIMENEZ RODRIGUEZ <
[email protected]> wrote:



I was talking about 10^9 rows and 10^9 columns

----------------------------------------
From: [email protected]
Date: Fri, 19 Nov 2010 12:07:16 -0800
Subject: Re: Lanczos Algorithm
To: [email protected]

On Fri, Nov 19, 2010 at 11:17 AM, PEDRO MANUEL JIMENEZ RODRIGUEZ <
[email protected]> wrote:

In this project I would have to work with matrices of 10^9, which have
very sparse data.


I think you mean 10^9 rows and 10^9 columns, with much fewer than 10^18
non-zero elements.

Is that correct?

Can you say how many non-zero elements?

