On Tue, Nov 23, 2010 at 9:58 AM, Derek O'Callaghan <[email protected]> wrote:
> Hi Jake,
>
> I have some related questions about the usage of the eigenvectors and
> eigenvalues generated by Lanczos, they're more or less on-topic so I thought
> it'd be okay to post them here, but I can start a new thread if you like.
> I've been going through some of the mails on the dev list regarding the
> projection of a matrix onto an SVD basis which is generated by Lanczos, in
> order to reduce the dimensionality of the matrix columns. The new matrix is
> then passed to KMeans for clustering.
>

Ok, sounds good.

> From Jeff's mail above, and the code in TestClusterDumper, it seems like
> the second multiplication by S^-1 step is not performed/required, i.e. the
> only step to project the original matrix A is:
>
> Reduced matrix X = A . V (or A . P using Jeff's notation)
>

Not sure about what is done in TestClusterDumper, but in general, to project the original rows of your matrix onto the reduced space defined by the decomposition, you do want to rescale by S^-1, or else you'll basically find that all of your rows seem to point in the direction of the largest eigenvector (that's why it's the largest eigenvector: most of the matrix points in its direction!).

> and the reduced matrix X can then be passed to KMeans for clustering. I
> wanted to confirm if this is correct, and that the S (derived from the
> Lanczos-generated eigenvalues) diagonal matrix can be ignored when
> projecting the original matrix? Is this the reason why Lanczos only persists
> the eigenvectors, and discards the eigenvalues
> (DistributedLanczosSolver.serializeOutput())?
>

I don't think so. I think you do want the eigenvalues as well. Because Lanczos can sometimes have stability issues, and end up with repeats of eigenvector/eigenvalue pairs, you need to do some checking on the output.
This is done in the EigenVerificationJob class, which takes your original corpus and the supposed eigenvectors (it doesn't need the eigenvalues), throws away any duplicates or incorrect vectors, recomputes the eigenvalues/singular values, and stores them along with the vectors (see the method saveCleanEigens()).

These recent discussions remind me that the EigenVerificationJob needs to be folded into the DistributedLanczosSolver, because it's confusing and nobody sees that they typically need to use it.

  -jake
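
To make the rescaling point above concrete, here is a minimal sketch in plain Python. It is not Mahout code and does not use any Mahout API: the helper names (matmul, project) and the toy matrices are made up for illustration, and it assumes A = U . S . V^T with orthonormal columns in V and the singular values on the diagonal of S.

```python
# Toy illustration only: hypothetical helpers, not part of Mahout.

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def project(a, v, singular_values=None):
    """Return A . V, or A . V . S^-1 when singular values are supplied."""
    x = matmul(a, v)
    if singular_values is None:
        return x
    # Dividing column j by s_j is the same as multiplying by S^-1.
    return [[x_ij / s for x_ij, s in zip(row, singular_values)] for row in x]

# Columns of V are the right singular vectors (orthonormal by construction).
V = [[0.6, -0.8],
     [0.8,  0.6]]
S = [10.0, 1.0]  # singular values, largest first

# A was built as U . S . V^T, with U columns (0.5, 0.5, 0.5, 0.5)
# and (0.5, -0.5, 0.5, -0.5).
A = [[2.6, 4.3],
     [3.4, 3.7],
     [2.6, 4.3],
     [3.4, 3.7]]

X_raw = project(A, V)        # rows are approximately (5.0, +/-0.5)
X_scaled = project(A, V, S)  # rows are approximately (0.5, +/-0.5)
```

Without the rescaling, every row of X_raw is dominated by its first coordinate, so a distance-based method like KMeans mostly sees the top direction. After dividing by the singular values, the second coordinate, which is what actually separates the two groups of rows here, carries equal weight.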
