That helps a lot, Jake. Thank you so much for your patience in answering all of our questions.
Regarding the negative eigenvalues issue, I obtained a similar result today: the smallest eigenvalue is negative (but very close to 0) and the largest is greater than 1 (~1.02). I should point out that this happened with artificially augmented data (I copied and pasted the same 4 rows hundreds of times to run some performance tests), so I think the problem may be related to that. Pedro, is your data artificially generated?

Lastly, I think it would help other people if these questions were moved to the Mahout wiki (maybe as some kind of FAQ), since they are becoming quite "frequent".

Thanks a lot again, Jake.

Best,
Fernando.

2010/11/23 Jake Mannix <[email protected]>

> On Tue, Nov 23, 2010 at 9:58 AM, Derek O'Callaghan
> <[email protected]> wrote:
>
> > Hi Jake,
> >
> > I have some related questions about the usage of the eigenvectors and
> > eigenvalues generated by Lanczos. They're more or less on-topic, so I
> > thought it'd be okay to post them here, but I can start a new thread if
> > you like. I've been going through some of the mails on the dev list
> > regarding the projection of a matrix onto an SVD basis generated by
> > Lanczos, in order to reduce the dimensionality of the matrix columns.
> > The new matrix is then passed to KMeans for clustering.
>
> Ok, sounds good.
>
> > From Jeff's mail above, and the code in TestClusterDumper, it seems like
> > the second multiplication by S^-1 is not performed/required, i.e. the
> > only step to project the original matrix A is:
> >
> > Reduced matrix X = A . V (or A . P, using Jeff's notation)
>
> I'm not sure about what is done in TestClusterDumper, but in general, to
> project the original rows of your matrix onto the reduced space defined
> by the decomposition, you do want to rescale by S^-1, or else you'll
> basically find that all of your rows seem to point in the direction of
> the largest eigenvector (that's why it's the largest eigenvector: most
> of the matrix points in its direction!).
>
> > The reduced matrix X can then be passed to KMeans for clustering. I
> > wanted to confirm whether this is correct, and whether the diagonal
> > matrix S (derived from the Lanczos-generated eigenvalues) can be
> > ignored when projecting the original matrix. Is this the reason why
> > Lanczos only persists the eigenvectors and discards the eigenvalues
> > (DistributedLanczosSolver.serializeOutput())?
>
> I don't think so. I think you do want the eigenvalues as well. Because
> Lanczos can sometimes have stability issues and end up with repeats of
> eigenvector/eigenvalue pairs, you need to do some checking on the
> output. This is done in the EigenVerificationJob class, which takes your
> original corpus and the supposed eigenvectors (it doesn't need the
> eigenvalues), throws away any duplicate or incorrect vectors, and
> recomputes the eigenvalues/singular values, storing them along with the
> vectors (see the method saveCleanEigens()).
>
> These recent discussions remind me that the EigenVerificationJob needs
> to be folded into the DistributedLanczosSolver, because it's confusing
> and nobody sees that they typically need to use it.
>
> -jake
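
For concreteness, here is what the rescaled projection Jake describes looks like. With a rank-k SVD A = U S V^T, projecting via X = A . V gives X = U S, so each row's i-th coordinate is stretched by the singular value s_i and the largest singular direction dominates; dividing by s_i recovers U. Below is a minimal sketch using plain arrays (the class and method names are hypothetical, not Mahout's actual Matrix API):

    // SvdProjection.java -- illustrative helper, not part of Mahout.
    public class SvdProjection {

      // Computes X = A . V . S^-1.
      // a: n x m corpus, v: m x k right singular vectors (as columns),
      // s: the k singular values (e.g. those written by saveCleanEigens()).
      public static double[][] project(double[][] a, double[][] v, double[] s) {
        int n = a.length, m = v.length, k = s.length;
        double[][] x = new double[n][k];
        for (int row = 0; row < n; row++) {
          for (int i = 0; i < k; i++) {
            double dot = 0.0;
            for (int col = 0; col < m; col++) {
              dot += a[row][col] * v[col][i];  // (A . V)[row][i]
            }
            x[row][i] = dot / s[i];            // the S^-1 rescaling
          }
        }
        return x;  // rows of X are what you hand to KMeans
      }
    }

Dropping the division by s[i] gives the plain X = A . V projection Derek asked about, which is exactly where the skew toward the largest eigenvector comes from.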

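And here is a sketch of the kind of cleaning Jake attributes to EigenVerificationJob: drop a candidate eigenvector when it is nearly parallel to one already kept. This only illustrates the idea (the names and threshold are made up), it is not Mahout's actual implementation:

    import java.util.ArrayList;
    import java.util.List;

    // EigenCleaner.java -- illustrative only, not Mahout's EigenVerificationJob.
    public class EigenCleaner {

      // Keeps an eigenvector only if its absolute cosine similarity to every
      // previously kept vector stays below maxCosine (e.g. 0.99).
      public static List<double[]> dropDuplicates(List<double[]> candidates,
                                                  double maxCosine) {
        List<double[]> kept = new ArrayList<>();
        for (double[] v : candidates) {
          boolean duplicate = false;
          for (double[] u : kept) {
            if (Math.abs(cosine(u, v)) > maxCosine) {
              duplicate = true;  // nearly parallel to an earlier vector
              break;
            }
          }
          if (!duplicate) {
            kept.add(v);
          }
        }
        return kept;
      }

      private static double cosine(double[] u, double[] v) {
        double dot = 0.0, uu = 0.0, vv = 0.0;
        for (int i = 0; i < u.length; i++) {
          dot += u[i] * v[i];
          uu += u[i] * u[i];
          vv += v[i] * v[i];
        }
        return dot / Math.sqrt(uu * vv);
      }
    }

The singular value for each surviving vector v can then be recomputed as |A v| / |v|, which is what allows the clean values to be stored alongside the clean vectors.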