Hi Jake,

Thanks for the clarification regarding S^-1. FYI, the EigenVerificationJob is already included in one of the DistributedLanczosSolver run() methods. I can see the EigenVector instances being written in EigenVerificationJob.saveCleanEigens(). However, when they're read back in TestClusterDumper.testKmeansSVD(), the vectors are actually DenseVector instances, not EigenVectors, so the associated eigenvalue is lost, as it's currently encapsulated in EigenVector.name. I think VectorWritable is just persisting DenseVectors and isn't aware of EigenVectors, but I'd need to dig a bit deeper to confirm.

I just wanted to confirm: should S be constructed using the square roots of the eigenvalues generated by Lanczos/EigenVerificationJob?

Thanks again,

Derek

On 23/11/10 22:03, Jake Mannix wrote:
Not sure about what is done in TestClusterDumper, but in general, to project
the original rows of your matrix onto the reduced space defined by the
decomposition, you do want to rescale by S^-1, or else you'll basically find
that all of your rows seem to point in the direction of the largest
eigenvector (that's why it's the largest eigenvector: most of the matrix
points in its direction!).
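
As a rough illustration of the rescaling described above, here is a minimal sketch in plain Python. It is not Mahout code: the helper names (dot, project, cosine) and the toy 2x2 matrix are made up for the example, and it assumes the eigenvalues are those of A^T A, so that the singular values are their square roots.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project(rows, eigenvectors, eigenvalues, rescale=True):
    """Project each row onto the eigenvectors; optionally apply S^-1."""
    # Assumption: eigenvalues belong to A^T A, so sigma_k = sqrt(lambda_k).
    sigmas = [math.sqrt(lam) for lam in eigenvalues]
    projected = []
    for row in rows:
        coords = [dot(row, v) for v in eigenvectors]
        if rescale:
            coords = [c / s for c, s in zip(coords, sigmas)]
        projected.append(coords)
    return projected

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

# Toy matrix A = U * S with U orthogonal and S = diag(10, 1), so that
# A^T A = diag(100, 1) and its eigenvectors are the standard basis vectors.
r = 1 / math.sqrt(2)
A = [[10 * r, 1 * r], [10 * r, -1 * r]]
V = [[1.0, 0.0], [0.0, 1.0]]
eigenvalues = [100.0, 1.0]

raw = project(A, V, eigenvalues, rescale=False)
scaled = project(A, V, eigenvalues, rescale=True)

print(cosine(raw[0], raw[1]))        # close to 1: both rows lean toward the largest eigenvector
print(cosine(scaled[0], scaled[1]))  # close to 0: the rows are separated after rescaling
```

Without the rescaling, the two rows look almost identical in the reduced space. After dividing by the singular values they are recovered as the orthogonal rows of U.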


and the reduced matrix X can then be passed to KMeans for clustering. I
wanted to confirm if this is correct, and that the S (derived from the
Lanczos-generated eigenvalues) diagonal matrix can be ignored when
projecting the original matrix? Is this the reason why Lanczos only persists
the eigenvectors, and discards the eigenvalues
(DistributedLanczosSolver.serializeOutput())?

I don't think so.  I think you do want the eigenvalues as well.  Because
Lanczos can sometimes have stability issues, and end up with repeats of
eigenvector/eigenvalue pairs, you need to do some checking on the output.
This is done in the EigenVerificationJob class, which takes your original
corpus, and the supposed eigenvectors (doesn't need the eigenvalues), and
throws away any duplicates or incorrect vectors, and recomputes the
eigenvalues/singular values and indeed stores them as well as the vectors
(see the method saveCleanEigens() ).
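
To make the kind of check described above concrete, here is a small sketch in plain Python. It is not the actual EigenVerificationJob logic: the function names, the error measure (1 minus the cosine between v and A^T A v), and the thresholds are all assumptions chosen for illustration, and the eigenvalue is recomputed as a Rayleigh quotient on A^T A.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def gram_times(rows, v):
    """Compute (A^T A) v as the sum over rows of (row . v) * row."""
    result = [0.0] * len(v)
    for row in rows:
        w = dot(row, v)
        for j, a in enumerate(row):
            result[j] += w * a
    return result

def clean_eigens(rows, candidates, max_error=1e-6, max_overlap=0.99):
    """Keep candidates that behave like eigenvectors and are not duplicates.

    Returns a list of (vector, recomputed_eigenvalue) pairs.
    Thresholds are hypothetical defaults, not values taken from Mahout.
    """
    kept = []
    for v in candidates:
        av = gram_times(rows, v)
        norm_v = math.sqrt(dot(v, v))
        norm_av = math.sqrt(dot(av, av))
        if norm_v == 0.0 or norm_av == 0.0:
            continue  # zero vector or zero eigenvalue: dropped in this sketch
        # A true eigenvector is parallel to (A^T A) v, so the error is ~0.
        error = 1.0 - abs(dot(v, av)) / (norm_v * norm_av)
        if error > max_error:
            continue
        # Drop repeats of an eigenvector that has already been kept.
        if any(abs(dot(v, u)) / (norm_v * math.sqrt(dot(u, u))) > max_overlap
               for u, _ in kept):
            continue
        eigenvalue = dot(v, av) / dot(v, v)  # Rayleigh quotient
        kept.append((v, eigenvalue))
    return kept

# Toy corpus with A^T A = diag(100, 1), plus one duplicate and one bad candidate.
r = 1 / math.sqrt(2)
A = [[10 * r, 1 * r], [10 * r, -1 * r]]
candidates = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [r, r]]

kept = clean_eigens(A, candidates)
for vector, eigenvalue in kept:
    print(vector, eigenvalue)
```

In this toy run the repeated vector and the non-eigenvector [r, r] are discarded, and the two surviving vectors come back paired with eigenvalues recomputed from the original rows, so the eigenvalues supplied with the candidates are never needed.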

These recent discussions remind me that the EigenVerificationJob needs to
be just folded into the DistributedLanczosSolver, because it's confusing and
nobody sees that they typically need to use it.

   -jake
