[ https://issues.apache.org/jira/browse/MAHOUT-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142629#comment-13142629 ]
Jake Mannix commented on MAHOUT-524:
------------------------------------

I don't really know anything about the way that SKMD works, so all I can weigh in on is what's going on in Lanczos:

You take an input matrix with some number of rows (this number doesn't matter and doesn't show up anywhere) and numCols columns (this number matters a lot). You want desiredRank eigenvectors to pop out in the end. So you start with some initial basisVector (number 0), and you iterate again and again: take your input corpus.timesSquared(basisIminusOne) (the resultant vector is of size numCols), do some orthogonalization against the previous vectors, and hang onto this vector. Eventually you have desiredRank basisVectors, arranged in the LanczosState object in a Map<Integer,Vector> (it could certainly be a Matrix, and effectively it is one, but we're just hanging onto the vectors before building a matrix soon enough).

Meanwhile, we're building up a desiredRank x desiredRank tridiagonal (i.e. very sparse) matrix using these basis vectors and their inner products.

Now we ask COLT for the eigenvectors and eigenvalues of the tridiagonal matrix. There will be desiredRank eigenvalues and desiredRank eigenvectors (each of dimension desiredRank).

Here we get to where you're getting an NPE. We walk along the desiredRank^2 values in the eigenvector matrix ("eigenVects"), and for each of 0...desiredRank, we grab the basisVector (we have desiredRank of them, each of size numCols) and add a linear multiple of it onto something which will be the final eigenvector we return at the end of the day.

What is SKMD doing?

[code]
LanczosState state = new LanczosState(L, overshoot, numDims, solver.getInitialVector(L));
Path lanczosSeqFiles = new Path(outputCalc, "eigenvectors-" + (System.nanoTime() & 0xFF));
solver.runJob(conf, state, overshoot, true, lanczosSeqFiles.toString());
[code]

We're making a LanczosState specifying numCols = overshoot and desiredRank = numDims. Then we run the solver with desiredRank = overshoot.
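
To make the bookkeeping described above concrete, here is a rough sketch in plain Java. It is not Mahout's code: LanczosSketch, run and combine are made-up names, it uses dense double[] arrays instead of Mahout's Vector, and the eigen-solve of the tridiagonal matrix (the COLT step) is left out. It only assumes that basis vectors are kept in a map by index and later combined using coefficients from that eigen-solve.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the solver state; not Mahout's LanczosState.
final class LanczosSketch {
    final Map<Integer, double[]> basis = new HashMap<>(); // index -> basis vector of size numCols
    final double[] diag;    // diagonal of the tridiagonal matrix
    final double[] offDiag; // offDiag[i] couples basis vectors i-1 and i; offDiag[0] is unused

    LanczosSketch(int desiredRank) {
        diag = new double[desiredRank];
        offDiag = new double[desiredRank];
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    static void scale(double[] v, double factor) {
        for (int i = 0; i < v.length; i++) {
            v[i] *= factor;
        }
    }

    // corpus^T * (corpus * v): the row count drops out, the result has numCols entries.
    static double[] timesSquared(double[][] corpus, double[] v) {
        double[] out = new double[v.length];
        for (double[] row : corpus) {
            double d = dot(row, v);
            for (int c = 0; c < v.length; c++) {
                out[c] += d * row[c];
            }
        }
        return out;
    }

    static LanczosSketch run(double[][] corpus, double[] initial, int desiredRank) {
        LanczosSketch state = new LanczosSketch(desiredRank);
        double[] first = initial.clone();
        scale(first, 1.0 / Math.sqrt(dot(first, first)));
        state.basis.put(0, first);
        for (int i = 0; i < desiredRank; i++) {
            double[] current = state.basis.get(i);
            double[] next = timesSquared(corpus, current);
            state.diag[i] = dot(next, current);
            // Orthogonalize against every basis vector kept so far.
            for (int j = 0; j <= i; j++) {
                double[] b = state.basis.get(j);
                double proj = dot(next, b);
                for (int c = 0; c < next.length; c++) {
                    next[c] -= proj * b[c];
                }
            }
            double norm = Math.sqrt(dot(next, next));
            if (i + 1 == desiredRank || norm < 1e-12) {
                break; // enough vectors, or nothing new left to add
            }
            state.offDiag[i + 1] = norm;
            scale(next, 1.0 / norm);
            state.basis.put(i + 1, next);
        }
        return state;
    }

    // Adds coeffs[j] * basis[j] for each j; coeffs plays the role of one
    // eigenvector of the tridiagonal matrix.
    double[] combine(double[] coeffs) {
        double[] result = new double[basis.get(0).length];
        for (int j = 0; j < coeffs.length; j++) {
            double[] b = basis.get(j);
            if (b == null) { // a naive version would hit a NullPointerException here
                throw new IllegalStateException(
                    "no basis vector " + j + "; only " + basis.size() + " were built");
            }
            for (int c = 0; c < result.length; c++) {
                result[c] += coeffs[j] * b[c];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] corpus = {{2, 1, 0}, {1, 3, 1}, {0, 1, 4}};
        LanczosSketch state = run(corpus, new double[] {1, 1, 1}, 2);
        System.out.println("basis vectors built: " + state.basis.size());
        try {
            state.combine(new double[] {0.5, 0.5, 0.5}); // three coefficients, two basis vectors
        } catch (IllegalStateException e) {
            System.out.println("mismatch: " + e.getMessage());
        }
    }
}
```

In this sketch, a state built for one rank and then asked to combine more vectors than it holds fails at the map lookup. Whether that is exactly what happens inside Mahout is an assumption, but it is the pattern the SKMD call above sets up, with numDims given to the state and overshoot given to the solver.
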
Looks like this is inconsistent: shouldn't the desiredRank be the same in both places?

> DisplaySpectralKMeans example fails
> -----------------------------------
>
> Key: MAHOUT-524
> URL: https://issues.apache.org/jira/browse/MAHOUT-524
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.4, 0.5
> Reporter: Jeff Eastman
> Assignee: Shannon Quinn
> Labels: clustering, k-means, visualization
> Fix For: 0.6
>
> Attachments: EclipseLog_20110918.txt, MAHOUT-524.patch,
> MAHOUT-524.patch, SpectralKMeans_fail_20110919.txt, aff.txt, raw.txt,
> screenshot-1.jpg, spectralkmeans.png
>
>
> I've committed a new display example that attempts to push the standard
> mixture of models data set through spectral k-means. After some tweaking of
> configuration arguments and a bug fix in EigenCleanupJob it runs spectral
> k-means to completion. The display example is expecting 2-d clustered points
> and the example is producing 5-d points. Additional I/O work is needed before
> this will play with the rest of the clustering algorithms.