[ 
https://issues.apache.org/jira/browse/MAHOUT-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142629#comment-13142629
 ] 

Jake Mannix commented on MAHOUT-524:
------------------------------------

I don't really know anything about the way that SKMD works, so all I can weigh 
in on is what's going on in Lanczos:

You take an input matrix with some number of rows (this number doesn't matter 
and doesn't show up anywhere) and numCols columns (this number matters a lot).  
You want desiredRank eigenvectors to pop out in the end.  So you start with some 
initial basisVector (number 0), and you iterate again and again: compute 
corpus.timesSquared(basisIminusOne) on your input (the resultant vector is of 
size numCols), orthogonalize it against the previous vectors, and hang onto 
this vector.
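
The iteration described above can be sketched in plain Java. This is only an 
illustration on double[] arrays, not Mahout's LanczosSolver: the class name 
LanczosBasisSketch and all of its methods are made up for this example, 
timesSquared is assumed to mean A^T (A v), and the orthogonalization is a 
simple Gram-Schmidt pass against every earlier vector.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Mahout.
public class LanczosBasisSketch {

  /** Computes A^T (A v). The result has numCols entries, however many rows A has. */
  public static double[] timesSquared(double[][] a, double[] v) {
    double[] result = new double[v.length];
    for (double[] row : a) {
      double rowDotV = dot(row, v);
      for (int j = 0; j < v.length; j++) {
        result[j] += rowDotV * row[j];
      }
    }
    return result;
  }

  public static double dot(double[] x, double[] y) {
    double sum = 0.0;
    for (int j = 0; j < x.length; j++) {
      sum += x[j] * y[j];
    }
    return sum;
  }

  /** Scales v in place to unit length and returns it. */
  private static double[] normalize(double[] v) {
    double norm = Math.sqrt(dot(v, v));
    if (norm < 1e-12) {
      throw new IllegalStateException("basis vector vanished; cannot reach desiredRank");
    }
    for (int j = 0; j < v.length; j++) {
      v[j] /= norm;
    }
    return v;
  }

  /** Returns desiredRank orthonormal vectors, each of size numCols. */
  public static List<double[]> buildBasis(double[][] a, int desiredRank, double[] initial) {
    List<double[]> basis = new ArrayList<>();
    basis.add(normalize(initial.clone()));
    while (basis.size() < desiredRank) {
      double[] next = timesSquared(a, basis.get(basis.size() - 1));
      // Orthogonalize against every vector collected so far.
      for (double[] previous : basis) {
        double projection = dot(next, previous);
        for (int j = 0; j < next.length; j++) {
          next[j] -= projection * previous[j];
        }
      }
      basis.add(normalize(next));
    }
    return basis;
  }
}
```

Note that the number of rows of the input never appears in the size of any 
basis vector; only numCols (the length of the initial vector) does.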

Eventually you have desiredRank basisVectors, held in the LanczosState 
object in a Map<Integer,Vector> (it could certainly be a Matrix, and 
conceptually it is one, but we're just hanging onto the vectors before building 
a matrix soon enough).  Meanwhile, we're building up a desiredRank x 
desiredRank tridiagonal (i.e. very sparse) matrix using these basis vectors and 
their inner products.
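
To make the shape of that small matrix concrete, here is a minimal sketch. It 
assumes the inner products have already been collected into two hypothetical 
arrays, diagonal (desiredRank entries) and offDiagonal (desiredRank - 1 
entries); the class TridiagonalSketch is invented for illustration and is not 
how Mahout stores the matrix.

```java
// Hypothetical illustration only; not part of Mahout.
public class TridiagonalSketch {

  /** Assembles a dense, symmetric desiredRank x desiredRank tridiagonal matrix. */
  public static double[][] assemble(double[] diagonal, double[] offDiagonal) {
    int rank = diagonal.length;
    if (offDiagonal.length != Math.max(rank - 1, 0)) {
      throw new IllegalArgumentException("offDiagonal must have rank - 1 entries");
    }
    double[][] t = new double[rank][rank];
    for (int i = 0; i < rank; i++) {
      t[i][i] = diagonal[i];
      if (i + 1 < rank) {
        // The same value sits just above and just below the diagonal.
        t[i][i + 1] = offDiagonal[i];
        t[i + 1][i] = offDiagonal[i];
      }
    }
    return t;
  }
}
```

Everything more than one step away from the diagonal stays zero, which is why 
the matrix is very sparse.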

Now we ask COLT for the eigenvectors and eigenvalues of the tridiagonal 
matrix.  There will be desiredRank eigenvalues and desiredRank eigenvectors 
(each of dimension desiredRank).

Here we get to where you're getting an NPE.  We walk along the desiredRank^2 
values in the eigenvector matrix ("eigenVects"), and for each index in 0 ... 
desiredRank - 1, we grab the basisVector (we have desiredRank of them, each of 
size numCols) and add a multiple of it onto what will become the final 
eigenvector we return.
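
A rough sketch of that final loop, again in plain Java rather than Mahout's 
Vector types. The class EigenvectorCombineSketch is hypothetical, and it 
assumes column i of eigenVects holds the i-th eigenvector of the tridiagonal 
matrix. It also shows one plausible way a null could turn up: if the map 
holds fewer basis vectors than the desiredRank the loop runs over, 
basis.get(j) returns null.

```java
import java.util.Map;

// Hypothetical illustration only; not part of Mahout.
public class EigenvectorCombineSketch {

  /** Builds desiredRank output vectors, each a weighted sum of the basis vectors. */
  public static double[][] combine(Map<Integer, double[]> basis, double[][] eigenVects,
                                   int desiredRank, int numCols) {
    double[][] result = new double[desiredRank][numCols];
    for (int i = 0; i < desiredRank; i++) {      // one output eigenvector per column i
      for (int j = 0; j < desiredRank; j++) {    // one term per basis vector j
        double[] basisVector = basis.get(j);
        if (basisVector == null) {
          // Without this check, the inner loop would throw a NullPointerException.
          throw new IllegalStateException("no basis vector " + j
              + "; only " + basis.size() + " available for desiredRank " + desiredRank);
        }
        double weight = eigenVects[j][i];
        for (int c = 0; c < numCols; c++) {
          result[i][c] += weight * basisVector[c];
        }
      }
    }
    return result;
  }
}
```

Calling combine with a desiredRank larger than the number of stored basis 
vectors fails on the first missing index, which is the kind of mismatch to 
look for.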

What is SKMD doing?  

[code]
    LanczosState state = new LanczosState(L, overshoot, numDims,
                                          solver.getInitialVector(L));
    Path lanczosSeqFiles = new Path(outputCalc,
                                    "eigenvectors-" + (System.nanoTime() & 0xFF));
    solver.runJob(conf,
                  state,
                  overshoot,
                  true,
                  lanczosSeqFiles.toString());
[code]

We're making a LanczosState with numCols = overshoot and desiredRank = 
numDims.

Then we run the solver with desiredRank = overshoot.

This looks inconsistent: shouldn't the desiredRank be the same in both places?
                
> DisplaySpectralKMeans example fails
> -----------------------------------
>
>                 Key: MAHOUT-524
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-524
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.4, 0.5
>            Reporter: Jeff Eastman
>            Assignee: Shannon Quinn
>              Labels: clustering, k-means, visualization
>             Fix For: 0.6
>
>         Attachments: EclipseLog_20110918.txt, MAHOUT-524.patch, 
> MAHOUT-524.patch, SpectralKMeans_fail_20110919.txt, aff.txt, raw.txt, 
> screenshot-1.jpg, spectralkmeans.png
>
>
> I've committed a new display example that attempts to push the standard 
> mixture of models data set through spectral k-means. After some tweaking of 
> configuration arguments and a bug fix in EigenCleanupJob it runs spectral 
> k-means to completion. The display example is expecting 2-d clustered points 
> and the example is producing 5-d points. Additional I/O work is needed before 
> this will play with the rest of the clustering algorithms. 

        
