[ 
https://issues.apache.org/jira/browse/MAHOUT-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143510#comment-13143510
 ] 

Shannon Quinn edited comment on MAHOUT-524 at 11/3/11 8:33 PM:
---------------------------------------------------------------

After implementing the same code in Python, my suspicion is that the K-means 
clusters produced at the end of the spectral algorithm are what's throwing off 
the final results shown in DisplaySKM. Regular K-means runs on the spectral 
data (the top k eigenvectors of the affinities) rather than on the original 
data. I don't know K-means well enough to say for sure, but my guess is that 
all the distance measurements in its output are relative to the spectral data, 
not the original data. So the circles you see in the end-result graph are 
drawn around where the spectral data are.

That'd be my first guess, anyway. I'm working on a couple things to help with 
this: a sequential version of spectral k-means, and a job to read raw data 
(text format: whitespace or comma-separated n-dimensional points) and convert 
it to affinities (a la issue 518, finally!). Hopefully these will help diagnose 
spectral k-means.
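
To make the raw-data-to-affinities step concrete, here is a rough Python sketch. 
It is not the Mahout job itself: the function names, the Gaussian kernel, the 
sigma parameter, and the (i, j, value) output triples are all assumptions made 
for illustration.

```python
import math
import re


def parse_points(text):
    """Parse one point per line; coordinates are split on whitespace or commas."""
    points = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        points.append([float(tok) for tok in re.split(r"[,\s]+", line)])
    return points


def affinities(points, sigma=1.0):
    """Return (i, j, affinity) triples for every pair of points.

    The Gaussian kernel and the zero self-affinity are assumptions,
    not something the issue specifies.
    """
    triples = []
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i == j:
                value = 0.0  # assumed: no self-affinity
            else:
                sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
                value = math.exp(-sq_dist / (2.0 * sigma ** 2))
            triples.append((i, j, value))
    return triples
```

The all-pairs loop is quadratic in the number of points, which is fine for a 
small diagnostic data set but would need a different approach at scale.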

But if it is a data issue, I'm not sure how we can translate the distance 
measurements on the spectral data back onto the original data for the 
DisplaySKM code. I would argue, though, that since spectral k-means doesn't 
operate on the same GMM-type basis that regular K-means does, overlaying K 
Gaussians isn't really what we want here anyway. If at all possible, my 
suggestion would be to use colored dots to indicate the clusters.
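
A rough Python sketch of the colored-dots idea follows. It assumes that row i 
of the spectral data still corresponds to original point i, so the cluster 
labels K-means assigns to the spectral rows can be applied to the original 
points by index, with no distances translated back. The function name and 
palette are hypothetical and not part of DisplaySKM.

```python
def color_points(original_points, labels,
                 palette=("red", "green", "blue", "orange", "purple")):
    """Pair each original point with a color chosen by its cluster label.

    Assumes labels[i] is the cluster that K-means assigned to row i of the
    spectral data, and that row i corresponds to original_points[i].
    """
    if len(original_points) != len(labels):
        raise ValueError("need exactly one label per original point")
    return [
        (tuple(point), palette[label % len(palette)])
        for point, label in zip(original_points, labels)
    ]
```

Each (point, color) pair can then be drawn as a dot at the original 2-d 
coordinates, which avoids any cluster centers or radii measured in the 
spectral space.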
                
> DisplaySpectralKMeans example fails
> -----------------------------------
>
>                 Key: MAHOUT-524
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-524
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.4, 0.5
>            Reporter: Jeff Eastman
>            Assignee: Shannon Quinn
>              Labels: clustering, k-means, visualization
>             Fix For: 0.6
>
>         Attachments: EclipseLog_20110918.txt, MAHOUT-524.patch, 
> MAHOUT-524.patch, SpectralKMeans_fail_20110919.txt, aff.txt, raw.txt, 
> screenshot-1.jpg, spectralkmeans.png
>
>
> I've committed a new display example that attempts to push the standard 
> mixture of models data set through spectral k-means. After some tweaking of 
> configuration arguments and a bug fix in EigenCleanupJob it runs spectral 
> k-means to completion. The display example is expecting 2-d clustered points 
> and the example is producing 5-d points. Additional I/O work is needed before 
> this will play with the rest of the clustering algorithms. 
