[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

Yiqun Hu (JIRA) Thu, 06 Jun 2013 23:33:46 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677854#comment-13677854
 ]


Yiqun Hu commented on MAHOUT-1214:
----------------------------------

How to test it:
1. Check out v0.7 from SVN, then download and apply SpectralKMeans.patch;
2. Recompile Mahout;
3. Download either matrix_1 or matrix_2 and upload it to HDFS (e.g. to a folder named input);
4. Start the spectral k-means job:
   bin/mahout spectralkmeans -i input -o output -d 7 -k 3 -x 100 -cd 0.001 -ow

The result is both printed to the console and stored in the output folder.
                
> Improve the accuracy of the Spectral KMeans Method
> --------------------------------------------------
>
>                 Key: MAHOUT-1214
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.7
>         Environment: Mahout 0.7
>            Reporter: Yiqun Hu
>              Labels: clustering, improvement
>             Fix For: Backlog
>
>         Attachments: matrix_1, matrix_2, SpectralKMeans.patch
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng et 
> al., NIPS 2002) in version 0.7 has two serious issues. These two 
> implementation errors make it fail even on a trivial dataset. We have 
> implemented a fix for both issues and hope to contribute it back to the 
> community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> the eigenvectors, which is necessary to obtain correct clustering results 
> when K>1. We have an idea and an implementation that selects eigenvectors 
> based on cosAngle/orthogonality.
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and a 
> bad initialization sometimes produces a wrong clustering result. In this 
> case, the K selected eigenvectors actually provide a better way to initialize 
> the cluster centroids, because each selected eigenvector is a relaxed 
> indicator of the membership of one cluster. For every selected eigenvector, 
> we use the data point whose eigen component achieves the maximum absolute 
> value. 
> We have already verified our improvement on a synthetic dataset: the 
> improved version gets the optimal clustering result, while the current 0.7 
> version obtains a wrong result.
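
The two ideas in the issue description can be sketched roughly as follows. This is a minimal illustration in plain Java, not the code from SpectralKMeans.patch and not Mahout API: the class name SpectralInitSketch, the helper names, the greedy selection strategy, and the maxAbsCos threshold are all assumptions made for the example. Eigenvectors are assumed to be nonzero and of equal length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from Mahout or the patch.
final class SpectralInitSketch {

    // Cosine of the angle between two equal-length, nonzero vectors.
    public static double cosAngle(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Issue 1 (assumed greedy variant): walk the candidates in the given
    // order and keep one only if |cos| against every kept vector is at most
    // maxAbsCos. Stops once k vectors are kept; may return fewer than k.
    public static List<double[]> selectOrthogonal(
            List<double[]> candidates, int k, double maxAbsCos) {
        List<double[]> kept = new ArrayList<>();
        for (double[] candidate : candidates) {
            if (kept.size() == k) {
                break;
            }
            boolean nearlyOrthogonal = true;
            for (double[] selected : kept) {
                if (Math.abs(cosAngle(candidate, selected)) > maxAbsCos) {
                    nearlyOrthogonal = false;
                    break;
                }
            }
            if (nearlyOrthogonal) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    // Index of the component with the largest absolute value
    // (first such index on ties).
    public static int argMaxAbs(double[] v) {
        int best = 0;
        for (int i = 1; i < v.length; i++) {
            if (Math.abs(v[i]) > Math.abs(v[best])) {
                best = i;
            }
        }
        return best;
    }

    // Issue 2: for each selected eigenvector, pick the data point index whose
    // eigen component has the maximum absolute value as a centroid seed.
    public static int[] seedIndices(List<double[]> eigenvectors) {
        int[] seeds = new int[eigenvectors.size()];
        for (int j = 0; j < seeds.length; j++) {
            seeds[j] = argMaxAbs(eigenvectors.get(j));
        }
        return seeds;
    }
}
```

With such a sketch, one would call selectOrthogonal on the candidate eigenvectors and pass the result to seedIndices; each returned index names the data point to use as an initial centroid for one cluster. Two eigenvectors could in principle yield the same index, which this sketch does not handle.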

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
