[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692589#comment-13692589
 ] 

Yiqun Hu commented on MAHOUT-1214:
----------------------------------

Hi, Robin,
We have also responded to your comments about why a new input format is used.
Please check our response on Review Board. The reason is that we are
introducing new support for spectral k-means in Mahout: users can specify
affinities between data points using any data identifier. We believe this is
a significant feature for Mahout users. Imagine having to specify pairwise
affinities for petabytes of data: asking users to first map every data point
to a row/column ID is inconvenient.
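
To make the inconvenience concrete, here is a minimal single-machine sketch of
the mapping step users would otherwise have to do themselves. It is not the
code in the patch; the function name index_affinities and the
(id_a, id_b, affinity) triple layout are assumptions made for illustration.

```python
def index_affinities(triples):
    """Map arbitrary identifiers to dense row/column indices.

    triples: iterable of (id_a, id_b, affinity); the ids may be any
    hashable values (strings, tuples, ...). This layout is hypothetical.
    Returns (index_of, entries), where index_of maps id -> int and
    entries is a list of (row, col, affinity).
    """
    index_of = {}
    entries = []
    for id_a, id_b, affinity in triples:
        # Assign the next free index the first time an id is seen.
        row = index_of.setdefault(id_a, len(index_of))
        col = index_of.setdefault(id_b, len(index_of))
        entries.append((row, col, affinity))
    return index_of, entries


index_of, entries = index_affinities([
    ("alice", "bob", 0.9),
    ("bob", "carol", 0.4),
    ("alice", "carol", 0.1),
])
```

An in-memory dictionary like this only works when all identifiers fit on one
machine, which is part of why pushing the mapping onto users is a burden at
large scale.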

We have responded to the comments and are waiting for further discussion.
There are two options here. One, if there is a way to implement this support
with the standard input format, please suggest it, because we believe it is
not possible. Two, if you think this support is not useful, we don't mind
removing it and keeping it to ourselves.

Again, we need discussion to move forward.

> Improve the accuracy of the Spectral KMeans Method
> --------------------------------------------------
>
>                 Key: MAHOUT-1214
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.7
>         Environment: Mahout 0.7
>            Reporter: Yiqun Hu
>            Assignee: Robin Anil
>              Labels: clustering, improvement
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Ng et al.,
> NIPS 2002) in version 0.7 has two serious issues. These two implementation
> errors make it fail even on a trivial dataset. We have implemented a
> solution that resolves both issues and hope to contribute it back to the
> community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of
> the eigenvectors, which is necessary to obtain correct clustering results
> when K>1. We have an idea and an implementation that select eigenvectors
> based on cosAngle/orthogonality.
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and
> a bad initialization will sometimes produce a wrong clustering result. In
> this case, the selected K eigenvectors provide a better way to initialize
> the cluster centroids, because each selected eigenvector is a relaxed
> indicator of the memberships of one cluster. For every selected eigenvector,
> we use the data point whose eigen component achieves the maximum absolute
> value as the initial centroid.
> We have already verified our improvement on a synthetic dataset: the
> improved version gets the optimal clustering result, while the current 0.7
> version obtains the wrong result.
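
The two ideas in the issue description can be sketched roughly as follows.
This is an illustration only, not the code in the attached patch: the greedy
selection order, the max_abs_cos threshold, and all function names are
assumptions, since the description only says that selection is based on
cosAngle/orthogonality and that seeds come from the maximum absolute
component.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_orthogonal(eigenvectors, k, max_abs_cos=1e-6):
    """Greedily keep up to k vectors that are nearly orthogonal to each other.

    A candidate is kept only if |cos| against every already kept vector is
    at most max_abs_cos (a hypothetical tolerance).
    """
    selected = []
    for vec in eigenvectors:
        if all(abs(cosine(vec, kept)) <= max_abs_cos for kept in selected):
            selected.append(vec)
        if len(selected) == k:
            break
    return selected


def seed_indices(selected):
    """For each eigenvector, the index of its largest-magnitude component.

    That index identifies the data point to use as an initial centroid.
    """
    return [max(range(len(vec)), key=lambda i: abs(vec[i])) for vec in selected]


vectors = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],   # almost parallel to the first, so rejected
    [0.0, 0.0, 0.2, -0.8],
]
chosen = select_orthogonal(vectors, k=2, max_abs_cos=0.1)
seeds = seed_indices(chosen)
```

Here the second vector is skipped because it is nearly parallel to the first,
and the seeds are data points 0 and 3. The sketch does not handle the case
where two eigenvectors pick the same data point.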
