[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-24 Thread zhang da (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhang da updated MAHOUT-1214:
-

Attachment: MAHOUT-1214.patch

removed the string input format, let's not include in this fix then.

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-24 Thread zhang da (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhang da updated MAHOUT-1214:
-

Attachment: (was: MAHOUT-1214.patch)

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-19 Thread zhang da (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhang da updated MAHOUT-1214:
-

Attachment: MAHOUT-1214.patch

updated version, same uploaded to reviewboard

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-19 Thread zhang da (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhang da updated MAHOUT-1214:
-

Attachment: (was: MAHOUT-1214.patch)

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-18 Thread zhang da (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686529#comment-13686529
 ] 

zhang da commented on MAHOUT-1214:
--

i believe the dot product is a false alarm and the problem is in our patch. let 
me fix it and update the patch tonight.

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-17 Thread zhang da (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686355#comment-13686355
 ] 

zhang da commented on MAHOUT-1214:
--

thanks. Review Request #11931

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-17 Thread zhang da (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686341#comment-13686341
 ] 

zhang da commented on MAHOUT-1214:
--

i uploaded the patch to the reviewboard, but it has to be assigned to someone 
or some group as reviewer before i can publish, who am i supposed to assign to? 

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-16 Thread zhang da (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhang da updated MAHOUT-1214:
-

Attachment: MAHOUT-1214.patch

patch uploaded

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-13 Thread zhang da (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682240#comment-13682240
 ] 

zhang da commented on MAHOUT-1214:
--

hi, I finish the patch against v0.8 head today, but somehow it breaks one of 
the DistributedLanczosSolver unit test case. Think need a day or two to fix 
that. i'll keep you posted.

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-13 Thread zhang da (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhang da updated MAHOUT-1214:
-

Attachment: (was: MAHOUT-1214.patch)

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-12 Thread zhang da (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhang da updated MAHOUT-1214:
-

Attachment: MAHOUT-1214.patch

1. removed the commented out codes
2. used the latest formatter
3. add in unit tests, TestAffinityMatrixGenericInputJob.java
TestEigenSeedGenerator.java

it's based on v0.7 branch 

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, matrix_1, matrix_2, 
> SpectralKMeans.patch
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-09 Thread zhang da (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679340#comment-13679340
 ] 

zhang da commented on MAHOUT-1214:
--

regarding to point 2, I notice that the project(version 0.7 branch) is not 100% 
following the lucene coding style. I could format the files I have modified in 
eclipse, but that would introduce quite a number of unnecessary discrepancies. 
Is that ok?

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: matrix_1, matrix_2, SpectralKMeans.patch
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira