[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-27 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1214:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Ng et al., 
> NIPS 2002) in version 0.7 has two serious issues. These two implementation 
> errors make it fail even on a trivial dataset. We have 
> implemented a solution that resolves both issues and hope to contribute 
> it back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain correct clustering results 
> when K>1. We have an idea and an implementation that select eigenvectors 
> based on cosAngle/orthogonality.
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and 
> a bad initialization will sometimes generate a wrong clustering result. In 
> this case, the K selected eigenvectors actually provide a better way to 
> initialize cluster centroids because each selected eigenvector is a relaxed 
> indicator of the memberships of one cluster. For every selected eigenvector, 
> we use the data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on a synthetic dataset: 
> the improved version gets the optimal clustering result while the current 
> 0.7 version obtains the wrong result.
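
The two ideas in the description can be sketched in plain Java. This is only an illustration under assumptions, not the code from MAHOUT-1214.patch: the class name SpectralInitSketch, the greedy selection order, and the tolerance parameter are all hypothetical, and eigenvectors are represented as plain double[] arrays rather than Mahout vectors.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the two fixes described above; not the actual patch. */
final class SpectralInitSketch {

  /** Cosine of the angle between two equal-length vectors (0 if either is all zeros). */
  static double cosAngle(double[] a, double[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0) {
      return 0.0;
    }
    return dot / Math.sqrt(normA * normB);
  }

  /**
   * Issue 1 (assumed strategy): walk the candidates in order and keep one only
   * if |cosAngle| against every already-kept vector is within the tolerance.
   * Stops once k vectors have been kept.
   */
  static List<double[]> selectOrthogonal(List<double[]> candidates, int k, double tolerance) {
    List<double[]> kept = new ArrayList<>();
    for (double[] candidate : candidates) {
      if (kept.size() == k) {
        break;
      }
      boolean orthogonal = true;
      for (double[] v : kept) {
        if (Math.abs(cosAngle(candidate, v)) > tolerance) {
          orthogonal = false;
          break;
        }
      }
      if (orthogonal) {
        kept.add(candidate);
      }
    }
    return kept;
  }

  /**
   * Issue 2: index of the component with the largest absolute value.
   * The data point at this index seeds the centroid for this eigenvector.
   */
  static int seedIndex(double[] eigenvector) {
    int best = 0;
    for (int i = 1; i < eigenvector.length; i++) {
      if (Math.abs(eigenvector[i]) > Math.abs(eigenvector[best])) {
        best = i;
      }
    }
    return best;
  }
}
```

With K selected eigenvectors, calling seedIndex on each one yields K data-point indices to use as initial centroids in place of random seeds.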

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (MAHOUT-1270) Broken link on Developer Resources page

2013-06-27 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-1270.


Resolution: Fixed
  Assignee: Robin Anil

> Broken link on Developer Resources page
> ---
>
> Key: MAHOUT-1270
> URL: https://issues.apache.org/jira/browse/MAHOUT-1270
> Project: Mahout
>  Issue Type: Bug
>  Components: Website
>Reporter: Erhan Bagdemir
>Assignee: Robin Anil
>Priority: Minor
>
> The link "How to contribute" on the page
> https://cwiki.apache.org/confluence/display/MAHOUT/Developer+Resources
> is broken :-| 
> https://cwiki.apache.org/MAHOUT/how-to-contribute.html returns 404. 



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-24 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692518#comment-13692518
 ] 

Robin Anil commented on MAHOUT-1214:


https://reviews.apache.org/r/11931/

I have actually replied to your comments. My comment still stands with respect 
to using a non-standard input format. Grant, can you take a look as well?

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-17 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686349#comment-13686349
 ] 

Robin Anil commented on MAHOUT-1214:


Ignore that. That was an issue in the test. The code seems correct.

Zhang, you can assign it to me.

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-17 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686343#comment-13686343
 ] 

Robin Anil commented on MAHOUT-1214:


[~dfilimon] Can you take a look at the VectorBinaryAggregate code?

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2


[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-17 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1214:
---

Attachment: MAHOUT-1214.patch

Test case which shows the bug in the new AggregateBinaryFunction

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-17 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686338#comment-13686338
 ] 

Robin Anil commented on MAHOUT-1214:


size() == cardinality, which is the maximum size, not the size of the hashmap.

For the vector {3: 0.11, 4: 0.38, 5: 0.2}, the cardinality has to be at least 6.
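
A minimal sketch of the distinction, using a hypothetical stand-in class (CardinalitySketch is not a Mahout type, and numStored() is an illustrative name): size() reports the declared cardinality, while the backing hashmap only holds the entries that were set.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a hashmap-backed sparse vector; not Mahout code. */
final class CardinalitySketch {
  private final int cardinality;
  private final Map<Integer, Double> values = new HashMap<>();

  CardinalitySketch(int cardinality) {
    this.cardinality = cardinality;
  }

  /** Valid indices are 0 .. cardinality - 1, so index 5 needs cardinality >= 6. */
  void set(int index, double value) {
    if (index < 0 || index >= cardinality) {
      throw new IndexOutOfBoundsException(
          "index " + index + " outside cardinality " + cardinality);
    }
    values.put(index, value);
  }

  /** The maximum size (cardinality), independent of how many entries are stored. */
  int size() {
    return cardinality;
  }

  /** Number of entries actually held in the hashmap. */
  int numStored() {
    return values.size();
  }
}
```

For the vector in the comment, a cardinality of 6 gives size() == 6 and numStored() == 3; with a cardinality of 5, setting index 5 fails the bounds check.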

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, matrix_1, matrix_2


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-17 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686313#comment-13686313
 ] 

Robin Anil commented on MAHOUT-1214:


Can you explain the bug in the math package? Do you have a reproducible test 
case?

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, matrix_1, matrix_2


[jira] [Updated] (MAHOUT-1259) toString() method of SequentialAccessSparseVector has closing brace missing for empty vector

2013-06-12 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1259:
---

Affects Version/s: (was: 1.0)
   0.8

> toString() method of SequentialAccessSparseVector has closing brace missing 
> for empty vector
> 
>
> Key: MAHOUT-1259
> URL: https://issues.apache.org/jira/browse/MAHOUT-1259
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.8
>Reporter: Abhinav M Kulkarni
>Assignee: Robin Anil
>Priority: Trivial
>  Labels: patch
> Fix For: 0.8
>
> Attachments: MAHOUT-1259.patch
>
>
> The toString() method of SequentialAccessSparseVector.java in the Math module 
> does not emit a closing brace for empty vectors. If the sparse vector is empty 
> (or newly created), toString() should return '{}'. Currently it returns '{'.
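
The thread does not show the patch itself, so the following is only a sketch of a formatter that cannot drop the closing brace. The class SparseToStringSketch, the use of a Map for the entries, and the index:value output format are assumptions for illustration, not the actual SequentialAccessSparseVector code.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical formatter; not the code from MAHOUT-1259.patch. */
final class SparseToStringSketch {

  /** Renders entries as {index:value,...}; an empty map renders as {}. */
  static String format(Map<Integer, Double> entries) {
    StringBuilder sb = new StringBuilder("{");
    boolean first = true;
    // Sort by index so the output order is deterministic.
    for (Map.Entry<Integer, Double> e : new TreeMap<Integer, Double>(entries).entrySet()) {
      if (!first) {
        sb.append(',');
      }
      sb.append(e.getKey()).append(':').append(e.getValue());
      first = false;
    }
    // The closing brace is appended unconditionally, outside the loop.
    return sb.append('}').toString();
  }
}
```

Appending the closing brace outside the loop, rather than replacing a trailing separator, means the empty case needs no special handling.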



[jira] [Updated] (MAHOUT-1259) toString() method of SequentialAccessSparseVector has closing brace missing for empty vector

2013-06-12 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1259:
---

Fix Version/s: (was: 1.0)
   0.8

> toString() method of SequentialAccessSparseVector has closing brace missing 
> for empty vector
> 
>
> Key: MAHOUT-1259
> URL: https://issues.apache.org/jira/browse/MAHOUT-1259
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 1.0
>Reporter: Abhinav M Kulkarni
>Assignee: Robin Anil
>Priority: Trivial
>  Labels: patch
> Fix For: 0.8
>
> Attachments: MAHOUT-1259.patch


[jira] [Updated] (MAHOUT-1259) toString() method of SequentialAccessSparseVector has closing brace missing for empty vector

2013-06-12 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1259:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> toString() method of SequentialAccessSparseVector has closing brace missing 
> for empty vector
> 
>
> Key: MAHOUT-1259
> URL: https://issues.apache.org/jira/browse/MAHOUT-1259
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 1.0
>Reporter: Abhinav M Kulkarni
>Assignee: Robin Anil
>Priority: Trivial
>  Labels: patch
> Fix For: 1.0
>
> Attachments: MAHOUT-1259.patch


[jira] [Updated] (MAHOUT-1259) toString() method of SequentialAccessSparseVector has closing brace missing for empty vector

2013-06-12 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1259:
---

Assignee: Robin Anil

> toString() method of SequentialAccessSparseVector has closing brace missing 
> for empty vector
> 
>
> Key: MAHOUT-1259
> URL: https://issues.apache.org/jira/browse/MAHOUT-1259
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 1.0
>Reporter: Abhinav M Kulkarni
>Assignee: Robin Anil
>Priority: Trivial
>  Labels: patch
> Fix For: 1.0
>
> Attachments: MAHOUT-1259.patch


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-12 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681873#comment-13681873
 ] 

Robin Anil commented on MAHOUT-1214:


It doesn't merge properly. Grant, do you know if there is an easy way to do 
this? Basically one has to do this merge by hand. 

Yiqun, since you understand the code better, can you add your tests and fix the 
code in 0.8 to make it pass correctly? 

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, matrix_1, matrix_2, 
> SpectralKMeans.patch


[jira] [Commented] (MAHOUT-1254) Final round of cleanup for StreamingKMeans

2013-06-12 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681869#comment-13681869
 ] 

Robin Anil commented on MAHOUT-1254:


LGTM

> Final round of cleanup for StreamingKMeans
> --
>
> Key: MAHOUT-1254
> URL: https://issues.apache.org/jira/browse/MAHOUT-1254
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Dan Filimon
>Assignee: Dan Filimon
> Attachments: skm.patch
>
>
> Did a bit of tweaking on StreamingKMeans, driver, mapper and reducer to share 
> more code and make it nicer.
> Need to put this in.



[jira] [Updated] (MAHOUT-1259) toString() method of SequentialAccessSparseVector has closing brace missing for empty vector

2013-06-12 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1259:
---

Status: Open  (was: Patch Available)

> toString() method of SequentialAccessSparseVector has closing brace missing 
> for empty vector
> 
>
> Key: MAHOUT-1259
> URL: https://issues.apache.org/jira/browse/MAHOUT-1259
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 1.0
>Reporter: Abhinav M Kulkarni
>Priority: Trivial
>  Labels: patch
> Fix For: 1.0


[jira] [Commented] (MAHOUT-1259) toString() method of SequentialAccessSparseVector has closing brace missing for empty vector

2013-06-12 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681867#comment-13681867
 ] 

Robin Anil commented on MAHOUT-1259:


I don't see any patches attached.

> toString() method of SequentialAccessSparseVector has closing brace missing 
> for empty vector
> 
>
> Key: MAHOUT-1259
> URL: https://issues.apache.org/jira/browse/MAHOUT-1259
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 1.0
>Reporter: Abhinav M Kulkarni
>Priority: Trivial
>  Labels: patch
> Fix For: 1.0


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-12 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681864#comment-13681864
 ] 

Robin Anil commented on MAHOUT-1214:


The patch doesn't apply cleanly on HEAD. Can you sync your Subversion checkout 
to HEAD before making a patch?

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, matrix_1, matrix_2, 
> SpectralKMeans.patch


[jira] [Commented] (MAHOUT-1253) Add experiment tools for StreamingKMeans

2013-06-11 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680651#comment-13680651
 ] 

Robin Anil commented on MAHOUT-1253:


Please also add it to the examples/bin/cluster* shell script examples.

> Add experiment tools for StreamingKMeans
> 
>
> Key: MAHOUT-1253
> URL: https://issues.apache.org/jira/browse/MAHOUT-1253
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Dan Filimon
>Assignee: Dan Filimon
>
> Merge in this patch https://reviews.apache.org/r/11302/



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-09 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679346#comment-13679346
 ] 

Robin Anil commented on MAHOUT-1214:


Not at all. Please feel free to clean up and fix things.

The latest code formatters are available in svn; use those.
http://svn.apache.org/viewvc/mahout/trunk/buildtools/?pathrev=1491304

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: matrix_1, matrix_2, SpectralKMeans.patch
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute it 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1. We have an idea and implementation to select eigenvectors 
> based on cosAngle/orthogonality.
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and 
> sometimes a bad initialization will generate a wrong clustering result. In 
> this case, the selected K eigenvectors actually provide a better way to 
> initialize cluster centroids because each selected eigenvector is a relaxed 
> indicator of the memberships of one cluster. For every selected eigenvector, 
> we use the data point whose eigen component achieves the maximum absolute 
> value. 
> We have already verified our improvement on a synthetic dataset, and it shows 
> that the improved version gets the optimal clustering result while the current 
> 0.7 version obtains the wrong result.
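
The selection idea described under Issue 1 can be sketched as a greedy filter: a candidate eigenvector is kept only if its cosine against every vector kept so far is close to zero. This is a minimal illustration, not the code from the attached patch; the class name, the method names, and the tolerance parameter are assumptions made for the example, and the real EigenVerificationJob change may work differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: greedy selection of mutually near-orthogonal vectors. */
final class OrthogonalSelector {

  /** Cosine of the angle between a and b. Assumes both vectors are nonzero. */
  static double cosAngle(double[] a, double[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / Math.sqrt(normA * normB);
  }

  /**
   * Walks the candidates in order and keeps at most k of them, accepting a
   * candidate only if |cosAngle| against every kept vector is at most tol.
   * Returns the indices of the kept candidates.
   */
  static List<Integer> select(double[][] candidates, int k, double tol) {
    List<Integer> kept = new ArrayList<>();
    for (int i = 0; i < candidates.length && kept.size() < k; i++) {
      boolean orthogonal = true;
      for (int j : kept) {
        if (Math.abs(cosAngle(candidates[i], candidates[j])) > tol) {
          orthogonal = false;
          break;
        }
      }
      if (orthogonal) {
        kept.add(i);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    double[][] candidates = {
      {1, 0, 0},
      {0.9, 0.1, 0}, // nearly parallel to the first candidate, so it is skipped
      {0, 1, 0}
    };
    System.out.println(select(candidates, 2, 1e-6));
  }
}
```

The order in which candidates are visited matters for a greedy filter like this one; the sketch simply takes them in input order.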



[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-09 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1214:
---

Fix Version/s: (was: Backlog)
   0.8

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: matrix_1, matrix_2, SpectralKMeans.patch
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute it 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1. We have an idea and implementation to select eigenvectors 
> based on cosAngle/orthogonality.
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and 
> sometimes a bad initialization will generate a wrong clustering result. In 
> this case, the selected K eigenvectors actually provide a better way to 
> initialize cluster centroids because each selected eigenvector is a relaxed 
> indicator of the memberships of one cluster. For every selected eigenvector, 
> we use the data point whose eigen component achieves the maximum absolute 
> value. 
> We have already verified our improvement on a synthetic dataset, and it shows 
> that the improved version gets the optimal clustering result while the current 
> 0.7 version obtains the wrong result.



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-09 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679327#comment-13679327
 ] 

Robin Anil commented on MAHOUT-1214:


1) I see a lot of code commented out. Make a patch with only the corrected 
change without any code commented out.
2) Please format the code using the Eclipse code formatter (see the last 
section of https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute)
3) Please add your example as a JUnit test.

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: Backlog
>
> Attachments: matrix_1, matrix_2, SpectralKMeans.patch
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute it 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1. We have an idea and implementation to select eigenvectors 
> based on cosAngle/orthogonality.
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and 
> sometimes a bad initialization will generate a wrong clustering result. In 
> this case, the selected K eigenvectors actually provide a better way to 
> initialize cluster centroids because each selected eigenvector is a relaxed 
> indicator of the memberships of one cluster. For every selected eigenvector, 
> we use the data point whose eigen component achieves the maximum absolute 
> value. 
> We have already verified our improvement on a synthetic dataset, and it shows 
> that the improved version gets the optimal clustering result while the current 
> 0.7 version obtains the wrong result.
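
The seeding rule described under Issue 2 (for each selected eigenvector, take the data point whose component has the largest absolute value) can be sketched in a few lines. This is an illustration under assumptions, not the patch itself: the class and method names are made up, and eigenvectors are assumed to be stored as rows with one component per data point.

```java
import java.util.Arrays;

/** Illustrative only: picks one seed data point per selected eigenvector. */
final class EigenSeeds {

  /** Index of the component with the largest absolute value (first on ties). */
  static int argMaxAbs(double[] v) {
    int best = 0;
    for (int i = 1; i < v.length; i++) {
      if (Math.abs(v[i]) > Math.abs(v[best])) {
        best = i;
      }
    }
    return best;
  }

  /**
   * eigenvectors[c][p] is the component of eigenvector c for data point p.
   * Returns, for each eigenvector, the index of the data point used as the
   * initial centroid of the corresponding cluster.
   */
  static int[] seedIndices(double[][] eigenvectors) {
    int[] seeds = new int[eigenvectors.length];
    for (int c = 0; c < eigenvectors.length; c++) {
      seeds[c] = argMaxAbs(eigenvectors[c]);
    }
    return seeds;
  }

  public static void main(String[] args) {
    double[][] eigenvectors = {
      {0.6, 0.8, 0.05, 0.0, 0.0},  // mostly indicates points 0 and 1
      {0.0, 0.02, -0.9, 0.3, 0.3}  // mostly indicates points 2, 3 and 4
    };
    System.out.println(Arrays.toString(seedIndices(eigenvectors)));
  }
}
```

Nothing in this sketch prevents two eigenvectors from choosing the same data point; a real implementation would need to decide how to handle that case.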



[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-09 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1214:
---

Assignee: Robin Anil

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>Assignee: Robin Anil
>  Labels: clustering, improvement
> Fix For: Backlog
>
> Attachments: matrix_1, matrix_2, SpectralKMeans.patch
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute it 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1. We have an idea and implementation to select eigenvectors 
> based on cosAngle/orthogonality.
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and 
> sometimes a bad initialization will generate a wrong clustering result. In 
> this case, the selected K eigenvectors actually provide a better way to 
> initialize cluster centroids because each selected eigenvector is a relaxed 
> indicator of the memberships of one cluster. For every selected eigenvector, 
> we use the data point whose eigen component achieves the maximum absolute 
> value. 
> We have already verified our improvement on a synthetic dataset, and it shows 
> that the improved version gets the optimal clustering result while the current 
> 0.7 version obtains the wrong result.



[jira] [Commented] (MAHOUT-1241) Mailing list archives not available

2013-06-09 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679209#comment-13679209
 ] 

Robin Anil commented on MAHOUT-1241:


The page should not exist anymore. Force refresh

--
Robin Anil






> Mailing list archives not available
> ---
>
> Key: MAHOUT-1241
> URL: https://issues.apache.org/jira/browse/MAHOUT-1241
> Project: Mahout
>  Issue Type: Bug
>  Components: Website
>Reporter: Zeno Gantner
>Assignee: Robin Anil
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1241.patch
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-mahout-user/
> http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/
> give me a 404 error.
> These are the mailing list archives linked from here:
> http://mahout.apache.org/mailinglists.html



[jira] [Updated] (MAHOUT-1246) Bring Mahout website from the 2000s to 2010s.

2013-06-08 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1246:
---

Fix Version/s: 0.8

> Bring Mahout website from the 2000s to 2010s.
> -
>
> Key: MAHOUT-1246
> URL: https://issues.apache.org/jira/browse/MAHOUT-1246
> Project: Mahout
>  Issue Type: Bug
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.8
>
>




[jira] [Updated] (MAHOUT-1191) Cleanup Vector Benchmarks make it less variable

2013-06-08 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1191:
---

Fix Version/s: 0.8

> Cleanup Vector Benchmarks make it less variable
> ---
>
> Key: MAHOUT-1191
> URL: https://issues.apache.org/jira/browse/MAHOUT-1191
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.8
>
> Attachments: MAHOUT-1191.patch, MAHOUT-1191.patch, MAHOUT-1191.patch
>
>




[jira] [Updated] (MAHOUT-1192) Speed up Vector Operations

2013-06-08 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1192:
---

Fix Version/s: 0.8

> Speed up Vector Operations
> --
>
> Key: MAHOUT-1192
> URL: https://issues.apache.org/jira/browse/MAHOUT-1192
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.8
>
> Attachments: MAHOUT-1192.patch
>
>




[jira] [Resolved] (MAHOUT-1246) Bring Mahout website from the 2000s to 2010s.

2013-06-08 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-1246.


Resolution: Fixed

Done!
Humanist fonts, widescreen, twitter, CSS. 

> Bring Mahout website from the 2000s to 2010s.
> -
>
> Key: MAHOUT-1246
> URL: https://issues.apache.org/jira/browse/MAHOUT-1246
> Project: Mahout
>  Issue Type: Bug
>Reporter: Robin Anil
>Assignee: Robin Anil
>




[jira] [Created] (MAHOUT-1246) Bring Mahout website from the 2000s to 2010s.

2013-06-08 Thread Robin Anil (JIRA)
Robin Anil created MAHOUT-1246:
--

 Summary: Bring Mahout website from the 2000s to 2010s.
 Key: MAHOUT-1246
 URL: https://issues.apache.org/jira/browse/MAHOUT-1246
 Project: Mahout
  Issue Type: Bug
Reporter: Robin Anil
Assignee: Robin Anil






[jira] [Resolved] (MAHOUT-1245) Move Website(s) to ASF CMS

2013-06-08 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-1245.


Resolution: Not A Problem
  Assignee: Robin Anil

> Move Website(s) to ASF CMS
> --
>
> Key: MAHOUT-1245
> URL: https://issues.apache.org/jira/browse/MAHOUT-1245
> Project: Mahout
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Robin Anil
> Fix For: 0.9
>
>
> The ASF CMS makes editing sites a whole lot easier using pub-sub and Markdown.
> We should move to it.  We will be much happier.  I'd even propose we move 
> most of our wiki to it and let users comment instead of edit.



[jira] [Resolved] (MAHOUT-1241) Mailing list archives not available

2013-06-08 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-1241.


Resolution: Fixed

> Mailing list archives not available
> ---
>
> Key: MAHOUT-1241
> URL: https://issues.apache.org/jira/browse/MAHOUT-1241
> Project: Mahout
>  Issue Type: Bug
>  Components: Website
>Reporter: Zeno Gantner
>Assignee: Robin Anil
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1241.patch
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-mahout-user/
> http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/
> give me a 404 error.
> These are the mailing list archives linked from here:
> http://mahout.apache.org/mailinglists.html



[jira] [Commented] (MAHOUT-1241) Mailing list archives not available

2013-06-07 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678617#comment-13678617
 ] 

Robin Anil commented on MAHOUT-1241:


Would appreciate an extra pair of eyes on this. [~gsingers] [~srowen]

> Mailing list archives not available
> ---
>
> Key: MAHOUT-1241
> URL: https://issues.apache.org/jira/browse/MAHOUT-1241
> Project: Mahout
>  Issue Type: Bug
>  Components: Website
>Reporter: Zeno Gantner
>Assignee: Robin Anil
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1241.patch
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-mahout-user/
> http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/
> give me a 404 error.
> These are the mailing list archives linked from here:
> http://mahout.apache.org/mailinglists.html



[jira] [Updated] (MAHOUT-1241) Mailing list archives not available

2013-06-07 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1241:
---

Assignee: Robin Anil

> Mailing list archives not available
> ---
>
> Key: MAHOUT-1241
> URL: https://issues.apache.org/jira/browse/MAHOUT-1241
> Project: Mahout
>  Issue Type: Bug
>  Components: Website
>Reporter: Zeno Gantner
>Assignee: Robin Anil
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1241.patch
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-mahout-user/
> http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/
> give me a 404 error.
> These are the mailing list archives linked from here:
> http://mahout.apache.org/mailinglists.html



[jira] [Updated] (MAHOUT-1241) Mailing list archives not available

2013-06-07 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1241:
---

Attachment: MAHOUT-1241.patch

Add .htaccess to redirect everyone to main page.

Delete the old links.

> Mailing list archives not available
> ---
>
> Key: MAHOUT-1241
> URL: https://issues.apache.org/jira/browse/MAHOUT-1241
> Project: Mahout
>  Issue Type: Bug
>  Components: Website
>Reporter: Zeno Gantner
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1241.patch
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-mahout-user/
> http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/
> give me a 404 error.
> These are the mailing list archives linked from here:
> http://mahout.apache.org/mailinglists.html



[jira] [Updated] (MAHOUT-1240) Randomized testing and Serialization of NonZeros

2013-06-04 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1240:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Randomized testing and Serialization of NonZeros
> 
>
> Key: MAHOUT-1240
> URL: https://issues.apache.org/jira/browse/MAHOUT-1240
> Project: Mahout
>  Issue Type: Bug
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.8
>
> Attachments: MAHOUT-1240.patch
>
>
> Currently the nonZero iterator does not guarantee nonZero iteration for 
> certain vectors (RASV, SASV) for performance reasons. However, the vector view 
> iterator adds a zero check. To be correct we have to either remove the check 
> or do correct non-zero serialization everywhere. However, this means going 
> over the vectors in two passes. Given that it is pretty fast already, I am 
> fixing the logic bug. We can tackle the speed-up for the next release.
> This also adds a randomized test for serialization that catches all such bugs.



[jira] [Created] (MAHOUT-1240) Randomized testing and Serialization of NonZeros

2013-06-04 Thread Robin Anil (JIRA)
Robin Anil created MAHOUT-1240:
--

 Summary: Randomized testing and Serialization of NonZeros
 Key: MAHOUT-1240
 URL: https://issues.apache.org/jira/browse/MAHOUT-1240
 Project: Mahout
  Issue Type: Bug
Reporter: Robin Anil
 Fix For: 0.8


Currently the nonZero iterator does not guarantee nonZero iteration for certain 
vectors (RASV, SASV) for performance reasons. However, the vector view iterator 
adds a zero check. To be correct we have to either remove the check or do 
correct non-zero serialization everywhere. However, this means going over the 
vectors in two passes. Given that it is pretty fast already, I am fixing the 
logic bug. We can tackle the speed-up for the next release.

This also adds a randomized test for serialization that catches all such bugs.
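
The two-pass idea and the randomized round-trip check can be sketched roughly as follows. This is not Mahout's VectorWritable and does not use any Mahout class: a plain double[] stands in for a vector whose iterator may yield stored zeros, and the class name, method names, and wire format are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Random;

/** Illustrative only: sparse encoding that counts true nonzeros first. */
final class TwoPassSparseCodec {

  static byte[] write(double[] values) {
    int nonZeros = 0;
    for (double v : values) {        // pass 1: count the real nonzeros
      if (v != 0.0) {
        nonZeros++;
      }
    }
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(values.length);
      out.writeInt(nonZeros);        // header matches what pass 2 writes
      for (int i = 0; i < values.length; i++) {  // pass 2: write the pairs
        if (values[i] != 0.0) {
          out.writeInt(i);
          out.writeDouble(values[i]);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  static double[] read(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      double[] values = new double[in.readInt()];
      int nonZeros = in.readInt();
      for (int n = 0; n < nonZeros; n++) {
        int index = in.readInt();
        values[index] = in.readDouble();
      }
      return values;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Round-trips random vectors with many zeros; true if every one matches. */
  static boolean randomizedRoundTrip(long seed, int trials) {
    Random rnd = new Random(seed);
    for (int t = 0; t < trials; t++) {
      double[] v = new double[rnd.nextInt(50)];
      for (int i = 0; i < v.length; i++) {
        v[i] = rnd.nextInt(4) == 0 ? rnd.nextGaussian() : 0.0;
      }
      if (!Arrays.equals(v, read(write(v)))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(randomizedRoundTrip(42L, 200));
  }
}
```

The extra counting pass is the cost mentioned above: the data is traversed twice so that the count written in the header always agrees with the number of entries that follow.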



[jira] [Updated] (MAHOUT-1240) Randomized testing and Serialization of NonZeros

2013-06-04 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1240:
---

Attachment: MAHOUT-1240.patch

> Randomized testing and Serialization of NonZeros
> 
>
> Key: MAHOUT-1240
> URL: https://issues.apache.org/jira/browse/MAHOUT-1240
> Project: Mahout
>  Issue Type: Bug
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.8
>
> Attachments: MAHOUT-1240.patch
>
>
> Currently the nonZero iterator does not guarantee nonZero iteration for 
> certain vectors (RASV, SASV) for performance reasons. However, the vector view 
> iterator adds a zero check. To be correct we have to either remove the check 
> or do correct non-zero serialization everywhere. However, this means going 
> over the vectors in two passes. Given that it is pretty fast already, I am 
> fixing the logic bug. We can tackle the speed-up for the next release.
> This also adds a randomized test for serialization that catches all such bugs.



[jira] [Updated] (MAHOUT-1240) Randomized testing and Serialization of NonZeros

2013-06-04 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1240:
---

Status: Patch Available  (was: Open)

> Randomized testing and Serialization of NonZeros
> 
>
> Key: MAHOUT-1240
> URL: https://issues.apache.org/jira/browse/MAHOUT-1240
> Project: Mahout
>  Issue Type: Bug
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.8
>
> Attachments: MAHOUT-1240.patch
>
>
> Currently the nonZero iterator does not guarantee nonZero iteration for 
> certain vectors (RASV, SASV) for performance reasons. However, the vector view 
> iterator adds a zero check. To be correct we have to either remove the check 
> or do correct non-zero serialization everywhere. However, this means going 
> over the vectors in two passes. Given that it is pretty fast already, I am 
> fixing the logic bug. We can tackle the speed-up for the next release.
> This also adds a randomized test for serialization that catches all such bugs.



[jira] [Assigned] (MAHOUT-1240) Randomized testing and Serialization of NonZeros

2013-06-04 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil reassigned MAHOUT-1240:
--

Assignee: Robin Anil

> Randomized testing and Serialization of NonZeros
> 
>
> Key: MAHOUT-1240
> URL: https://issues.apache.org/jira/browse/MAHOUT-1240
> Project: Mahout
>  Issue Type: Bug
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.8
>
>
> Currently the nonZero iterator does not guarantee nonZero iteration for 
> certain vectors (RASV, SASV) for performance reasons. However, the vector view 
> iterator adds a zero check. To be correct we have to either remove the check 
> or do correct non-zero serialization everywhere. However, this means going 
> over the vectors in two passes. Given that it is pretty fast already, I am 
> fixing the logic bug. We can tackle the speed-up for the next release.
> This also adds a randomized test for serialization that catches all such bugs.



[jira] [Updated] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors

2013-06-03 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1238:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Tested and Submitted 

> VectorWritable's bug with VectorView of sparse vectors
> --
>
> Key: MAHOUT-1238
> URL: https://issues.apache.org/jira/browse/MAHOUT-1238
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7, 0.8
>Reporter: Maysam Yabandeh
>Assignee: Robin Anil
>  Labels: reduce, test
> Fix For: 0.8, 0.7
>
> Attachments: MAHOUT-1238.patch, MAHOUT-1238.patch
>
>
> VectorWritable raises an exception if it is used on a VectorView of a sparse 
> vector. The reason is that the sparse vector writes only the non-zero 
> elements, while VectorView's implementation of getNumNondefaultElements() 
> returns the size of the entire data. Later when reading the vector, 
> VectorWritable expects to read more items than were written.



[jira] [Commented] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors

2013-06-03 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673114#comment-13673114
 ] 

Robin Anil commented on MAHOUT-1238:


Also, I see test failures. Don't change the behavior of 
getNumNondefaultElements(); instead, try to correct the serialization code to 
write the correct size.

> VectorWritable's bug with VectorView of sparse vectors
> --
>
> Key: MAHOUT-1238
> URL: https://issues.apache.org/jira/browse/MAHOUT-1238
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7, 0.8
>Reporter: Maysam Yabandeh
>Assignee: Robin Anil
>  Labels: reduce, test
> Fix For: 0.7, 0.8
>
> Attachments: MAHOUT-1238.patch
>
>
> VectorWritable raises an exception if it is used on a VectorView of a sparse 
> vector. The reason is that the sparse vector writes only the non-zero 
> elements, while VectorView's implementation of getNumNondefaultElements() 
> returns the size of the entire data. Later when reading the vector, 
> VectorWritable expects to read more items than were written.
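
The failure described above comes down to a header count that disagrees with the number of entries actually written. The following sketch reproduces that mismatch with plain java.io streams; it does not use VectorWritable or VectorView, and the class name, method names, and wire format are hypothetical stand-ins chosen for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Illustrative only: a declared entry count that may not match the payload. */
final class CountMismatchDemo {

  static int countNonZeros(double[] view) {
    int count = 0;
    for (double v : view) {
      if (v != 0.0) {
        count++;
      }
    }
    return count;
  }

  /** Writes only the nonzero entries but puts declaredCount in the header. */
  static byte[] write(double[] view, int declaredCount) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(declaredCount);
      for (int i = 0; i < view.length; i++) {
        if (view[i] != 0.0) {
          out.writeInt(i);
          out.writeDouble(view[i]);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** True if a reader that trusts the header can consume the whole payload. */
  static boolean readable(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int count = in.readInt();
      for (int n = 0; n < count; n++) {
        in.readInt();
        in.readDouble();
      }
      return true;
    } catch (EOFException e) {
      return false; // the header promised more entries than were written
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    double[] view = {0.0, 2.0, 0.0, 0.0, 5.0};
    // Header set to the full view size, as in the reported bug.
    System.out.println(readable(write(view, view.length)));
    // Header set to the number of entries actually written.
    System.out.println(readable(write(view, countNonZeros(view))));
  }
}
```

Writing the real number of entries in the header, rather than changing what the size-reporting method returns, is the direction suggested in the comment above.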



[jira] [Updated] (MAHOUT-976) Implement Multilayer Perceptron

2013-06-03 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-976:
--

Fix Version/s: (was: 0.8)
   Backlog

> Implement Multilayer Perceptron
> ---
>
> Key: MAHOUT-976
> URL: https://issues.apache.org/jira/browse/MAHOUT-976
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.7
>Reporter: Christian Herta
>Assignee: Ted Dunning
>Priority: Minor
>  Labels: multilayer, networks, neural, perceptron
> Fix For: Backlog
>
> Attachments: MAHOUT-976.patch, MAHOUT-976.patch, MAHOUT-976.patch, 
> MAHOUT-976.patch
>
>   Original Estimate: 80h
>  Remaining Estimate: 80h
>
> Implement a multi layer perceptron
>  * via Matrix Multiplication
>  * Learning by Backpropagation; implementing tricks by Yann LeCun et al.: 
> "Efficient Backprop"
>  * arbitrary number of hidden layers (also 0  - just the linear model)
>  * connection between proximate layers only 
>  * different cost and activation functions (different activation function in 
> each layer) 
>  * test of backprop by gradient checking 
>  * normalization of the inputs (storeable) as part of the model
>  
> First:
>  * implementation "stochastic gradient descent" like gradient machine
>  * simple gradient descent incl. momentum
> Later (new jira issues):  
>  * Distributed Batch learning (see below)  
>  * "Stacked (Denoising) Autoencoder" - Feature Learning
>  * advanced cost minimization like 2nd order methods, conjugate gradient etc.
> Distribution of learning can be done by (batch learning):
>  1 Partitioning of the data in x chunks 
>  2 Learning the weight changes as matrices in each chunk
>  3 Combining the matrices and update of the weights - back to 2
> Maybe this procedure can be done with random parts of the chunks (distributed 
> quasi online learning). 
> Batch learning with delta-bar-delta heuristics for adapting the learning 
> rates.
>  
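
The "test of backprop by gradient checking" item in the list above can be illustrated on the smallest possible case: one sigmoid unit with a squared-error cost, where the analytic gradient is compared against central finite differences. This is a sketch under those assumptions, not code from the attached patches; the class and method names are made up for the example.

```java
/** Illustrative only: gradient check for a single sigmoid unit. */
final class GradientCheck {

  static double sigmoid(double z) {
    return 1.0 / (1.0 + Math.exp(-z));
  }

  /** Cost 0.5 * (sigmoid(w . x) - y)^2 for one training example. */
  static double loss(double[] w, double[] x, double y) {
    double z = 0.0;
    for (int i = 0; i < w.length; i++) {
      z += w[i] * x[i];
    }
    double a = sigmoid(z);
    return 0.5 * (a - y) * (a - y);
  }

  /** Backprop gradient: (a - y) * a * (1 - a) * x[i]. */
  static double[] analyticGradient(double[] w, double[] x, double y) {
    double z = 0.0;
    for (int i = 0; i < w.length; i++) {
      z += w[i] * x[i];
    }
    double a = sigmoid(z);
    double delta = (a - y) * a * (1.0 - a);
    double[] grad = new double[w.length];
    for (int i = 0; i < w.length; i++) {
      grad[i] = delta * x[i];
    }
    return grad;
  }

  /** Largest absolute gap between the analytic and the numeric gradient. */
  static double maxGradientError(double[] w, double[] x, double y, double eps) {
    double[] grad = analyticGradient(w, x, y);
    double[] probe = w.clone();
    double worst = 0.0;
    for (int i = 0; i < w.length; i++) {
      probe[i] = w[i] + eps;
      double plus = loss(probe, x, y);
      probe[i] = w[i] - eps;
      double minus = loss(probe, x, y);
      probe[i] = w[i]; // restore before moving to the next weight
      double numeric = (plus - minus) / (2.0 * eps);
      worst = Math.max(worst, Math.abs(numeric - grad[i]));
    }
    return worst;
  }

  public static void main(String[] args) {
    double[] w = {0.3, -0.2, 0.5};
    double[] x = {1.0, 2.0, -1.0};
    System.out.println(maxGradientError(w, x, 1.0, 1e-5));
  }
}
```

For a multi-layer network the same comparison would be run per weight matrix entry; a gap much larger than the finite-difference error usually points to a mistake in the backpropagation code.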



[jira] [Commented] (MAHOUT-976) Implement Multilayer Perceptron

2013-06-03 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673111#comment-13673111
 ] 

Robin Anil commented on MAHOUT-976:
---

I see a few System.out.println() calls; please remove those. Also use the Mahout 
Eclipse code formatter to format the files. [~chrisberlin] will you be able to 
work on these quickly? I am pushing it off the 0.8 list. If you can work on it, 
please update it and we will review it.

> Implement Multilayer Perceptron
> ---
>
> Key: MAHOUT-976
> URL: https://issues.apache.org/jira/browse/MAHOUT-976
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.7
>Reporter: Christian Herta
>Assignee: Ted Dunning
>Priority: Minor
>  Labels: multilayer, networks, neural, perceptron
> Fix For: 0.8
>
> Attachments: MAHOUT-976.patch, MAHOUT-976.patch, MAHOUT-976.patch, 
> MAHOUT-976.patch
>
>   Original Estimate: 80h
>  Remaining Estimate: 80h
>
> Implement a multi layer perceptron
>  * via Matrix Multiplication
>  * Learning by Backpropagation; implementing tricks by Yann LeCun et al.: 
> "Efficient Backprop"
>  * arbitrary number of hidden layers (also 0  - just the linear model)
>  * connection between proximate layers only 
>  * different cost and activation functions (different activation function in 
> each layer) 
>  * test of backprop by gradient checking 
>  * normalization of the inputs (storeable) as part of the model
>  
> First:
>  * implementation "stochastic gradient descent" like gradient machine
>  * simple gradient descent incl. momentum
> Later (new jira issues):  
>  * Distributed Batch learning (see below)  
>  * "Stacked (Denoising) Autoencoder" - Feature Learning
>  * advanced cost minimization like 2nd order methods, conjugate gradient etc.
> Distribution of learning can be done by (batch learning):
>  1 Partitioning of the data in x chunks 
>  2 Learning the weight changes as matrices in each chunk
>  3 Combining the matrices and update of the weights - back to 2
> Maybe this procedure can be done with random parts of the chunks (distributed 
> quasi online learning). 
> Batch learning with delta-bar-delta heuristics for adapting the learning 
> rates.
>  
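
The three-step batch procedure in the description can be sketched as follows. This is only an illustration under stated assumptions: a linear model with squared error stands in for the perceptron, the chunks are processed sequentially rather than on separate machines, and all names (ChunkedBatchSketch, partition, chunkGradient, train) are hypothetical and not part of Mahout or the attached patches.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkedBatchSketch {
    /** Step 1: split the row indices [0, n) into contiguous [from, to) chunks. */
    static List<int[]> partition(int n, int chunks) {
        List<int[]> ranges = new ArrayList<>();
        int base = n / chunks;
        int extra = n % chunks;
        int start = 0;
        for (int c = 0; c < chunks; c++) {
            int len = base + (c < extra ? 1 : 0);
            ranges.add(new int[] {start, start + len});
            start += len;
        }
        return ranges;
    }

    /** Step 2: summed squared-error gradient of y = w . x over one chunk. */
    static double[] chunkGradient(double[][] x, double[] y, double[] w, int from, int to) {
        double[] g = new double[w.length];
        for (int r = from; r < to; r++) {
            double pred = 0.0;
            for (int j = 0; j < w.length; j++) {
                pred += w[j] * x[r][j];
            }
            double err = pred - y[r];
            for (int j = 0; j < w.length; j++) {
                g[j] += err * x[r][j];
            }
        }
        return g;
    }

    /** Step 3: combine the per-chunk gradients, update the weights, repeat from step 2. */
    static void train(double[][] x, double[] y, double[] w, int chunks, double rate, int epochs) {
        List<int[]> ranges = partition(x.length, chunks);
        for (int e = 0; e < epochs; e++) {
            double[] total = new double[w.length];
            for (int[] range : ranges) {
                double[] g = chunkGradient(x, y, w, range[0], range[1]);
                for (int j = 0; j < w.length; j++) {
                    total[j] += g[j];
                }
            }
            for (int j = 0; j < w.length; j++) {
                w[j] -= rate * total[j] / x.length;
            }
        }
    }

    public static void main(String[] args) {
        double[][] x = {{1}, {2}, {3}, {4}};
        double[] y = {2, 4, 6, 8};          // generated by y = 2 * x
        double[] w = {0};
        train(x, y, w, 2, 0.1, 200);
        System.out.println("learned weight: " + w[0]);
    }
}
```

Because the per-chunk gradients are simply summed, the combined update is the same as a full-batch update; only the gradient computation in step 2 would need to run in parallel.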



[jira] [Commented] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors

2013-06-03 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673108#comment-13673108
 ] 

Robin Anil commented on MAHOUT-1238:


There is a getNumNonZeroElements() method in AbstractVector; try using that.

> VectorWritable's bug with VectorView of sparse vectors
> --
>
> Key: MAHOUT-1238
> URL: https://issues.apache.org/jira/browse/MAHOUT-1238
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7, 0.8
>Reporter: Maysam Yabandeh
>Assignee: Robin Anil
>  Labels: reduce, test
> Fix For: 0.7, 0.8
>
> Attachments: MAHOUT-1238.patch
>
>
> VectorWritable raises an exception if it is used on a VectorView of a sparse 
> vector. The reason is that the sparse vector writes only the non-zero 
> elements, while VectorView's implementation of getNumNondefaultElements() 
> returns the size of the entire data. Later, when reading the vector, 
> VectorWritable expects to read more items than were written.
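
The mismatch described above can be reproduced without Mahout. The sketch below assumes a serialization format with a count header followed by (index, value) pairs; the class CountMismatchSketch and its methods are hypothetical stand-ins, not the actual VectorWritable code. When the declared count is the full size instead of the number of non-zero entries, the reader runs past the end of the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class CountMismatchSketch {
    /** Number of entries that write() actually emits. */
    static int countNonZero(double[] values) {
        int count = 0;
        for (double v : values) {
            if (v != 0.0) {
                count++;
            }
        }
        return count;
    }

    /** Writes a count header, then only the non-zero (index, value) pairs. */
    static byte[] write(double[] values, int declaredCount) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(declaredCount);
        for (int i = 0; i < values.length; i++) {
            if (values[i] != 0.0) {
                out.writeInt(i);
                out.writeDouble(values[i]);
            }
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads exactly as many pairs as the header declares. */
    static double[] read(byte[] data, int size) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int count = in.readInt();
        double[] result = new double[size];
        for (int k = 0; k < count; k++) {
            int index = in.readInt();      // EOFException if fewer pairs were written
            result[index] = in.readDouble();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        double[] view = {0.0, 3.0, 0.0, 5.0};
        // Correct header: the number of non-zero entries.
        read(write(view, countNonZero(view)), view.length);
        try {
            // Wrong header: the full size of the view.
            read(write(view, view.length), view.length);
        } catch (IOException e) {
            System.out.println("read failed: " + e);
        }
    }
}
```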



[jira] [Commented] (MAHOUT-1225) Sets and maps incorrectly clear() their state arrays (potential endless loops)

2013-06-03 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673072#comment-13673072
 ] 

Robin Anil commented on MAHOUT-1225:


Yes, I saw that; that's why I committed your patch. I thought there was something 
else.

http://svn.apache.org/viewvc/mahout/trunk/math/src/main/java-templates/org/apache/mahout/math/map/OpenObjectValueTypeHashMap.java.t?r1=1488607&r2=1488606&pathrev=1488607

> Sets and maps incorrectly clear() their state arrays (potential endless loops)
> --
>
> Key: MAHOUT-1225
> URL: https://issues.apache.org/jira/browse/MAHOUT-1225
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
> Environment: Eclipse, linux Fedora 17, Java 1.7, Mahout Maths 
> collections (Set) 0.7, hppc 0.4.3
>Reporter: Sophie Sperner
>Assignee: Dawid Weiss
>  Labels: hashset, java, mahout, test
> Fix For: 0.7
>
> Attachments: hppc-0.4.3.jar, MAHOUT-1225.patch, MAHOUT-1225.patch, 
> MAHOUT-1225.patch, mahout-math-0.8-SNAPSHOT.jar
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The code I attached hangs forever; Eclipse does not print a stack 
> trace because the program never terminates. So I decided to make a small 
> test.java file that you can easily run.
> The main function simply runs the getItemList() method, which 
> successfully executes the getDataset() method (please download the mushroom.dat 
> dataset and set its full path in the filePath string variable) and then hangs 
> (the problem happens on the fourth columnValues.add() call). After the dataset 
> has been loaded into the X array, the code simply goes through X column by 
> column and searches for the distinct items in it.
> If you uncomment IntSet columnValues = new IntOpenHashSet(); and the 
> corresponding import statements, then everything works fine (you will 
> also need to include the hppc jar file found at 
> http://labs.carrotsearch.com/hppc.html or in the attachments below).



[jira] [Updated] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors

2013-06-03 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1238:
---

Assignee: Robin Anil

> VectorWritable's bug with VectorView of sparse vectors
> --
>
> Key: MAHOUT-1238
> URL: https://issues.apache.org/jira/browse/MAHOUT-1238
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7, 0.8
>Reporter: Maysam Yabandeh
>Assignee: Robin Anil
>  Labels: reduce, test
> Fix For: 0.7, 0.8
>
> Attachments: MAHOUT-1238.patch
>
>



[jira] [Commented] (MAHOUT-1225) Sets and maps incorrectly clear() their state arrays (potential endless loops)

2013-06-03 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672909#comment-13672909
 ] 

Robin Anil commented on MAHOUT-1225:


Could you elaborate on the buggy scenario? I don't see an option to reopen 
it myself.

> Sets and maps incorrectly clear() their state arrays (potential endless loops)
> --
>
> Key: MAHOUT-1225
> URL: https://issues.apache.org/jira/browse/MAHOUT-1225
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
> Environment: Eclipse, linux Fedora 17, Java 1.7, Mahout Maths 
> collections (Set) 0.7, hppc 0.4.3
>Reporter: Sophie Sperner
>Assignee: Dawid Weiss
>  Labels: hashset, java, mahout, test
> Fix For: 0.7
>
> Attachments: hppc-0.4.3.jar, MAHOUT-1225.patch, MAHOUT-1225.patch, 
> MAHOUT-1225.patch, mahout-math-0.8-SNAPSHOT.jar
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>



[jira] [Updated] (MAHOUT-928) Add the ARFF data loader/converter on DF

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-928:
--

Fix Version/s: (was: 0.8)
   Backlog

> Add the ARFF data loader/converter on DF
> 
>
> Key: MAHOUT-928
> URL: https://issues.apache.org/jira/browse/MAHOUT-928
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Reporter: Ikumasa Mukai
>Assignee: Ikumasa Mukai
>  Labels: DecisionForest
> Fix For: Backlog
>
>
> ArffDataLoader, ArffData, ArffInvalidFormatException are made for checking 
> the model on MAHOUT-840.
> I think this function improves usability. (now we have to remove headers from 
> arff data before using mahout)



[jira] [Updated] (MAHOUT-1164) Make ARFF integration generate meta-data in JSON format

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1164:
---

Fix Version/s: 0.8

> Make ARFF integration generate meta-data in JSON format
> ---
>
> Key: MAHOUT-1164
> URL: https://issues.apache.org/jira/browse/MAHOUT-1164
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, Integration
>Affects Versions: 0.7
>Reporter: Marty Kube
>Assignee: Ted Dunning
> Fix For: 0.8
>
> Attachments: MAHOUT-1164.patch, MAHOUT-1164.patch, MAHOUT-1164.patch
>
>
> Add a command line option to generate meta-data in a JSON format.
> This ticket supports the larger goal of making RF classifiers consume the 
> meta-data and sequence files generated by the integration components.  
> MAHOUT-1163 makes RF classifiers read JSON meta-data.  This ticket is to get 
> the ARFF integration to generate the same JSON formatted meta-data.



[jira] [Updated] (MAHOUT-976) Implement Multilayer Perceptron

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-976:
--

Fix Version/s: 0.8

> Implement Multilayer Perceptron
> ---
>
> Key: MAHOUT-976
> URL: https://issues.apache.org/jira/browse/MAHOUT-976
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.7
>Reporter: Christian Herta
>Assignee: Ted Dunning
>Priority: Minor
>  Labels: multilayer, networks, neural, perceptron
> Fix For: 0.8
>
> Attachments: MAHOUT-976.patch, MAHOUT-976.patch, MAHOUT-976.patch, 
> MAHOUT-976.patch
>
>   Original Estimate: 80h
>  Remaining Estimate: 80h
>



[jira] [Updated] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1233:
---

Fix Version/s: 0.8

> Problem in processing datasets as a single chunk vs many chunks in HADOOP 
> mode in mostly all the clustering algos
> -
>
> Key: MAHOUT-1233
> URL: https://issues.apache.org/jira/browse/MAHOUT-1233
> Project: Mahout
>  Issue Type: Question
>  Components: Clustering
>Affects Versions: 0.7, 0.8
>Reporter: yannis ats
>Assignee: yannis ats
>Priority: Minor
> Fix For: 0.8
>
>
> I am trying to process a dataset in two ways.
> First, I give it as a single chunk (the whole dataset), and second, as many 
> smaller chunks in order to increase the throughput of my machine.
> When I perform the single-chunk computation, the results are fine: 
> if I have 1000 vectors in the input, I get 1000 vector ids with their 
> cluster ids in the output (I have tried canopy, kmeans, 
> and fuzzy kmeans).
> However, when I split the dataset in order to speed up the computations, 
> strange phenomena occur.
> For instance, if the same dataset of 1000 vectors is split into, for 
> example, 10 files, then the output contains more vector ids (e.g. 1100 
> vector ids with their corresponding cluster ids).
> The question is, am I doing something wrong in the process?
> Is there a problem in clusterdump and seqdumper when the input is in many 
> files?
> While Mahout is performing the computations, the screen output 
> says that it processed the correct number of vectors.
> Am I missing something?
> As input I use the Weka vectors transformed to mvc.
> I have tried this in v0.7 and the v0.8 snapshot.
> Thank you in advance for your time.



[jira] [Updated] (MAHOUT-928) Add the ARFF data loader/converter on DF

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-928:
--

Fix Version/s: 0.8

> Add the ARFF data loader/converter on DF
> 
>
> Key: MAHOUT-928
> URL: https://issues.apache.org/jira/browse/MAHOUT-928
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Reporter: Ikumasa Mukai
>Assignee: Ikumasa Mukai
>  Labels: DecisionForest
> Fix For: 0.8
>
>



[jira] [Updated] (MAHOUT-958) NullPointerException in RepresentativePointsMapper when running cluster-reuters.sh example with kmeans

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-958:
--

Fix Version/s: 0.8

> NullPointerException in RepresentativePointsMapper when running 
> cluster-reuters.sh example with kmeans
> --
>
> Key: MAHOUT-958
> URL: https://issues.apache.org/jira/browse/MAHOUT-958
> Project: Mahout
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 0.6
> Environment: {code}
> > uname -a
> Linux 3.2.1-3.fc16.x86_64 #1 SMP Mon Jan 23 15:36:17 UTC 2012 x86_64 x86_64 
> x86_64 GNU/Linux
> {code}
> {code}
> > java -version
> java version "1.7.0_02"
> Java(TM) SE Runtime Environment (build 1.7.0_02-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 22.0-b10, mixed mode)
> {code}
> Hadoop Version: 0.20.203.0, r1099333
>Reporter: Rares Vernica
>Assignee: Grant Ingersoll
> Fix For: 0.8
>
> Attachments: MAHOUT-958.patch
>
>
> {code}
> > svn info
> Path: .
> URL: http://svn.apache.org/repos/asf/mahout/trunk
> Repository Root: http://svn.apache.org/repos/asf
> Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
> Revision: 1235544
> Node Kind: directory
> Schedule: normal
> Last Changed Author: tdunning
> Last Changed Rev: 1231800
> Last Changed Date: 2012-01-15 16:01:38 -0800 (Sun, 15 Jan 2012)
> {code}
> {code}
> > ./examples/bin/cluster-reuters.sh
> ...
> 1. kmeans clustering
> ...
> Inter-Cluster Density: NaN
> Intra-Cluster Density: 0.0
> CDbw Inter-Cluster Density: 0.0
> CDbw Intra-Cluster Density: NaN
> CDbw Separation: 0.0
> 12/01/24 16:08:47 INFO clustering.ClusterDumper: Wrote 20 clusters
> 12/01/24 16:08:47 INFO driver.MahoutDriver: Program took 126749 ms (Minutes: 
> 2.11248335)
> {code}
> All five "{{Representative Points Driver}}" jobs fail.
> {code}
> 2012-01-24 16:07:11,555 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded 
> the native-hadoop library
> 2012-01-24 16:07:11,881 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 
> 100
> 2012-01-24 16:07:11,896 INFO org.apache.hadoop.mapred.MapTask: data buffer = 
> 79691776/99614720
> 2012-01-24 16:07:11,896 INFO org.apache.hadoop.mapred.MapTask: record buffer 
> = 262144/327680
> 2012-01-24 16:07:11,956 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
> Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2012-01-24 16:07:11,979 INFO org.apache.hadoop.io.nativeio.NativeIO: 
> Initialized cache for UID to User mapping with a cache timeout of 14400 
> seconds.
> 2012-01-24 16:07:11,979 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
> UserName vernica for UID 1000 from the native implementation
> 2012-01-24 16:07:11,981 WARN org.apache.hadoop.mapred.Child: Error running 
> child
> java.lang.NullPointerException
>   at 
> org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.mapPoint(RepresentativePointsMapper.java:73)
>   at 
> org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.map(RepresentativePointsMapper.java:60)
>   at 
> org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.map(RepresentativePointsMapper.java:40)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>   at org.apache.hadoop.mapred.Child.main(Child.java:253)
> {code}



[jira] [Updated] (MAHOUT-1237) Total cost isn't computed properly

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1237:
---

Fix Version/s: 0.8

> Total cost isn't computed properly
> --
>
> Key: MAHOUT-1237
> URL: https://issues.apache.org/jira/browse/MAHOUT-1237
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Dan Filimon
>Assignee: Dan Filimon
>Priority: Minor
> Fix For: 0.8
>
>
> The problem is that it adds up cluster weights instead of computing the sum 
> of all the distances.
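
To illustrate the intended computation: a minimal sketch of a total cost that sums each point's distance to its closest centroid, rather than adding up cluster weights. The class TotalCostSketch, its method names, and the choice of Euclidean distance are assumptions made for the example; the issue does not state which distance measure the affected code uses.

```java
final class TotalCostSketch {
    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Sum over all points of the distance to the closest centroid. */
    static double totalCost(double[][] points, double[][] centroids) {
        double total = 0.0;
        for (double[] p : points) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] c : centroids) {
                best = Math.min(best, distance(p, c));
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {3, 4}, {10, 0}};
        double[][] centroids = {{0, 0}, {10, 0}};
        System.out.println("total cost: " + totalCost(points, centroids));
    }
}
```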



[jira] [Updated] (MAHOUT-1098) ColumnMeansJob broken

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1098:
---

Fix Version/s: 0.8

> ColumnMeansJob broken
> -
>
> Key: MAHOUT-1098
> URL: https://issues.apache.org/jira/browse/MAHOUT-1098
> Project: Mahout
>  Issue Type: Bug
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 0.8
>
>
> getting various errors, e.g.
> java.lang.IllegalStateException: java.lang.ClassNotFoundException: 
> DistributedRowMatrix.columnMeans.vector.class
>   at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:38)
>   at 
> org.apache.mahout.math.hadoop.MatrixColumnMeansJob$MatrixColumnMeansMapper.map(MatrixColumnMeansJob.java:159)
>   at 
> org.apache.mahout.math.hadoop.MatrixColumnMeansJob$MatrixColumnMeansMapper.map(MatrixColumnMeansJob.java:134)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>   at org.apache.hadoop.mapred.Child.main(Child.java:264)



[jira] [Updated] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1233:
---

Assignee: yannis ats

> Problem in processing datasets as a single chunk vs many chunks in HADOOP 
> mode in mostly all the clustering algos
> -
>
> Key: MAHOUT-1233
> URL: https://issues.apache.org/jira/browse/MAHOUT-1233
> Project: Mahout
>  Issue Type: Question
>  Components: Clustering
>Affects Versions: 0.7, 0.8
>Reporter: yannis ats
>Assignee: yannis ats
>Priority: Minor
>



[jira] [Updated] (MAHOUT-1084) Kmeans for synthetic control example--there are 12 cluster during iterations.

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1084:
---

Assignee: Robin Anil

> Kmeans for synthetic control example--there are 12 cluster during iterations.
> -
>
> Key: MAHOUT-1084
> URL: https://issues.apache.org/jira/browse/MAHOUT-1084
> Project: Mahout
>  Issue Type: Bug
>Reporter: liutengfei
>Assignee: Robin Anil
> Fix For: 0.8
>
>
>    In the Mahout KMeans syntheticcontrol example, using the default 
> parameters means computing 6 clusters in the end. But why are there 12 clusters 
> during the KMeans iterations? According to my observation, the first 6 clusters 
> and the last 6 clusters are identical before the first iteration; those 6 
> clusters are generated by RandomSeedGenerator.java. The CIMapper then 
> assigns its points to these 12 clusters. Is there a logical error here?
>    The 12 clusters are created by the function "setup" in CIMapper.java, 
> more specifically by the line "classifier.readFromSeqFiles(conf, new 
> Path(priorClustersPath));". Here "priorClustersPath" is the HDFS directory 
> "output/clusters-0/", which contains 8 files: 
> "_policy", "part-randomSeed" (one file recording six clusters), and "part-0" to 
> "part-5" (six files, each recording one cluster). When this directory is 
> read, "_policy" is filtered out, so the program reads "part-0" 
> to "part-5" to create six clusters, then reads "part-randomSeed" to create 
> the other six clusters. This is why there are 12 clusters 
> before the first iteration.
>   Solution: delete the associated code to avoid creating duplicate clusters 
> in "output/clusters-0/". Here I removed the code that creates the files "part-0" 
> to "part-5" in ClusterClassifier.java:
>   public void writeToSeqFiles(Path path) throws IOException {
>     writePolicy(policy, path);
>     /*
>     Configuration config = new Configuration();
>     FileSystem fs = FileSystem.get(path.toUri(), config);
>     SequenceFile.Writer writer = null;
>     ClusterWritable cw = new ClusterWritable();
>     for (int i = 0; i < models.size(); i++) {
>       try {
>         Cluster cluster = models.get(i);
>         cw.setValue(cluster);
>         writer = new SequenceFile.Writer(fs, config,
>             new Path(path, "part-" + String.format(Locale.ENGLISH, "%05d", i)),
>             IntWritable.class, ClusterWritable.class);
>         Writable key = new IntWritable(i);
>         writer.append(key, cw);
>       } finally {
>         Closeables.closeQuietly(writer);
>       }
>     }
>     */
>   }
> I don't know whether this is still okay for other programs that use this file, 
> but for KMeans in the syntheticcontrol example, the program creates 6 clusters 
> in every iteration, as I expected.



[jira] [Updated] (MAHOUT-1196) LogisticModelParameters uses csv.getTargetCategories() even if csv is not used.

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1196:
---

Assignee: Vineet Krishnan

> LogisticModelParameters uses csv.getTargetCategories() even if csv is not 
> used.
> ---
>
> Key: MAHOUT-1196
> URL: https://issues.apache.org/jira/browse/MAHOUT-1196
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.8
> Environment: All
>Reporter: Vineet Krishnan
>Assignee: Vineet Krishnan
>Priority: Trivial
>  Labels: CSV, Classifier, LogisticModelParameters
> Fix For: 0.8
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> saveTo(OutputStream out) calls csv.getTargetCategories() even when 
> targetCategories has already been set. When CsvRecordFactory is not used, this 
> causes a NullPointerException when saveTo() is called.
> IMHO a simple null check for targetCategories is sufficient.
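
A minimal sketch of the proposed null check, using hypothetical stand-in types rather than the real LogisticModelParameters and CsvRecordFactory classes. The CategorySource interface, the categoriesForSave() method, and the explicit IllegalStateException for the case where neither source is available are all assumptions added for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class TargetCategoriesSketch {
    /** Hypothetical stand-in for the CSV record factory; not Mahout's class. */
    interface CategorySource {
        List<String> getTargetCategories();
    }

    private CategorySource csv;               // null when no CSV factory is used
    private List<String> targetCategories;    // may have been set directly

    void setCsv(CategorySource csv) {
        this.csv = csv;
    }

    void setTargetCategories(List<String> targetCategories) {
        this.targetCategories = targetCategories;
    }

    /** Consults the CSV factory only when the categories were not already set. */
    List<String> categoriesForSave() {
        if (targetCategories == null) {
            if (csv == null) {
                throw new IllegalStateException("no target categories available");
            }
            targetCategories = csv.getTargetCategories();
        }
        return targetCategories;
    }

    public static void main(String[] args) {
        TargetCategoriesSketch params = new TargetCategoriesSketch();
        params.setTargetCategories(Arrays.asList("yes", "no"));
        // No CSV factory was set, yet this does not throw a NullPointerException.
        System.out.println(params.categoriesForSave());
    }
}
```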



[jira] [Updated] (MAHOUT-928) Add the ARFF data loader/converter on DF

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-928:
--


Please update the JIRA issue; I am pushing this to the backlog for now.

> Add the ARFF data loader/converter on DF
> 
>
> Key: MAHOUT-928
> URL: https://issues.apache.org/jira/browse/MAHOUT-928
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Reporter: Ikumasa Mukai
>Assignee: Ikumasa Mukai
>  Labels: DecisionForest
>



[jira] [Updated] (MAHOUT-975) Bug in Gradient Machine - Computation of the gradient

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-975:
--

Assignee: Ted Dunning

> Bug in Gradient Machine  - Computation of the gradient
> --
>
> Key: MAHOUT-975
> URL: https://issues.apache.org/jira/browse/MAHOUT-975
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7
>Reporter: Christian Herta
>Assignee: Ted Dunning
> Fix For: 0.8
>
> Attachments: GradientMachine.patch
>
>
> The initialisation used to compute the gradient descent weight updates for the 
> output units appears to be wrong:
>  
> In the comment: "dy / dw is just w since  y = x' * w + b."
> This is wrong. dy/dw is x (ignoring the indices). The same initialisation is 
> done in the code.
> Check by using neural network terminology:
> The gradient machine is a specialized version of a multi layer perceptron 
> (MLP).
> In a MLP the gradient for computing the "weight change" for the output units 
> is:
> dE / dw_ij = dE / dz_i * dz_i / dw_ij with z_i = sum_j (w_ij * a_j)
> here: i index of the output layer; j index of the hidden layer
> (d stands for the partial derivatives)
> here: z_i = a_i (no squashing in the output layer)
> The special loss (cost function) is E = 1 - a_g + a_b = 1 - z_g + z_b
> with
> g: index of the output unit with target value +1 (positive class)
> b: random output unit with target value 0
> =>
> dE / dw_gj = dE/dz_g * dz_g/dw_gj = -1 * a_j (a_j: activity of the hidden 
> unit j)
> dE / dw_bj = dE/dz_b * dz_b/dw_bj = +1 * a_j (a_j: activity of the hidden 
> unit j)
> This matches what the comment would say if it were correct:
> dy / dw = x (x is here the activation of the hidden unit) * (-1) for weights to
> the output unit with target value +1.
> 
> In neural network implementations it's common to compute the gradient
> numerically for a test of the implementation. This can be done by:
> dE/dw_ij = (E(w_ij + epsilon) - E(w_ij - epsilon)) / (2 * epsilon)
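
The central-difference check above can be sketched in a few lines. The example applies it to the loss E = 1 - z_g + z_b from the description, where the expected gradients are -a_j for the weights of unit g and +a_j for the weights of unit b. The class GradientCheckSketch, the flat weight layout (weights of g first, then weights of b), and the sample values are assumptions made for illustration, not code from GradientMachine or the attached patch.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

final class GradientCheckSketch {
    /** E = 1 - z_g + z_b with z = w . a; w holds the weights of g, then those of b. */
    static double loss(double[] w, double[] a) {
        int n = a.length;
        double zg = 0.0;
        double zb = 0.0;
        for (int j = 0; j < n; j++) {
            zg += w[j] * a[j];
            zb += w[n + j] * a[j];
        }
        return 1.0 - zg + zb;
    }

    /** Analytic gradient: -a_j for the weights of g, +a_j for the weights of b. */
    static double[] analyticGradient(double[] a) {
        int n = a.length;
        double[] grad = new double[2 * n];
        for (int j = 0; j < n; j++) {
            grad[j] = -a[j];
            grad[n + j] = a[j];
        }
        return grad;
    }

    /** dE/dw_i ~ (E(w_i + epsilon) - E(w_i - epsilon)) / (2 * epsilon). */
    static double[] numericalGradient(ToDoubleFunction<double[]> e, double[] w, double epsilon) {
        double[] grad = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double saved = w[i];
            w[i] = saved + epsilon;
            double plus = e.applyAsDouble(w);
            w[i] = saved - epsilon;
            double minus = e.applyAsDouble(w);
            w[i] = saved;                      // restore the weight before moving on
            grad[i] = (plus - minus) / (2 * epsilon);
        }
        return grad;
    }

    public static void main(String[] args) {
        final double[] a = {0.5, 2.0};         // hidden-unit activations
        double[] w = {0.1, -0.3, 0.7, 0.2};
        System.out.println("numeric:  " + Arrays.toString(numericalGradient(v -> loss(v, a), w, 1e-4)));
        System.out.println("analytic: " + Arrays.toString(analyticGradient(a)));
    }
}
```

An implementation whose analytic gradient used w instead of the activations would disagree with the numeric values, which is the kind of error this test is meant to catch.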



[jira] [Updated] (MAHOUT-928) Add the ARFF data loader/converter on DF

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-928:
--

Assignee: Ikumasa Mukai

> Add the ARFF data loader/converter on DF
> 
>
> Key: MAHOUT-928
> URL: https://issues.apache.org/jira/browse/MAHOUT-928
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Reporter: Ikumasa Mukai
>Assignee: Ikumasa Mukai
>  Labels: DecisionForest
>
> ArffDataLoader, ArffData and ArffInvalidFormatException were written to check 
> the model in MAHOUT-840.
> I think this feature improves usability (currently we have to remove the 
> headers from ARFF data before using Mahout).



[jira] [Updated] (MAHOUT-833) Make conversion to sequence files map-reduce

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-833:
--

Assignee: Josh Patterson

> Make conversion to sequence files map-reduce
> 
>
> Key: MAHOUT-833
> URL: https://issues.apache.org/jira/browse/MAHOUT-833
> Project: Mahout
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.7
>Reporter: Grant Ingersoll
>Assignee: Josh Patterson
>  Labels: MAHOUT_INTRO_CONTRIBUTE
> Fix For: 0.8
>
> Attachments: MAHOUT-833-final.patch, MAHOUT-833.patch
>
>
> Given input that is on HDFS, the SequenceFilesFrom.java classes should be 
> able to do their work in parallel.



[jira] [Commented] (MAHOUT-1194) Allow to change java target version during the build

2013-06-02 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672612#comment-13672612
 ] 

Robin Anil commented on MAHOUT-1194:


No, thank you for contributing.

> Allow to change java target version during the build
> 
>
> Key: MAHOUT-1194
> URL: https://issues.apache.org/jira/browse/MAHOUT-1194
> Project: Mahout
>  Issue Type: Task
>Reporter: Jarek Jarcec Cecho
>Assignee: Jarek Jarcec Cecho
>Priority: Minor
> Attachments: bugMAHOUT-1194.patch
>
>
> It seems that the current build has a hard-coded Java target of JDK6. I think 
> that it would be useful to parametrise that, so that it can be easily 
> overridden on the command line.



[jira] [Updated] (MAHOUT-992) Audit DistributedCache use to support EMR

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-992:
--

Assignee: Matteo Riondato  (was: tom pierce)

> Audit DistributedCache use to support EMR
> -
>
> Key: MAHOUT-992
> URL: https://issues.apache.org/jira/browse/MAHOUT-992
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.6
>Reporter: tom pierce
>Assignee: Matteo Riondato
>Priority: Minor
>  Labels: newbie
> Fix For: 0.8
>
>
> Apparently some of our DistributedCache use is not EMR-safe.  It would be 
> great if someone could audit our uses of DC, and fix up this problem where it 
> exists.
> For an example of problematic usage (and the fix), see MAHOUT-980.  



[jira] [Updated] (MAHOUT-992) Audit DistributedCache use to support EMR

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-992:
--

Assignee: tom pierce

> Audit DistributedCache use to support EMR
> -
>
> Key: MAHOUT-992
> URL: https://issues.apache.org/jira/browse/MAHOUT-992
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.6
>Reporter: tom pierce
>Assignee: tom pierce
>Priority: Minor
>  Labels: newbie
> Fix For: 0.8
>
>
> Apparently some of our DistributedCache use is not EMR-safe.  It would be 
> great if someone could audit our uses of DC, and fix up this problem where it 
> exists.
> For an example of problematic usage (and the fix), see MAHOUT-980.  



[jira] [Updated] (MAHOUT-1194) Allow to change java target version during the build

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1194:
---

Assignee: Jarek Jarcec Cecho  (was: Robin Anil)

> Allow to change java target version during the build
> 
>
> Key: MAHOUT-1194
> URL: https://issues.apache.org/jira/browse/MAHOUT-1194
> Project: Mahout
>  Issue Type: Task
>Reporter: Jarek Jarcec Cecho
>Assignee: Jarek Jarcec Cecho
>Priority: Minor
> Attachments: bugMAHOUT-1194.patch
>
>
> It seems that the current build has a hard-coded Java target of JDK6. I think 
> that it would be useful to parametrise that, so that it can be easily 
> overridden on the command line.



[jira] [Updated] (MAHOUT-1224) Add the option of running a StreamingKMeans pass in the Reducer before BallKMeans

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1224:
---

Assignee: Dan Filimon

> Add the option of running a StreamingKMeans pass in the Reducer before 
> BallKMeans
> -
>
> Key: MAHOUT-1224
> URL: https://issues.apache.org/jira/browse/MAHOUT-1224
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Dan Filimon
>Assignee: Dan Filimon
>
> Sometimes, the number of points passed to the reducer from the mappers in the 
> StreamingKMeansDriver job is too large to fit into memory.
> In that case, applying another StreamingKMeans pass can collapse the mapper 
> intermediate clusters to a more manageable size to be clustered.



[jira] [Updated] (MAHOUT-1154) Implementing Streaming KMeans

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1154:
---

Assignee: Dan Filimon

> Implementing Streaming KMeans
> -
>
> Key: MAHOUT-1154
> URL: https://issues.apache.org/jira/browse/MAHOUT-1154
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Dan Filimon
>Assignee: Dan Filimon
> Fix For: 0.8
>
>
> An implementation of Streaming KMeans as mentioned in [1] is available here 
> [2].
> [1]http://mail-archives.apache.org/mod_mbox/mahout-dev/201303.mbox/%3ccaowb3goyf9zufrgxhsucpkjxk6cw0nnr8gwg__jsey+kvab...@mail.gmail.com%3E
> [2] https://github.com/dfilimon/mahout
> Since there will be more than one patch, there will be specific JIRA issues 
> that address each one.
> The description of the code being added is:
> The main classes are in o.a.m.clustering.streaming [3], under the
> core/ project. These are subdivided into 2 packages:
> - cluster: contains the BallKMeans and StreamingKMeans classes that
> can be used standalone.
>   BallKMeans is exactly what it sounds like (uses k-means++ for the
> initialization, then does a normal k-means pass while ignoring
> outliers).
>   StreamingKMeans implements the online clustering that doesn't return
> exactly k clusters (it returns an estimate). This is used to
> approximate the data.
> - mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
> StreamingKMeansMapper and StreamingKMeansReducer classes.
>   CentroidWritable serializes Centroids (sort of like AbstractCluster).
>   StreamingKMeansDriver provides the driver for the job.
>   StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
> sketches of the data for the reducer.
>   StreamingKMeansReducer collects the centroids produced by the
> mappers into one set of weighted points and runs BallKMeans on them
> producing the final results.
> Additionally the searchers are in o.a.m.math.neighborhood
> - neighborhood: various searcher classes that implement nearest-neighbor
> search using different strategies.
>   Searcher, UpdatableSearcher: abstract classes that define how to
> search through collections of vectors.
>   BruteSearch: does a brute search (looks at every point...)
>   ProjectionSearch: uses random projections for searching.
>   FastProjectionSearch: also uses random projections (but not binary
> search trees as in ProjectionSearch).
>   HashedVector, LocalitySensitiveHashSearch: implement locality
> sensitive hash search.
> All the tools that I used are in o.a.m.clustering.streaming [4], under
> the examples/ project.
> There are a bunch of classes here, covering everything from
> vectorizing 20 newsgroups data to various IO utils. The more important
> ones are:
>   utils.ExperimentUtils: convenience methods.
>   tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.
> [3] 
> https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
> [4] 
> https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming
> The relevant issues are:
> - MAHOUT-1155 (Centroid, WeightedVector)
> - MAHOUT-1156 (searchers)
> - MAHOUT-1162 (clustering, non map-reduce)
> - MAHOUT-1181 (map-reduce, command-line changes, pom.xml)
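
To make the BruteSearch entry above concrete, here is a minimal Python sketch of what a brute-force nearest-neighbor search does: it measures the distance from the query to every stored point and keeps the closest ones. The function name `brute_search`, its `limit` parameter and the use of Euclidean distance are assumptions for illustration; this is not the o.a.m.math.neighborhood API.

```python
import math


def brute_search(points, query, limit=1):
    """Return the `limit` nearest (point, distance) pairs, closest first."""
    # Looks at every point: O(n) distance computations per query.
    scored = sorted((math.dist(p, query), i) for i, p in enumerate(points))
    return [(points[i], d) for d, i in scored[:limit]]


points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
nearest = brute_search(points, (0.9, 1.2), limit=2)
```

The projection-based and hashing-based searchers listed above exist to avoid this full scan on large collections, typically trading some exactness for speed.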



[jira] [Updated] (MAHOUT-918) Implement SGD based classifiers using MapReduce

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-918:
--

Fix Version/s: Backlog
 Assignee: Ted Dunning

A lot of good progress on the review board and then silence. [~issaymk2] Can 
you revive this and work on it for the next release?

> Implement SGD based classifiers using MapReduce
> ---
>
> Key: MAHOUT-918
> URL: https://issues.apache.org/jira/browse/MAHOUT-918
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.6
>Reporter: issei yoshida
>Assignee: Ted Dunning
> Fix For: Backlog
>
> Attachments: design.pdf, MAHOUT-918.patch
>
>
> Implement SGD based classifiers (Logistic Regression, Adaptive Logistic 
> Regression and Passive-Aggressive) using MapReduce.
> They are implemented using the Iterative Parameter Mixtures algorithm, which 
> is described in the following papers.
> http://research.google.com/pubs/pub36948.html
> http://aclweb.org/anthology-new/N/N10/N10-1069.pdf
> http://books.nips.cc/papers/files/nips22/NIPS2009_0345.pdf
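
As a rough illustration of the iterative parameter mixture idea mentioned above, the Python sketch below trains a logistic regression with SGD on each data shard independently, then averages the per-shard weights and repeats. It is not the code from the attached patch: the function names, the toy shards, the learning rate and the use of a uniform average as the mixture are all assumptions.

```python
import math


def sgd_epoch(weights, shard, learning_rate):
    """One pass of logistic-regression SGD over a shard, starting from `weights`."""
    w = list(weights)
    for x, y in shard:
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for j, xj in enumerate(x):
            w[j] += learning_rate * (y - p) * xj
    return w


def iterative_parameter_mixture(shards, dim, epochs=50, learning_rate=0.5):
    mixed = [0.0] * dim
    for _ in range(epochs):
        # "Map": every shard trains independently from the same mixed weights.
        local = [sgd_epoch(mixed, shard, learning_rate) for shard in shards]
        # "Reduce": uniform average of the per-shard weights (an assumption).
        mixed = [sum(w[j] for w in local) / len(local) for j in range(dim)]
    return mixed


# Each example is ([bias, value], label); the label is 1 when value > 0.
shards = [
    [([1.0, -2.0], 0), ([1.0, 1.0], 1)],
    [([1.0, -1.0], 0), ([1.0, 2.0], 1)],
]
mixed = iterative_parameter_mixture(shards, dim=2)
```

In a MapReduce setting each shard would correspond to one mapper's input split and the averaging step to the reducer, with one job per epoch.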



[jira] [Commented] (MAHOUT-833) Make conversion to sequence files map-reduce

2013-06-02 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672606#comment-13672606
 ] 

Robin Anil commented on MAHOUT-833:
---

Josh, it's an old patch so naturally it didn't apply. Would you be able to 
update it by next week? I will leave this on the 0.8 path. It's good to have in 
the release.

> Make conversion to sequence files map-reduce
> 
>
> Key: MAHOUT-833
> URL: https://issues.apache.org/jira/browse/MAHOUT-833
> Project: Mahout
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.7
>Reporter: Grant Ingersoll
>  Labels: MAHOUT_INTRO_CONTRIBUTE
> Attachments: MAHOUT-833-final.patch, MAHOUT-833.patch
>
>
> Given input that is on HDFS, the SequenceFilesFrom.java classes should be 
> able to do their work in parallel.



[jira] [Updated] (MAHOUT-833) Make conversion to sequence files map-reduce

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-833:
--

Fix Version/s: 0.8

> Make conversion to sequence files map-reduce
> 
>
> Key: MAHOUT-833
> URL: https://issues.apache.org/jira/browse/MAHOUT-833
> Project: Mahout
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.7
>Reporter: Grant Ingersoll
>  Labels: MAHOUT_INTRO_CONTRIBUTE
> Fix For: 0.8
>
> Attachments: MAHOUT-833-final.patch, MAHOUT-833.patch
>
>
> Given input that is on HDFS, the SequenceFilesFrom.java classes should be 
> able to do their work in parallel.



[jira] [Updated] (MAHOUT-932) RandomForest quits with ArrayIndexOutOfBoundsException while running sample

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-932:
--

Fix Version/s: Backlog

If anyone has a new update or a patch, feel free to bring it back to the 0.8 queue.

> RandomForest quits with ArrayIndexOutOfBoundsException while running sample
> ---
>
> Key: MAHOUT-932
> URL: https://issues.apache.org/jira/browse/MAHOUT-932
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.6
> Environment: Mac OS X, current Mac OS shipped Java version, latest 
> checkout from 17.12.2011
> Dual Core MacBook Pro 2009, 8 Gb, SSD
>Reporter: Berttenfall M.
>Priority: Minor
>  Labels: Classifier, DecisionForest, RandomForest
> Fix For: Backlog
>
>
> Hello,
> when running the example under 
> https://cwiki.apache.org/MAHOUT/partial-implementation.html with the 
> recommended data sets several issues occur.
> First: ARFF files seem to be no longer supported; I've been using the UCI 
> format as recommended here 
> (https://cwiki.apache.org/MAHOUT/breiman-example.html). Using ARFF files, 
> Mahout quits when creating the description file (wrong number of attributes 
> in the string); using the UCI format it works.
> The main error happens during the BuildForest step (I could not test 
> TestForest, due to the missing tree).
> Running:
> $MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.df.mapreduce.BuildForest 
> -Dmapred.max.split.size=1874231 -d convertedData/data.data -ds KDDTrain+.info 
> -sl 5 -p -t 100 -o nsl-forest.
> I tested different split.size values. 1874231, 187423, 18742 give the 
> following error. 1874 does not finish on my machine (Dual Core MacBook Pro 
> 2009, 8 Gb, SSD).
> It quits after a while (map is almost done) with the following message:
> 11/12/17 16:23:24 INFO mapred.Task: Task 'attempt_local_0001_m_000998_0' done.
> 11/12/17 16:23:24 INFO mapred.Task: Task:attempt_local_0001_m_000999_0 is 
> done. And is in the process of commiting
> 11/12/17 16:23:24 INFO mapred.LocalJobRunner: 
> 11/12/17 16:23:24 INFO mapred.Task: Task attempt_local_0001_m_000999_0 is 
> allowed to commit now
> 11/12/17 16:23:24 INFO output.FileOutputCommitter: Saved output of task 
> 'attempt_local_0001_m_000999_0' to 
> file:/Users/martin/Documents/Studium/Master/LargeScaleProcessing/Repository/mahout_algorithms_evaluation/testingRandomForests/nsl-forest
> 11/12/17 16:23:27 INFO mapred.LocalJobRunner: 
> 11/12/17 16:23:27 INFO mapred.Task: Task 'attempt_local_0001_m_000999_0' done.
> 11/12/17 16:23:28 INFO mapred.JobClient:  map 100% reduce 0%
> 11/12/17 16:23:28 INFO mapred.JobClient: Job complete: job_local_0001
> 11/12/17 16:23:28 INFO mapred.JobClient: Counters: 8
> 11/12/17 16:23:28 INFO mapred.JobClient:   File Output Format Counters 
> 11/12/17 16:23:28 INFO mapred.JobClient: Bytes Written=41869032
> 11/12/17 16:23:28 INFO mapred.JobClient:   FileSystemCounters
> 11/12/17 16:23:28 INFO mapred.JobClient: FILE_BYTES_READ=37443033225
> 11/12/17 16:23:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=44946910704
> 11/12/17 16:23:28 INFO mapred.JobClient:   File Input Format Counters 
> 11/12/17 16:23:28 INFO mapred.JobClient: Bytes Read=20478569
> 11/12/17 16:23:28 INFO mapred.JobClient:   Map-Reduce Framework
> 11/12/17 16:23:28 INFO mapred.JobClient: Map input records=125973
> 11/12/17 16:23:28 INFO mapred.JobClient: Spilled Records=0
> 11/12/17 16:23:28 INFO mapred.JobClient: Map output records=10
> 11/12/17 16:23:28 INFO mapred.JobClient: SPLIT_RAW_BYTES=215000
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 100
>   at 
> org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.processOutput(PartialBuilder.java:126)
>   at 
> org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.parseOutput(PartialBuilder.java:89)
>   at 
> org.apache.mahout.classifier.df.mapreduce.Builder.build(Builder.java:303)
>   at 
> org.apache.mahout.classifier.df.mapreduce.BuildForest.buildForest(BuildForest.java:201)
>   at 
> org.apache.mahout.classifier.df.mapreduce.BuildForest.run(BuildForest.java:163)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at 
> org.apache.mahout.classifier.df.mapreduce.BuildForest.main(BuildForest.java:225)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>   

[jira] [Updated] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-953:
--

Fix Version/s: (was: 0.8)
   Backlog

Bring it back to the 0.8 queue if anyone is willing to do the work within the 
next week.

> ArffVectorIterable does not gracefully handle duplicate attribute name
> --
>
> Key: MAHOUT-953
> URL: https://issues.apache.org/jira/browse/MAHOUT-953
> Project: Mahout
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.6
>Reporter: Stuart Smith
>Priority: Trivial
> Fix For: Backlog
>
>
> If you have duplicate attribute names in your ARFF file, and you have 
> non-sparse arff vectors, ARFFVectorIterable.computeNext will throw an 
> ArrayIndexOutOfBoundsException, as it allocates a DenseVector with the size 
> of your attribute labels (duplicates removed), but your arff vectors could 
> have more values (if they reference the attribute at both indexes). This is a 
> somewhat pathological ARFF file.
> Not sure if I should note the error (throw an exception) in computeNext() 
> when it's out of bounds, or when someone tries to add a duplicate label to the 
> MapBackedArffModel.
> My first impulse would be to check in computeNext(), but addLabel() in 
> MapBackedArffModel will do something rather pathological in the case of 
> duplicate attributes: it overwrites the Label map with the new index, but the 
> idxLabel map will hold a mapping from both indexes to the attribute name, so 
> they are out of sync. So it may be best to disallow duplicate attribute names 
> altogether (throwing an IllegalArgumentException).
> For example
> @attribute my_attribute NUMERIC
> @attribute my_attribute NUMERIC
> addLabel()
> addLabel()
> labelBindings -> ('my_attribute', 1)
> idxLabel -> (0, 'my_attribute'), (1, 'my_attribute')
> I'll happily submit a patch, just wondering if it should be in computeNext() 
> or addLabel()
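
The addLabel() option discussed above can be sketched as follows. This is a Python stand-in rather than the Java MapBackedArffModel: the class name `LabelModel`, its field names and the use of `ValueError` in place of IllegalArgumentException are assumptions made for illustration. The point is only that rejecting the duplicate at insertion time keeps the two maps consistent.

```python
class LabelModel:
    """Hypothetical stand-in for the two label maps described above."""

    def __init__(self):
        self.label_bindings = {}  # attribute name -> index
        self.idx_label = {}       # index -> attribute name

    def add_label(self, label, idx):
        # Refuse a duplicate name up front so the two maps cannot drift apart.
        if label in self.label_bindings:
            raise ValueError(
                f"duplicate attribute name '{label}' at index {idx}; "
                f"already bound to index {self.label_bindings[label]}"
            )
        self.label_bindings[label] = idx
        self.idx_label[idx] = label


model = LabelModel()
model.add_label("my_attribute", 0)
try:
    model.add_label("my_attribute", 1)
    rejected = False
except ValueError:
    rejected = True
```

Failing in addLabel() reports the problem while the header is being parsed, before any vector is allocated, whereas a check in computeNext() would only fire once a data row is read.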



[jira] [Commented] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name

2013-06-02 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672599#comment-13672599
 ] 

Robin Anil commented on MAHOUT-953:
---

Any takers for this for 0.8 ? If not I will assign this to the backlog.

> ArffVectorIterable does not gracefully handle duplicate attribute name
> --
>
> Key: MAHOUT-953
> URL: https://issues.apache.org/jira/browse/MAHOUT-953
> Project: Mahout
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.6
>Reporter: Stuart Smith
>Priority: Trivial
> Fix For: 0.8
>
>
> If you have duplicate attribute names in your ARFF file, and you have 
> non-sparse arff vectors, ARFFVectorIterable.computeNext will throw an 
> ArrayIndexOutOfBoundsException, as it allocates a DenseVector with the size 
> of your attribute labels (duplicates removed), but your arff vectors could 
> have more values (if they reference the attribute at both indexes). This is a 
> somewhat pathological ARFF file.
> Not sure if I should note the error (throw an exception) in computeNext() 
> when it's out of bounds, or when someone tries to add a duplicate label to the 
> MapBackedArffModel.
> My first impulse would be to check in computeNext(), but addLabel() in 
> MapBackedArffModel will do something rather pathological in the case of 
> duplicate attributes: it overwrites the Label map with the new index, but the 
> idxLabel map will hold a mapping from both indexes to the attribute name, so 
> they are out of sync. So it may be best to disallow duplicate attribute names 
> altogether (throwing an IllegalArgumentException).
> For example
> @attribute my_attribute NUMERIC
> @attribute my_attribute NUMERIC
> addLabel()
> addLabel()
> labelBindings -> ('my_attribute', 1)
> idxLabel -> (0, 'my_attribute'), (1, 'my_attribute')
> I'll happily submit a patch, just wondering if it should be in computeNext() 
> or addLabel()



[jira] [Updated] (MAHOUT-943) Improbe the way to make the split point on DF.

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-943:
--

Fix Version/s: Backlog
 Assignee: Deneche A. Hakim

[~adeneche] can you see if this can go in and/or resolve it appropriately

> Improbe the way to make the split point on DF.
> --
>
> Key: MAHOUT-943
> URL: https://issues.apache.org/jira/browse/MAHOUT-943
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Reporter: Ikumasa Mukai
>Assignee: Deneche A. Hakim
>  Labels: DecisionForest
> Fix For: Backlog
>
> Attachments: MAHOUT-943.patch
>
>
> numericalSplit() in OptIgSplit uses the attribute value with the best 
> information gain (IG) as the split point.
> I think this is a little too strict; in some situations it is better to use 
> the average of the best-IG value and the 2nd value.
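
The proposal above can be illustrated with a small Python sketch. It is not the code from MAHOUT-943.patch: the function names are hypothetical, and reading "the 2nd value" as the candidate with the second-best information gain is an assumption. Candidates are the distinct attribute values, and a split sends rows with value < threshold to one side.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """IG of splitting into value < threshold and value >= threshold."""
    low = [l for v, l in zip(values, labels) if v < threshold]
    high = [l for v, l in zip(values, labels) if v >= threshold]
    if not low or not high:
        return 0.0
    weighted = (len(low) * entropy(low) + len(high) * entropy(high)) / len(labels)
    return entropy(labels) - weighted


def averaged_split(values, labels):
    """Return (best-IG value, average of the best and second-best IG values)."""
    candidates = sorted(set(values))
    ranked = sorted(
        candidates, key=lambda t: information_gain(values, labels, t), reverse=True
    )
    best, second = ranked[0], ranked[1]
    return best, (best + second) / 2.0


values = [1, 2, 3, 10, 11]
labels = [0, 0, 0, 1, 1]
best, split = averaged_split(values, labels)
```

On this toy data the best-IG value is 10 and the runner-up is 3, so the averaged split point is 6.5. It separates the training rows exactly as 10 does, but it no longer sits directly on a training value, which is the looser behaviour the description asks for.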



[jira] [Updated] (MAHOUT-966) Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-966:
--

Assignee: Grant Ingersoll

> Mismatch in the number of points given by the clusterDumper and 
> ClusterOutputPostProcessor
> --
>
> Key: MAHOUT-966
> URL: https://issues.apache.org/jira/browse/MAHOUT-966
> Project: Mahout
>  Issue Type: Bug
>  Components: Integration
>Affects Versions: 0.6
> Environment: hadoop 0.20.2 mahout 0.6 
>Reporter: Gaurav Redkar
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: cluster-dumper-output.txt, clusterpp-output.txt, 
> mtestdata.txt, points100dCCNorm.txt
>
>
>  After running the post processor, the number of points that each cluster 
> contains does not match the number of points each cluster should contain as 
> stated by clusterdumper.
>  
> MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
> MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
> The n mentioned in clusters-n-final against each cluster is different from 
> the number of points actually contained in the directory for each cluster. Any 
> idea why this is happening?



[jira] [Updated] (MAHOUT-961) Modify the Tree/Forest Visualizer on DF.

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-961:
--

Assignee: Sebastian Schelter

> Modify the Tree/Forest Visualizer on DF.
> 
>
> Key: MAHOUT-961
> URL: https://issues.apache.org/jira/browse/MAHOUT-961
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ikumasa Mukai
>Assignee: Sebastian Schelter
>  Labels: RandomForest
> Fix For: 0.8
>
> Attachments: MAHOUT-961.patch, MAHOUT-961.patch, MAHOUT-961.patch
>
>
> The Tree/Forest visualizer (MAHOUT-926) has problems.
> 1) an un-complemented stem which has no leaf or node is shown.
> 2) all stems are not shown when the data doesn't have all categories.



[jira] [Updated] (MAHOUT-968) Classifier based on restricted boltzmann machines

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-968:
--

Fix Version/s: (was: 0.8)
   Backlog
 Assignee: Robin Anil

I can be a reviewer if you are willing to work on it 

As I see it now, it requires
A) A lot of stylistic cleanup.
B) A lot of code structuring cleanup (no typecasting please)
C) Tests. 



> Classifier based on restricted boltzmann machines
> -
>
> Key: MAHOUT-968
> URL: https://issues.apache.org/jira/browse/MAHOUT-968
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.7
>Reporter: Dirk Weißenborn
>Assignee: Robin Anil
>  Labels: classification, mnist
> Fix For: Backlog
>
> Attachments: MAHOUT-968.patch, MAHOUT-968.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> This is a proposal for a new classifier based on restricted Boltzmann 
> machines. The development of this feature follows the paper on "Deep 
> Boltzmann Machines" (DBM) [1] from 2009. The proposed model (DBM) got an 
> error rate of 0.95% on the MNIST dataset [2], which is really good. Main 
> parts of the implementation should also be applicable to scenarios other than 
> classification where restricted Boltzmann machines are used (ref. MAHOUT-375).
> I am working on this feature right now, and the results are promising. The 
> only problem with the training algorithm is that it is still mostly 
> sequential (if training batches are small, which they should be), which so 
> far makes Map/Reduce not really beneficial. However, since the algorithm 
> itself is fast (for a training algorithm), training can be done on a single 
> machine in manageable time.
> Testing of the algorithm is currently done on the MNIST dataset itself to 
> reproduce the results of [1]. As soon as the results indicate that everything 
> is working fine, I will upload the patch.
> [1] http://www.cs.toronto.edu/~hinton/absps/dbm.pdf
> [2] http://yann.lecun.com/exdb/mnist/



[jira] [Updated] (MAHOUT-976) Implement Multilayer Perceptron

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-976:
--

Assignee: Ted Dunning

Marking Ted as the assignee, who is the best reviewer for this.

> Implement Multilayer Perceptron
> ---
>
> Key: MAHOUT-976
> URL: https://issues.apache.org/jira/browse/MAHOUT-976
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.7
>Reporter: Christian Herta
>Assignee: Ted Dunning
>Priority: Minor
>  Labels: multilayer, networks, neural, perceptron
> Attachments: MAHOUT-976.patch, MAHOUT-976.patch, MAHOUT-976.patch, 
> MAHOUT-976.patch
>
>   Original Estimate: 80h
>  Remaining Estimate: 80h
>
> Implement a multi layer perceptron
>  * via Matrix Multiplication
>  * Learning by Backpropagation; implementing tricks by Yann LeCun et al.: 
> "Efficient BackProp"
>  * arbitrary number of hidden layers (also 0  - just the linear model)
>  * connection between proximate layers only 
>  * different cost and activation functions (different activation function in 
> each layer) 
>  * test of backprop by gradient checking 
>  * normalization of the inputs (storeable) as part of the model
>  
> First:
>  * implementation of "stochastic gradient descent" like the gradient machine
>  * simple gradient descent incl. momentum
> Later (new jira issues):  
>  * Distributed Batch learning (see below)  
>  * "Stacked (Denoising) Autoencoder" - Feature Learning
>  * advanced cost minimization like 2nd order methods, conjugate gradient etc.
> Distribution of learning can be done by (batch learning):
>  1 Partitioning of the data into x chunks 
>  2 Learning the weight changes as matrices in each chunk
>  3 Combining the matrices and updating the weights - back to 2
> Maybe this procedure can be done with random parts of the chunks (distributed 
> quasi online learning). 
> Batch learning with delta-bar-delta heuristics for adapting the learning 
> rates.
>  
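
The three batch-learning steps listed above can be sketched in Python as follows. To keep the example short it uses the case with 0 hidden layers (just the linear model) and a squared-error cost, so the per-chunk "weight change" is a plain gradient vector instead of a set of matrices. The function names, the round-robin partitioning, the learning rate and the toy data are assumptions for illustration, not part of the attached patches.

```python
def chunk_weight_change(weights, chunk):
    """Step 2: summed squared-error gradient for one chunk of (x, target) pairs."""
    delta = [0.0] * len(weights)
    for x, target in chunk:
        error = sum(w * xi for w, xi in zip(weights, x)) - target
        for j, xi in enumerate(x):
            delta[j] += error * xi
    return delta


def train(data, num_chunks, learning_rate=0.5, epochs=200):
    weights = [0.0] * len(data[0][0])
    # Step 1: partition the data into chunks (round-robin here).
    chunks = [data[i::num_chunks] for i in range(num_chunks)]
    for _ in range(epochs):
        # Step 2: each chunk could be processed by a separate worker.
        deltas = [chunk_weight_change(weights, chunk) for chunk in chunks]
        # Step 3: combine the per-chunk changes and update the weights.
        for j in range(len(weights)):
            total = sum(delta[j] for delta in deltas)
            weights[j] -= learning_rate * total / len(data)
    return weights


# Toy data generated from y = 2 * x1 - 1 * x2 (no noise).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
weights = train(data, num_chunks=2)
single = train(data, num_chunks=1)
```

Because the per-chunk changes are summed before the update, the result matches ordinary batch gradient descent up to floating-point rounding regardless of how the data is partitioned; only step 2 needs to run in parallel.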



[jira] [Updated] (MAHOUT-992) Audit DistributedCache use to support EMR

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-992:
--

Assignee: (was: tom pierce)

Didn't notice Matteo's update. It's yours; is it possible to clean it up for 
this release?

> Audit DistributedCache use to support EMR
> -
>
> Key: MAHOUT-992
> URL: https://issues.apache.org/jira/browse/MAHOUT-992
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.6
>Reporter: tom pierce
>Priority: Minor
>  Labels: newbie
> Fix For: 0.8
>
>
> Apparently some of our DistributedCache use is not EMR-safe.  It would be 
> great if someone could audit our uses of DC, and fix up this problem where it 
> exists.
> For an example of problematic usage (and the fix), see MAHOUT-980.  



[jira] [Updated] (MAHOUT-1022) Process Mining Algorithm Example in Mahout

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1022:
---

Fix Version/s: Backlog

John, Please file a patch/reviewboard when your code is ready for review.

> Process Mining Algorithm Example in Mahout 
> ---
>
> Key: MAHOUT-1022
> URL: https://issues.apache.org/jira/browse/MAHOUT-1022
> Project: Mahout
>  Issue Type: New Feature
>Reporter: John Leach
>Priority: Minor
> Fix For: Backlog
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> We currently have been implementing process mining algorithms in Hadoop for 
> generating business processes based on event logs.  Here is an example white 
> paper with some of the algorithms defined.
> http://cms.ieis.tue.nl/Beta/Files/WorkingPapers/Beta_wp166.pdf
> Our goal is to add a process mining example to the Mahout implementation for 
> others that might be interested in reverse engineering processes from event 
> logs.



[jira] [Updated] (MAHOUT-992) Audit DistributedCache use to support EMR

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-992:
--

Assignee: tom pierce

Tom, it's all yours.

> Audit DistributedCache use to support EMR
> -
>
> Key: MAHOUT-992
> URL: https://issues.apache.org/jira/browse/MAHOUT-992
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.6
>Reporter: tom pierce
>Assignee: tom pierce
>Priority: Minor
>  Labels: newbie
> Fix For: 0.8
>
>
> Apparently some of our DistributedCache use is not EMR-safe.  It would be 
> great if someone could audit our uses of DC, and fix up this problem where it 
> exists.
> For an example of problematic usage (and the fix), see MAHOUT-980.  



[jira] [Updated] (MAHOUT-996) Support NamedVectors in arff.vector job by convention

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-996:
--

Assignee: Sebastian Schelter

> Support NamedVectors in arff.vector job by convention
> -
>
> Key: MAHOUT-996
> URL: https://issues.apache.org/jira/browse/MAHOUT-996
> Project: Mahout
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.7
> Environment: OS X
>Reporter: Andrew Harbick
>Assignee: Sebastian Schelter
>Priority: Minor
> Fix For: 0.8
>
> Attachments: forillustration.patch
>
>
> If you do something like:
> MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout arff.vector --input $PWD/file.arff 
> --dictOut file.bindings --output $PWD
> MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout kmeans --input $PWD/file.arff.mvc 
> --clusters $PWD/output/file.clusters --output $PWD/output --numClusters 3 
> --maxIter 1000 --clustering
> MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir 
> $PWD/output/clusters-*-final --pointsDir $PWD/output/clusteredPoints --output 
> $PWD/output/clusteranalyze.txt
> Currently you don't get any information out of clusterdump that helps you 
> identify which element from your source data is in which cluster.
> I did a patch for illustration of using an attribute (by convention) from 
> the ARFF file as the name for a NamedVector.  The result of clusterdump is 
> much easier to use:
> VL-18589{n=6165 c=[1.376, 879.144, 3.947, 10.691, 0.874, 1.266, 16.644, 
> 9.689, 2.207, 1.855] r=[0.484, 160.571, 1.959, 6.176, 0.551, 0.442, 34.125, 
> 7.953, 1.988, 0.352]}
> Weight : [props - optional]:  Point:
> 1.0: 4ee342afd04516354c000140 = [1.000, 597.000, 7.000, 7.000, 1.000, 
> 1.000, 11.000, 12.000, 6.000, 2.000]
> 1.0: 4ee49257eb8b3e28c60025a2 = [1.000, 597.000, 1.000, 7.000, 1.000, 
> 1.000, 8.000, 17.000, 6.000, 2.000]
> 1.0: 4ee60430ab2c714006000937 = [1.000, 597.000, 2.000, 9.000, 1.000, 
> 1.000, 21.000, 21.000, 2.000, 2.000]
> 1.0: 4ef2d580ab2c71231b0019ae = [0:1.000, 1:598.000, 2:5.000, 
> 3:3.000, 5:1.000, 6:4.000, 9:1.000]
> 1.0: 4eda14a30b5d3e655b0043e9 = [1.000, 599.000, 7.000, 8.000, 2.000, 
> 1.000, 15.000, 7.000, 3.000, 2.000]
> 1.0: 4edba62deb8b3e27e6000614 = [0:1.000, 1:599.000, 2:1.000, 
> 3:12.000, 4:1.000, 5:1.000, 6:3.000, 8:3.000, 9:2.000]
> 1.0: 4ede1ea6eb8b3e1f330050f4 = [0:1.000, 1:599.000, 2:3.000, 
> 3:9.000, 4:1.000, 5:1.000, 6:14.000, 7:20.000, 9:2.000]
> ...
> I haven't done serious Java in 15 years so the attached patch is just to 
> illustrate the idea...
> Thanks,
> Andy



[jira] [Updated] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1147:
---

Fix Version/s: (was: 0.7)
   0.8
 Assignee: Jake Mannix

Jake, this is on your side of the woods.

> CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random 
> matrix
> ---
>
> Key: MAHOUT-1147
> URL: https://issues.apache.org/jira/browse/MAHOUT-1147
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Eclipse IDE
> Java code base
> CVB0Driver Class
> setModelPaths(Job job, Path modelPath) - method
>Reporter: Jack Pay
>Assignee: Jake Mannix
>  Labels: bug, cvb, fix, suggestion
> Fix For: 0.8
>
> Attachments: MAHOUT-1147.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Problem:
> When training the doc/topic model, no paths for the term/topic model are 
> found (the lookup returns null).
> These paths are set using setModelPaths in CVB0Driver.
> Reason for Problem:
> This method is called with a variety of Job instances. 
> The Job is passed to the method instead of the Configuration object given to 
> the Job.
> The configuration is retrieved from the Job instance itself.
> I believe that this Configuration instance is a clone of the original.
> This is a problem as the variable MODEL_PATHS is set on the clone which is 
> then discarded when the given Job is complete.
> The original Configuration has no MODEL_PATHS String set and therefore 
> returns null.
> The code falls back to a new random matrix if it cannot find a model. This 
> happens every time as MODEL_PATHS is not set for the Configuration instance 
> used.
> Solution:
> Do not pass the Job to the setModelPaths method, but pass the Configuration 
> instance passed into the method which created the Job.
> i.e.
> change from:
> setModelPaths(Job job, Path modelPath)
> to:
> setModelPaths(Configuration conf, Path modelPath)
> And change all calling methods accordingly (obviously).
> So far, what little testing I have done suggests that this solves the problem.
>  
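
The failure mode described above can be reproduced without Hadoop. The sketch below assumes, as the reporter suspects, that a job keeps its own copy of the configuration it is constructed with. Conf and JobLike are hypothetical stand-ins written for this illustration, not Hadoop classes, and the MODEL_PATHS key string is made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: Conf and JobLike are hypothetical stand-ins, not Hadoop classes. */
final class ConfCopySketch {

  static final String MODEL_PATHS = "model.paths"; // hypothetical key name

  /** Minimal key/value configuration with a copy constructor. */
  static final class Conf {
    private final Map<String, String> values = new HashMap<>();

    Conf() {}

    Conf(Conf other) {
      values.putAll(other.values);
    }

    void set(String key, String value) {
      values.put(key, value);
    }

    String get(String key) {
      return values.get(key);
    }
  }

  /** Stand-in for a job that keeps a private copy of the configuration it is given. */
  static final class JobLike {
    private final Conf conf;

    JobLike(Conf original) {
      this.conf = new Conf(original);
    }

    Conf getConfiguration() {
      return conf;
    }
  }

  /** Mirrors the reported pattern: the value lands on the job's copy only. */
  static void setModelPaths(JobLike job, String modelPath) {
    job.getConfiguration().set(MODEL_PATHS, modelPath);
  }

  /** Mirrors the proposed fix: the value is set on the caller's configuration. */
  static void setModelPaths(Conf conf, String modelPath) {
    conf.set(MODEL_PATHS, modelPath);
  }

  public static void main(String[] args) {
    Conf original = new Conf();
    setModelPaths(new JobLike(original), "/tmp/model-1");
    System.out.println(original.get(MODEL_PATHS)); // null: only the copy was changed

    setModelPaths(original, "/tmp/model-1");
    // A job created afterwards copies the value along with everything else.
    System.out.println(new JobLike(original).getConfiguration().get(MODEL_PATHS));
  }
}
```

With the second overload, any job created later from the same configuration sees the model path, which matches the behaviour the proposed signature change is after.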



[jira] [Updated] (MAHOUT-1126) Mac builds won't unjar

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1126:
---

Assignee: Grant Ingersoll

> Mac builds won't unjar
> --
>
> Key: MAHOUT-1126
> URL: https://issues.apache.org/jira/browse/MAHOUT-1126
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8
> Environment: Builds on the Mac
>Reporter: Pat Ferrel
>Assignee: Grant Ingersoll
>  Labels: build
> Fix For: 0.8
>
>
> On the Mac you have to remove the licenses in the mahout jar or hadoop can't 
> unjar mahout. The Mac has a case insensitive file system and so can't tell 
> the difference between LICENSE and license. This was fixed at one point; see 
> https://issues.apache.org/jira/browse/MAHOUT-780
> zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar 
> META-INF/license/
> zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar 
> META-INF/LICENSE/
> As mentioned in 
> https://issues.apache.org/jira/browse/MAHOUT-780 
> it looks like
> mv target/maven-shared-archive-resources/META-INF/LICENSE 
> target/maven-shared-archive-resources/META-INF/LICENSES
> works too.
> Can this get a permanent fix?
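
To check a jar for this kind of clash before shipping it, one could compare entry names case-insensitively. The sketch below is an illustration only, not part of any Mahout build step; the class and method names are hypothetical, and it works on a plain list of entry names (which could be collected from java.util.jar.JarFile.entries()).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch only: finds archive entries that collide on a case-insensitive file system. */
final class CaseCollisionSketch {

  /** Returns the entry names that clash with an earlier name once case is ignored. */
  static List<String> collisions(List<String> entryNames) {
    Map<String, String> seen = new HashMap<>();
    List<String> clashes = new ArrayList<>();
    for (String name : entryNames) {
      // A file "LICENSE" and a directory "license/" occupy the same path on disk.
      String stripped = name.endsWith("/") ? name.substring(0, name.length() - 1) : name;
      String key = stripped.toLowerCase(Locale.ROOT);
      String previous = seen.putIfAbsent(key, name);
      if (previous != null && !previous.equals(name)) {
        clashes.add(name);
      }
    }
    return clashes;
  }

  public static void main(String[] args) {
    List<String> names =
        Arrays.asList("META-INF/LICENSE", "META-INF/MANIFEST.MF", "META-INF/license/");
    System.out.println(collisions(names));
  }
}
```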



[jira] [Updated] (MAHOUT-1164) Make ARFF integration generate meta-data in JSON format

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1164:
---

Assignee: Ted Dunning

> Make ARFF integration generate meta-data in JSON format
> ---
>
> Key: MAHOUT-1164
> URL: https://issues.apache.org/jira/browse/MAHOUT-1164
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, Integration
>Affects Versions: 0.7
>Reporter: Marty Kube
>Assignee: Ted Dunning
> Attachments: MAHOUT-1164.patch, MAHOUT-1164.patch, MAHOUT-1164.patch
>
>
> Add a command line option to generate meta-data in a JSON format.
> This ticket supports the larger goal of making RF classifiers consume the 
> meta-data and sequence files generated by the integration components.  
> MAHOUT-1163 makes RF classifiers read JSON meta-data.  This ticket is to get 
> the ARFF integration to generate the same JSON formatted meta-data.



[jira] [Updated] (MAHOUT-1163) Make random forest classifier meta-data file human readable

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1163:
---

Assignee: Ted Dunning

> Make random forest classifier meta-data file human readable
> ---
>
> Key: MAHOUT-1163
> URL: https://issues.apache.org/jira/browse/MAHOUT-1163
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7
>Reporter: Marty Kube
>Assignee: Ted Dunning
> Fix For: 0.8
>
> Attachments: MAHOUT-1163.patch
>
>
> The RF classifier has a Describe utility which figures out a description 
> of a data set (how many attributes, types, and enumerated values, etc...) and 
> writes this meta-data to file for later use during training or testing.
> The file format is binary.  That means the only way to generate it is with 
> the Describe utility and it is hard to modify.  If the format were human 
> readable, it would be possible to modify/generate the meta-data by hand.
> This will also make it easier to support standard formats such as ARFF.  



[jira] [Commented] (MAHOUT-1194) Allow to change java target version during the build

2013-06-02 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672554#comment-13672554
 ] 

Robin Anil commented on MAHOUT-1194:


Cool, I will do one pass later today or tomorrow and add our regular 
contributors to that list. 

> Allow to change java target version during the build
> 
>
> Key: MAHOUT-1194
> URL: https://issues.apache.org/jira/browse/MAHOUT-1194
> Project: Mahout
>  Issue Type: Task
>Reporter: Jarek Jarcec Cecho
>Assignee: Robin Anil
>Priority: Minor
> Attachments: bugMAHOUT-1194.patch
>
>
> It seems that the current build has a hard-coded Java target of JDK6. I think 
> that it would be useful to parametrise that, so that it can be easily 
> overridden on the command line.



[jira] [Updated] (MAHOUT-1231) "No input clusters found in " error in kmeans

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1231:
---

Affects Version/s: (was: Backlog)
Fix Version/s: Backlog

Not Blocking the release. Seems like a user-help issue

> "No input clusters found in " error in kmeans
> -
>
> Key: MAHOUT-1231
> URL: https://issues.apache.org/jira/browse/MAHOUT-1231
> Project: Mahout
>  Issue Type: Question
>  Components: Clustering
>Reporter: Summer Lee
> Fix For: Backlog
>
>
> 1.seqdirectory
> > mahout seqdirectory --input /user/hdfs/input/new1.csv --output
> > /user/hdfs/new1/seqdirectory --tempDir
> > /user/hdfs/new1/seqdirectory/tempDir
> 2.seq2sparse 
> > mahout seq2sparse --input /user/hdfs/new1/seqdirectory --output
> > /user/hdfs/new1/seq2sparse -wt tfidf
> 3.kmeans 
> > mahout kmeans --input /user/hdfs/new1/seq2sparse/tfidf-vectors
> > --output /user/hdfs/new1/kmeans -c /user/hdfs/new1/clusters/kmeans -x 3 -k 
> > 3 --tempDir /user/hdfs/new1/kmeans/tempDir
> and then an error occurred
> Failing Oozie Launcher, Main class [org.apache.mahout.driver.MahoutDriver], 
> main() threw exception, No input clusters found in 
> /user/oozie/mahout/z3/kmeansCopy/clusters/part-randomSeed. Check your -c 
> argument.
> java.lang.IllegalStateException: No input clusters found in 
> /user/oozie/mahout/z3/kmeansCopy/clusters/part-randomSeed. Check your -c 
> argument.
>   at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:217)
>   at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:148)
>   at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:107)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:48)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:467)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Oozie Launcher failed, finishing Hadoop job gracefully
> Oozie Launcher ends
> ===
> Why can't the kmeans driver build clusters on Hadoop when run through Oozie?
> On Hadoop without Oozie, it worked.



[jira] [Updated] (MAHOUT-1220) seqdirectory brings empty files out

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1220:
---

Fix Version/s: Backlog
Affects Version/s: (was: Backlog)

> seqdirectory brings empty files out
> ---
>
> Key: MAHOUT-1220
> URL: https://issues.apache.org/jira/browse/MAHOUT-1220
> Project: Mahout
>  Issue Type: Bug
>Reporter: Summer Lee
>Priority: Minor
> Fix For: Backlog
>
>
> I ran "mahout seqdirectory" on the input file.
> --> command
> mahout seqdirectory --input 
> user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output 
> /user/hdfs/mahout_test/output/final3/seqdirectory/
> but the result file, "chunk-0", contains only this:
> --> chunk-0
> SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
> I heard that chunk-0 files should contain numbers, like 
> SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
> I thought something was wrong with my input file, so I tried several other 
> input files, but the results are the same.
> How can I fix this? 



[jira] [Updated] (MAHOUT-962) minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-962:
--

Resolution: Fixed
  Assignee: Robin Anil
Status: Resolved  (was: Patch Available)

Submitted to SVN

> minDF and maxDFPercent filtering doesnt get applied when output weight is tf 
> in SpareVecorsFromSequenceFile
> ---
>
> Key: MAHOUT-962
> URL: https://issues.apache.org/jira/browse/MAHOUT-962
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.6, 0.7, 0.8
>Reporter: John Conwell
>Assignee: Robin Anil
>Priority: Minor
>  Labels: patch
> Fix For: 0.8
>
> Attachments: mahout_962.patch
>
>
> This follows the same reasoning as the fix for MAHOUT-957.  The 
> desired output is term frequency vectors, but I want terms filtered by their 
> min and max DF values. This might be valid in LDA, where tf vectors are 
> desired for input, but filtering out the maxDFPercent is also useful.
> Currently minDF and maxDFPercent are only used when calculating tfidf, and 
> the original tf vectors are not updated to represent the term filtering.
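
The requested behaviour, document-frequency pruning applied directly to tf vectors, can be sketched as below. This is not the attached patch: term vectors are modelled as plain maps rather than Mahout vectors, the names are hypothetical, and treating both bounds as inclusive is an assumption rather than Mahout's documented semantics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: prune tf vectors by document frequency, without computing tfidf. */
final class DfFilterSketch {

  /** Builds a term-frequency map from raw tokens (helper for the example). */
  static Map<String, Integer> doc(String... terms) {
    Map<String, Integer> tf = new HashMap<>();
    for (String term : terms) {
      tf.merge(term, 1, Integer::sum);
    }
    return tf;
  }

  /** Keeps terms whose document frequency is at least minDf and at most maxDfPercent of all documents. */
  static List<Map<String, Integer>> filter(
      List<Map<String, Integer>> tfVectors, int minDf, int maxDfPercent) {
    // First pass: count in how many documents each term occurs.
    Map<String, Integer> df = new HashMap<>();
    for (Map<String, Integer> vector : tfVectors) {
      for (String term : vector.keySet()) {
        df.merge(term, 1, Integer::sum);
      }
    }
    double maxDf = tfVectors.size() * (maxDfPercent / 100.0);
    // Second pass: copy each vector, dropping terms outside the allowed range.
    List<Map<String, Integer>> filtered = new ArrayList<>();
    for (Map<String, Integer> vector : tfVectors) {
      Map<String, Integer> kept = new HashMap<>();
      for (Map.Entry<String, Integer> entry : vector.entrySet()) {
        int termDf = df.get(entry.getKey());
        if (termDf >= minDf && termDf <= maxDf) {
          kept.put(entry.getKey(), entry.getValue());
        }
      }
      filtered.add(kept);
    }
    return filtered;
  }

  public static void main(String[] args) {
    List<Map<String, Integer>> docs = Arrays.asList(
        doc("the", "mahout", "mahout"), doc("the", "mahout"), doc("the", "rare"), doc("the"));
    // "the" occurs in 100% of documents and "rare" in only one; both are dropped.
    System.out.println(filter(docs, 2, 75));
  }
}
```

The raw term counts of the surviving terms are left untouched, which is the point of the request: the output stays a tf vector, only with the vocabulary pruned.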



[jira] [Commented] (MAHOUT-1194) Allow to change java target version during the build

2013-06-02 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672531#comment-13672531
 ] 

Robin Anil commented on MAHOUT-1194:


You have to be in a JIRA contributor list for me to assign you, otherwise you 
don't pop up in the assignee list. I don't know off the top of my head where 
that is. 

> Allow to change java target version during the build
> 
>
> Key: MAHOUT-1194
> URL: https://issues.apache.org/jira/browse/MAHOUT-1194
> Project: Mahout
>  Issue Type: Task
>Reporter: Jarek Jarcec Cecho
>Assignee: Robin Anil
>Priority: Minor
> Attachments: bugMAHOUT-1194.patch
>
>
> It seems that the current build has a hard-coded Java target of JDK6. I think 
> that it would be useful to parametrise that, so that it can be easily 
> overridden on the command line.



[jira] [Updated] (MAHOUT-1194) Allow to change java target version during the build

2013-06-02 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1194:
---

Resolution: Fixed
  Assignee: Robin Anil
Status: Resolved  (was: Patch Available)

> Allow to change java target version during the build
> 
>
> Key: MAHOUT-1194
> URL: https://issues.apache.org/jira/browse/MAHOUT-1194
> Project: Mahout
>  Issue Type: Task
>Reporter: Jarek Jarcec Cecho
>Assignee: Robin Anil
>Priority: Minor
> Attachments: bugMAHOUT-1194.patch
>
>
> It seems that the current build has a hard-coded Java target of JDK6. I think 
> that it would be useful to parametrise that, so that it can be easily 
> overridden on the command line.



[jira] [Updated] (MAHOUT-884) Matrix Concatenate utility

2013-06-01 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-884:
--

Fix Version/s: (was: 0.8)
   Backlog

Not a blocker, might need some cleanup. Pushing to backlog

> Matrix Concatenate utility
> --
>
> Key: MAHOUT-884
> URL: https://issues.apache.org/jira/browse/MAHOUT-884
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Reporter: Lance Norskog
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: Backlog
>
> Attachments: MAHOUT-884.patch, MAHOUT-884.patch, MAHOUT-884.patch
>
>
> Utility to concatenate matrices stored as SequenceFiles of vectors.
> Each pair in the SequenceFile is the IntWritable row number and a 
> VectorWritable.
> The input and output files may skip rows. 



[jira] [Updated] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

2013-06-01 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-874:
--

Fix Version/s: Backlog

> Extract Writables into a separate module to allow smaller dependencies
> --
>
> Key: MAHOUT-874
> URL: https://issues.apache.org/jira/browse/MAHOUT-874
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Ted Dunning
> Fix For: Backlog
>
>
> The theory is that we can have a smaller jar if we only include writable 
> classes and their exact dependencies.
> I have a prototype, but it has some funky characteristics which I would like 
> to discuss.



[jira] [Resolved] (MAHOUT-1225) Sets and maps incorrectly clear() their state arrays (potential endless loops)

2013-06-01 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-1225.


Resolution: Fixed

> Sets and maps incorrectly clear() their state arrays (potential endless loops)
> --
>
> Key: MAHOUT-1225
> URL: https://issues.apache.org/jira/browse/MAHOUT-1225
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
> Environment: Eclipse, linux Fedora 17, Java 1.7, Mahout Maths 
> collections (Set) 0.7, hppc 0.4.3
>Reporter: Sophie Sperner
>Assignee: Dawid Weiss
>  Labels: hashset, java, mahout, test
> Fix For: 0.7
>
> Attachments: hppc-0.4.3.jar, MAHOUT-1225.patch, MAHOUT-1225.patch, 
> MAHOUT-1225.patch, mahout-math-0.8-SNAPSHOT.jar
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The code I attached hangs forever; Eclipse does not print a stack trace 
> because the program never terminates. So I decided to make a small 
> test.java file that you can easily run.
> This code has a main function that simply runs the getItemList() method, which 
> successfully executes the getDataset() method (here please download the 
> mushroom.dat dataset and set the full path in the filePath string variable) 
> and then hangs (the problem happens on the fourth columnValues.add() call). 
> After the dataset has been read into the X array, the code simply goes through 
> X column by column and searches for different items in it.
> If you uncomment IntSet columnValues = new IntOpenHashSet(); and the 
> corresponding import headers then everything will work just fine (you will 
> also need to include the hppc jar file found here 
> http://labs.carrotsearch.com/hppc.html or below in the attachment).



[jira] [Updated] (MAHOUT-1236) Need a cleaned up serialized format for Vectors to handle names and all other kinds of things

2013-06-01 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1236:
---

Fix Version/s: 1.0

+1 to protobuf tracking for the 1.0 release

> Need a cleaned up serialized format for Vectors to handle names and all other 
> kinds of things
> -
>
> Key: MAHOUT-1236
> URL: https://issues.apache.org/jira/browse/MAHOUT-1236
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ted Dunning
> Fix For: 1.0
>
>
> Our current serialization is subject to several ills
> a) it breaks alignment by having a 1 byte flag field (evil, generic)
> b) it doesn't handle any kind of extensible format like protobufs so it isn't 
> future-proof
> c) it doesn't handle named vectors very well
> d) it totally breaks with any other kind of decoration as with Centroids or 
> WeightedVector or ... (see b)
> I propose that we use the current tag byte on the current serialization with 
> a new flag bit that indicates that the vector will use a protobuf encoding.  
> Then 3 bytes will be skipped to restore alignment.  Then there will be a 
> protobuf encoding for the vector. 
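
The proposed layout (one tag byte carrying a new flag bit, three skipped bytes to restore 4-byte alignment, then the protobuf-encoded vector) can be sketched as below. This is an illustration only: the issue does not say which bit would be used, so FLAG_PROTOBUF is a made-up value, and the payload is treated as opaque bytes standing in for a real protobuf message.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Sketch only: tag byte with a "protobuf follows" bit, 3 padding bytes, then the payload. */
final class VectorHeaderSketch {

  static final int FLAG_PROTOBUF = 0x40; // hypothetical bit; the issue does not pick one
  static final int PADDING = 3;          // skipped bytes so the payload starts at offset 4

  /** Writes the tag byte (existing flags plus the new bit), the padding and the payload. */
  static byte[] wrap(int existingFlags, byte[] payload) {
    ByteBuffer buffer = ByteBuffer.allocate(1 + PADDING + payload.length);
    buffer.put((byte) (existingFlags | FLAG_PROTOBUF));
    buffer.position(buffer.position() + PADDING); // padding bytes stay zero
    buffer.put(payload);
    return buffer.array();
  }

  /** Reads the tag byte, skips the padding and returns the payload bytes. */
  static byte[] unwrap(byte[] record) {
    ByteBuffer buffer = ByteBuffer.wrap(record);
    int flags = buffer.get() & 0xFF;
    if ((flags & FLAG_PROTOBUF) == 0) {
      throw new IllegalArgumentException("legacy encoding; not handled in this sketch");
    }
    buffer.position(buffer.position() + PADDING);
    byte[] payload = new byte[buffer.remaining()];
    buffer.get(payload);
    return payload;
  }

  public static void main(String[] args) {
    byte[] record = wrap(0x01, new byte[] {10, 20, 30});
    System.out.println(Arrays.toString(record));
    System.out.println(Arrays.toString(unwrap(record)));
  }
}
```

Readers that do not see the new bit would continue down the existing decoding path, which is what lets the old and new encodings share one tag byte.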



[jira] [Updated] (MAHOUT-836) On donating my Robust PCA Java code to Mahout

2013-06-01 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-836:
--

Fix Version/s: Backlog

Sujit, please go ahead and create a patch if you are still interested in 
contributing. 



> On donating my Robust PCA Java code to Mahout
> -
>
> Key: MAHOUT-836
> URL: https://issues.apache.org/jira/browse/MAHOUT-836
> Project: Mahout
>  Issue Type: New JIRA Project
>  Components: Classification
> Environment: Platform independent
>Reporter: Sujit Nair
>  Labels: newbie
> Fix For: Backlog
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Hi All,
> I have an implementation of Robust PCA (a.k.a low rank and sparse 
> decomposition) in Java which I would like to donate to Mahout. I am a MATLAB 
> expert, comfortable with C++ and have just started with Java. I am completely 
> new to Mahout but am very excited to participate and contribute. 
> I have tested my code exhaustively and there does not seem to be any issues. 
> The results are very good but the code definitely needs some optimization. 
> Please let me know if there is interest. 
> Thanks,
> Sujit



[jira] [Updated] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-01 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1211:
---

Fix Version/s: 0.8
 Assignee: Ted Dunning
Affects Version/s: (was: 0.7)

> Replace deprecated Closables.closeQuietly calls
> ---
>
> Key: MAHOUT-1211
> URL: https://issues.apache.org/jira/browse/MAHOUT-1211
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Stevo Slavic
>Assignee: Ted Dunning
>Priority: Minor
> Fix For: 0.8
>
>
> The deprecated Guava {{Closeables.closeQuietly}} API has to be replaced; its 
> usage is a code smell, and that method is scheduled to be removed from Guava 
> 16.0.
> See [this 
> discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
> for more info.
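
One possible replacement pattern, shown here only as a sketch and not as the change the project adopted, is try-with-resources, which closes the resource and lets a failure in close() surface instead of being swallowed. It needs a Java 7 source level; on Java 6 an explicit try/finally that calls close() would be required instead. Resource below is a hypothetical stand-in for a writer or stream.

```java
import java.io.Closeable;
import java.io.IOException;

/** Sketch only: closing with try-with-resources instead of a quiet close in finally. */
final class CloseSketch {

  /** Hypothetical resource that records whether it was closed and can fail on close. */
  static final class Resource implements Closeable {
    final boolean failOnClose;
    boolean closed;

    Resource(boolean failOnClose) {
      this.failOnClose = failOnClose;
    }

    @Override
    public void close() throws IOException {
      closed = true;
      if (failOnClose) {
        throw new IOException("close failed");
      }
    }
  }

  /** Uses the resource; it is closed on exit and an IOException from close() propagates. */
  static String use(Resource resource) throws IOException {
    try (Resource r = resource) {
      return r.closed ? "already closed" : "done";
    }
  }

  public static void main(String[] args) throws IOException {
    Resource ok = new Resource(false);
    System.out.println(use(ok) + ", closed=" + ok.closed);
    try {
      use(new Resource(true));
    } catch (IOException e) {
      System.out.println("surfaced: " + e.getMessage());
    }
  }
}
```

For a writer, a swallowed close() failure can mean silently truncated output, which is the usual argument against the quiet variant.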



[jira] [Resolved] (MAHOUT-1128) MAHOUT-999 issue still actual

2013-06-01 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-1128.


Resolution: Duplicate

>  MAHOUT-999 issue still actual
> --
>
> Key: MAHOUT-1128
> URL: https://issues.apache.org/jira/browse/MAHOUT-1128
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.7
> Environment: I work on Hadoop 1.0.3 cluster deployed on Amazon EC2 
> virtual computers with Ubuntu 11 and mahout-core.jar 0.7 from maven-central.
> I run my application from separated "client" machine and it submits tasks to 
> cluster.
>Reporter: Andrey Davydov
>
> I'm sorry, my English is not good and I'm a newbie with Mahout. But it seems 
> that the MAHOUT-999 issue is still present.
> I use mahout-core 0.7 loaded from maven-central and I got the same failure. 
> I investigated the sources and found the following in the 
> org.apache.mahout.clustering.classify.ClusterClassifier class:
>   public void writeToSeqFiles(Path path) throws IOException {
>     writePolicy(policy, path);
>     Configuration config = new Configuration();
>     FileSystem fs = FileSystem.get(path.toUri(), config);
>     SequenceFile.Writer writer = null;
>     ClusterWritable cw = new ClusterWritable();
>     for (int i = 0; i < models.size(); i++) {
>       ...
>       } finally {
>         Closeables.closeQuietly(writer);
>       }
>     }
>   }
>
>   public void readFromSeqFiles(Configuration conf, Path path) throws 
> IOException {
>     Configuration config = new Configuration();
>     List clusters = Lists.newArrayList();
>     for (ClusterWritable cw : new 
> SequenceFileDirValueIterable(path, PathType.LIST,
>         PathFilters.logsCRCFilter(), config)) {
>       ...
>     }
>     this.models = clusters;
>     modelClass = models.get(0).getClass().getName();
>     this.policy = readPolicy(path);
>   }
> Both methods create a new default Configuration, so they resolve paths 
> against the local file system. That is, KMeansDriver writes the initial 
> clusters to the local file system of the "client" machine, and CIMapper then 
> tries to read them from the local file system of a cluster node.
> It seems that the current implementation can work only on a 
> pseudo-distributed Hadoop system. I think ClusterClassifier should store 
> intermediate results in HDFS, using the Configuration passed in by the user 
> through the API.
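
The failure mode described above can be illustrated with a small stand-alone model. This is a sketch only and does not use the Hadoop API: java.util.Properties stands in for Hadoop's Configuration, and the class ConfigThreading and method qualify are hypothetical. The key "fs.default.name" is the Hadoop 1.x name of the default-filesystem setting; the host name in the usage note is made up.

```java
import java.net.URI;
import java.util.Properties;

public class ConfigThreading {
  // Hadoop 1.x key that names the default file system.
  static final String DEFAULT_FS_KEY = "fs.default.name";

  /**
   * Qualifies a scheme-less absolute path against the default file system
   * named in conf. With an empty conf this falls back to the local file
   * system, which models what happens when a method ignores the caller's
   * configuration and builds a fresh default one.
   */
  static URI qualify(String path, Properties conf) {
    URI uri = URI.create(path);
    if (uri.getScheme() != null) {
      return uri; // already fully qualified
    }
    String defaultFs = conf.getProperty(DEFAULT_FS_KEY, "file:///");
    return URI.create(defaultFs).resolve(path);
  }
}
```

In this model, qualify("/tmp/clusters", new Properties()) yields a file: URI, while the same call with a conf whose fs.default.name is hdfs://namenode:8020 yields a URI on that HDFS instance. In the real class the corresponding change would presumably be to pass the caller's Configuration to FileSystem.get(path.toUri(), conf) rather than a newly constructed one.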



[jira] [Work started] (MAHOUT-941) Improve ConfusionMatrix statistics

2013-06-01 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-941 started by Robin Anil.

> Improve ConfusionMatrix statistics
> --
>
> Key: MAHOUT-941
> URL: https://issues.apache.org/jira/browse/MAHOUT-941
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Reporter: Lance Norskog
>Assignee: Robin Anil
>Priority: Minor
> Fix For: 0.8
>
> Attachments: Bayes.zip, MAHOUT-941.patch, MAHOUT-941.patch, 
> MAHOUT-941.patch, SGD.zip
>
>
> This patch adds more statistics to the ConfusionMatrix and RequestAnalyzer.
> # Add the Kappa measure - a standard measure comparing observed agreement 
> vs. the agreement expected from random assignment.
> # Add mean & standard deviation of "Reliability" (User Accuracy) - assists in 
> identifying consistent mal-assignment against "good" and "bad" labels.
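
For reference, the Kappa measure mentioned in the first item can be computed from a confusion matrix roughly as follows. This is a sketch of the standard Cohen's kappa formula, not the code in the attached patch; the class KappaSketch and the convention that counts[i][j] holds items with actual label i predicted as label j are assumptions made for this example.

```java
public class KappaSketch {
  /**
   * Cohen's kappa for a square confusion matrix of counts:
   * kappa = (observed - expected) / (1 - expected), where "observed" is
   * the fraction on the diagonal and "expected" is the agreement that
   * random assignment with the same row and column totals would give.
   * Returns NaN when expected agreement is exactly 1 (a single class).
   */
  static double kappa(int[][] counts) {
    int n = counts.length;
    long total = 0;
    long diagonal = 0;
    long[] rowSums = new long[n];
    long[] colSums = new long[n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        total += counts[i][j];
        rowSums[i] += counts[i][j];
        colSums[j] += counts[i][j];
      }
      diagonal += counts[i][i];
    }
    double observed = (double) diagonal / total;
    double expected = 0.0;
    for (int i = 0; i < n; i++) {
      expected += ((double) rowSums[i] / total) * ((double) colSums[i] / total);
    }
    return (observed - expected) / (1.0 - expected);
  }
}
```

For the matrix {{20, 5}, {10, 15}}, observed agreement is 0.7 and expected agreement is 0.5, so kappa is 0.4. A perfectly diagonal matrix gives 1.0.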


