[jira] [Commented] (MAHOUT-1219) LSHSearcher not always faster than BruteSearcher

2013-05-19 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661810#comment-13661810
 ] 

Suneel Marthi commented on MAHOUT-1219:
---

I could not reproduce the error I had reported before. Let's resolve this and 
open a new JIRA if the issue recurs.

> LSHSearcher not always faster than BruteSearcher
> 
>
> Key: MAHOUT-1219
> URL: https://issues.apache.org/jira/browse/MAHOUT-1219
> Project: Mahout
>  Issue Type: Test
>Affects Versions: 0.8
>Reporter: Dan Filimon
>Priority: Minor
>
> This is a known issue and the performance of LocalitySensitiveHashSearch 
> needs to be further investigated.
> Currently, the one "benchmark" that does this, SearchQualityTest, is too 
> variable to be informative.
> So, I'm removing LSHSearcher from SearchQualityTest.
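
One common way to make a timing comparison like this less noisy is to warm up the JVM first and report the median of many timed runs instead of a single measurement. The sketch below is not taken from SearchQualityTest; `MedianTimer`, `median` and `timeMedianNanos` are hypothetical names used only for illustration, and the timed task is a placeholder for a searcher query.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout: times a task several times and
// reports the median, which is less sensitive to outliers than a single run.
class MedianTimer {
  // Median of the values; for an even count, the mean of the two middle values.
  static long median(long[] values) {
    long[] sorted = values.clone();
    Arrays.sort(sorted);
    int mid = sorted.length / 2;
    return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  }

  // Runs the task `warmups` times untimed, then `runs` (>= 1) times timed.
  static long timeMedianNanos(Runnable task, int warmups, int runs) {
    for (int i = 0; i < warmups; i++) {
      task.run();
    }
    long[] elapsed = new long[runs];
    for (int i = 0; i < runs; i++) {
      long start = System.nanoTime();
      task.run();
      elapsed[i] = System.nanoTime() - start;
    }
    return median(elapsed);
  }

  public static void main(String[] args) {
    // Placeholder workload; a real comparison would run a searcher query here.
    long nanos = timeMedianNanos(() -> Math.sqrt(12345.678), 1000, 101);
    System.out.println("median ns: " + nanos);
  }
}
```

Running the same helper over each searcher with identical queries would give two medians that can be compared directly, although a dedicated benchmarking harness would still be more reliable.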

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1217) Nearest neighbor searchers sometimes fail to remove points

2013-05-19 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661809#comment-13661809
 ] 

Suneel Marthi commented on MAHOUT-1217:
---

Dan, I tested your fix with both Fast Projection Search and Locality Sensitive 
Search and no longer see this error.

> Nearest neighbor searchers sometimes fail to remove points
> --
>
> Key: MAHOUT-1217
> URL: https://issues.apache.org/jira/browse/MAHOUT-1217
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.8
>Reporter: Dan Filimon
>
> When updating a Centroid in StreamingKMeans, the Centroid needs to be removed 
> and its updated version added.
> When removing a point that is already in a searcher, the searcher sometimes 
> fails to return the closest point (the one being searched for), 
> causing a RuntimeException.
> This has been observed for TF-IDF vectors with SquaredEuclideanDistance and 
> CosineDistance, using FastProjectionSearch.
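
The remove-then-add update described above can be sketched with a toy brute-force searcher. This is not Mahout's searcher API: `ToySearcher` and its methods are hypothetical, and the exhaustive search is an assumption chosen so that the lookup is exact. The sketch shows why removal depends on the lookup returning the stored point itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-in for a nearest neighbor searcher; NOT Mahout's searcher API.
class ToySearcher {
  private final List<double[]> points = new ArrayList<>();

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  void add(double[] p) {
    points.add(p.clone());
  }

  // Exhaustive search, so the true closest stored point is always found.
  double[] nearest(double[] query) {
    double[] best = null;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (double[] p : points) {
      double d = squaredDistance(p, query);
      if (d < bestDistance) {
        bestDistance = d;
        best = p;
      }
    }
    return best;
  }

  // Removal looks the point up first; if the lookup does not return the
  // point itself, the removal cannot proceed.
  void remove(double[] p) {
    double[] found = nearest(p);
    if (found == null || !Arrays.equals(found, p)) {
      throw new RuntimeException("closest point is not the point being removed");
    }
    points.remove(found); // same reference as the stored array
  }

  // Update = remove the old version, then add the new one.
  void update(double[] oldPoint, double[] newPoint) {
    remove(oldPoint);
    add(newPoint);
  }

  int size() {
    return points.size();
  }

  public static void main(String[] args) {
    ToySearcher searcher = new ToySearcher();
    searcher.add(new double[] {0, 0});
    searcher.add(new double[] {1, 1});
    searcher.update(new double[] {1, 1}, new double[] {2, 2});
    System.out.println("points stored: " + searcher.size());
  }
}
```

With an approximate searcher, `nearest` may return a different neighbor than the stored point, in which case the check in `remove` fails in the way the report describes.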



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-05-19 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661807#comment-13661807
 ] 

Ted Dunning commented on MAHOUT-1214:
-

The accuracy should be quite good if you use a single power step.

You can experiment with the algorithm using its R version [1].

See also Nathan Halko's dissertation and the arXiv paper on the subject [2].

The original JIRA issues [3,4] should be helpful as well. Attached to these 
JIRAs is a description [5] of an early version of the algorithm that was 
implemented. Dmitriy developed alternatives for some of the steps and 
implemented a power step to improve accuracy [6].

[1] https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.html

[2] http://arxiv.org/abs/0909.4061

[3] https://issues.apache.org/jira/browse/MAHOUT-792

[4] https://issues.apache.org/jira/browse/MAHOUT-797

[5] https://issues.apache.org/jira/secure/attachment/12491074/sd-2.pdf

[6] https://issues.apache.org/jira/secure/attachment/12493978/MAHOUT-797.pdf
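
To make the role of the power step concrete, here is a small in-memory sketch of a randomized range finder in the spirit of [2], with an optional power step. It is not Mahout's distributed SSVD code: `PowerStepSketch`, its method names, the dense `double[][]` matrices and the Gram-Schmidt orthonormalization are all assumptions made for illustration.

```java
import java.util.Random;

// Simplified, in-memory sketch of a randomized range finder with power
// steps; NOT Mahout's distributed SSVD implementation.
class PowerStepSketch {
  static double[][] multiply(double[][] a, double[][] b) {
    double[][] c = new double[a.length][b[0].length];
    for (int i = 0; i < a.length; i++) {
      for (int k = 0; k < b.length; k++) {
        for (int j = 0; j < b[0].length; j++) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
    return c;
  }

  static double[][] transpose(double[][] a) {
    double[][] t = new double[a[0].length][a.length];
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < a[0].length; j++) {
        t[j][i] = a[i][j];
      }
    }
    return t;
  }

  // Orthonormalizes the columns of y in place (modified Gram-Schmidt).
  // Assumes the columns are linearly independent.
  static void orthonormalize(double[][] y) {
    int rows = y.length;
    int cols = y[0].length;
    for (int j = 0; j < cols; j++) {
      for (int p = 0; p < j; p++) {
        double dot = 0;
        for (int i = 0; i < rows; i++) dot += y[i][p] * y[i][j];
        for (int i = 0; i < rows; i++) y[i][j] -= dot * y[i][p];
      }
      double norm = 0;
      for (int i = 0; i < rows; i++) norm += y[i][j] * y[i][j];
      norm = Math.sqrt(norm);
      for (int i = 0; i < rows; i++) y[i][j] /= norm;
    }
  }

  // Returns Q (rows of a x k) whose columns approximately span the range of a.
  static double[][] rangeBasis(double[][] a, int k, int powerSteps, long seed) {
    Random random = new Random(seed);
    double[][] omega = new double[a[0].length][k];
    for (double[] row : omega) {
      for (int j = 0; j < k; j++) row[j] = random.nextGaussian();
    }
    double[][] y = multiply(a, omega);   // Y = A * Omega
    double[][] at = transpose(a);
    for (int s = 0; s < powerSteps; s++) {
      orthonormalize(y);                 // keep the columns well conditioned
      y = multiply(a, multiply(at, y));  // power step: Y = A * (A^T * Y)
    }
    orthonormalize(y);
    return y;
  }

  // Largest absolute entry of A - Q * (Q^T * A).
  static double residual(double[][] a, double[][] q) {
    double[][] approx = multiply(q, multiply(transpose(q), a));
    double max = 0;
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < a[0].length; j++) {
        max = Math.max(max, Math.abs(a[i][j] - approx[i][j]));
      }
    }
    return max;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2, 3}, {2, 4, 6}, {1, 0, 1}, {0, 1, 1}}; // rank 2
    double[][] q = rangeBasis(a, 2, 1, 42L);
    System.out.println("residual: " + residual(a, q));
  }
}
```

Each power step multiplies the sketch by A·Aᵀ, which emphasizes the directions of the largest singular values; comparing `residual` for `powerSteps = 0` and `powerSteps = 1` on a matrix with slowly decaying singular values is one way to see the effect on accuracy.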


> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>  Labels: clustering, improvement
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
> NIPS 2002) in version 0.7 has two serious issues. These two implementation 
> errors make it fail even on a trivial dataset. We have 
> implemented a solution that resolves both issues and hope to contribute it 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> the eigenvectors, which is necessary to obtain correct clustering results 
> when K>1. We have an idea and an implementation that select eigenvectors 
> based on cosAngle/orthogonality.
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and 
> a bad initialization will sometimes generate a wrong clustering result. In 
> this case, the selected K eigenvectors actually provide a better way to 
> initialize the cluster centroids, because each selected eigenvector is a 
> relaxed indicator of the membership of one cluster. For every selected 
> eigenvector, we use the data point whose eigen component achieves the maximum 
> absolute value. 
> We have already verified our improvement on a synthetic dataset, and it shows 
> that the improved version gets the optimal clustering result while the 
> current 0.7 version obtains the wrong result.
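
The seeding rule in Issue 2 can be sketched as follows. This is only an illustration of the rule as described, not the reporter's patch: `EigenSeeding`, `argMaxAbs` and `seedIndices` are hypothetical names, and the eigenvectors are assumed to be stored one per row, with one component per data point.

```java
import java.util.Arrays;

// Illustration of the seeding rule described in Issue 2; NOT the actual patch.
class EigenSeeding {
  // Index of the component with the largest absolute value.
  static int argMaxAbs(double[] v) {
    int best = 0;
    for (int i = 1; i < v.length; i++) {
      if (Math.abs(v[i]) > Math.abs(v[best])) {
        best = i;
      }
    }
    return best;
  }

  // eigenvectors[j][i] is the component of eigenvector j for data point i.
  // Returns, for each eigenvector, the index of the data point to use as a seed.
  static int[] seedIndices(double[][] eigenvectors) {
    int[] seeds = new int[eigenvectors.length];
    for (int j = 0; j < eigenvectors.length; j++) {
      seeds[j] = argMaxAbs(eigenvectors[j]);
    }
    return seeds;
  }

  public static void main(String[] args) {
    double[][] eigenvectors = {
      {0.9, 0.8, 0.1, 0.0},   // mostly indicates points 0 and 1
      {0.0, -0.1, 0.7, -0.95} // mostly indicates points 2 and 3
    };
    System.out.println(Arrays.toString(seedIndices(eigenvectors)));
  }
}
```

The sketch does not handle the case where two eigenvectors select the same data point; a real implementation would need a rule for that.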



[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1221:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch Committed

> SparseMatrix.viewRow is sometimes readonly
> --
>
> Key: MAHOUT-1221
> URL: https://issues.apache.org/jira/browse/MAHOUT-1221
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
>Reporter: Maysam Yabandeh
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: patch
> Fix For: 0.8, 0.7
>
> Attachments: MAHOUT-1221.patch, MAHOUT-1221.patch
>
>
> The implementation returns a new vector if the row does not already exist, 
> but it does not add the new vector to the matrix, so later changes to that 
> vector are not reflected in the matrix.
> {code:java}
> if (res == null) {
>   res = new RandomAccessSparseVector(columnSize());
>   // now the row must be added by assignRow(row, res);
> }
> return res;
> {code}
> An example in which this bug manifests is the following:
> {code:title=QRDecomposition.java}
> x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k)));
> {code}
> where Matrix x is not updated if it is an instance of SparseMatrix.
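
The effect described above can be reproduced with a toy row-sparse matrix. This is not Mahout's SparseMatrix: `ToySparseMatrix`, `viewRowDetached` and the dense `double[]` rows are simplifications assumed for illustration. Only the lazy-row pattern is the same.

```java
import java.util.HashMap;
import java.util.Map;

// Toy row-sparse matrix; NOT Mahout's SparseMatrix, only the same lazy-row pattern.
class ToySparseMatrix {
  private final Map<Integer, double[]> rows = new HashMap<>();
  private final int columns;

  ToySparseMatrix(int columns) {
    this.columns = columns;
  }

  // Buggy variant: a missing row is created but never stored in the matrix.
  double[] viewRowDetached(int row) {
    double[] res = rows.get(row);
    if (res == null) {
      res = new double[columns];
    }
    return res;
  }

  // Fixed variant: the new row is stored before it is returned.
  double[] viewRow(int row) {
    double[] res = rows.get(row);
    if (res == null) {
      res = new double[columns];
      rows.put(row, res);
    }
    return res;
  }

  double get(int row, int column) {
    double[] res = rows.get(row);
    return res == null ? 0.0 : res[column];
  }

  public static void main(String[] args) {
    ToySparseMatrix m = new ToySparseMatrix(3);
    m.viewRowDetached(0)[1] = 5.0; // write is lost: the row was never stored
    System.out.println("after detached write: " + m.get(0, 1));
    m.viewRow(0)[1] = 5.0;         // write is visible through the matrix
    System.out.println("after stored write: " + m.get(0, 1));
  }
}
```

The detached variant behaves like a read-only view for rows that do not exist yet, which matches the issue title.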



[jira] [Assigned] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi reassigned MAHOUT-1221:
-

Assignee: Suneel Marthi

> SparseMatrix.viewRow is sometimes readonly
> --
>
> Key: MAHOUT-1221
> URL: https://issues.apache.org/jira/browse/MAHOUT-1221
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
>Reporter: Maysam Yabandeh
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: patch
> Fix For: 0.7, 0.8
>
> Attachments: MAHOUT-1221.patch, MAHOUT-1221.patch
>
>
> The implementation returns a new vector if the row does not already exist, 
> but it does not add the new vector to the matrix, so later changes to that 
> vector are not reflected in the matrix.
> {code:java}
> if (res == null) {
>   res = new RandomAccessSparseVector(columnSize());
>   // now the row must be added by assignRow(row, res);
> }
> return res;
> {code}
> An example in which this bug manifests is the following:
> {code:title=QRDecomposition.java}
> x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k)));
> {code}
> where Matrix x is not updated if it is an instance of SparseMatrix.



[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAHOUT-1221:


Status: Patch Available  (was: Open)

The patch to fix the bug is attached

> SparseMatrix.viewRow is sometimes readonly
> --
>
> Key: MAHOUT-1221
> URL: https://issues.apache.org/jira/browse/MAHOUT-1221
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
>Reporter: Maysam Yabandeh
>Priority: Minor
>  Labels: patch
> Fix For: 0.8, 0.7
>
> Attachments: MAHOUT-1221.patch, MAHOUT-1221.patch
>
>
> The implementation returns a new vector if the row does not already exist, 
> but it does not add the new vector to the matrix, so later changes to that 
> vector are not reflected in the matrix.
> {code:java}
> if (res == null) {
>   res = new RandomAccessSparseVector(columnSize());
>   // now the row must be added by assignRow(row, res);
> }
> return res;
> {code}
> An example in which this bug manifests is the following:
> {code:title=QRDecomposition.java}
> x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k)));
> {code}
> where Matrix x is not updated if it is an instance of SparseMatrix.



[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAHOUT-1221:


Attachment: MAHOUT-1221.patch
MAHOUT-1221.patch

Fixed the typo in the patch name.

> SparseMatrix.viewRow is sometimes readonly
> --
>
> Key: MAHOUT-1221
> URL: https://issues.apache.org/jira/browse/MAHOUT-1221
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
>Reporter: Maysam Yabandeh
>Priority: Minor
>  Labels: patch
> Fix For: 0.7, 0.8
>
> Attachments: MAHOUT-1221.patch, MAHOUT-1221.patch
>
>
> The implementation returns a new vector if the row does not already exist, 
> but it does not add the new vector to the matrix, so later changes to that 
> vector are not reflected in the matrix.
> {code:java}
> if (res == null) {
>   res = new RandomAccessSparseVector(columnSize());
>   // now the row must be added by assignRow(row, res);
> }
> return res;
> {code}
> An example in which this bug manifests is the following:
> {code:title=QRDecomposition.java}
> x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k)));
> {code}
> where Matrix x is not updated if it is an instance of SparseMatrix.



[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAHOUT-1221:


Attachment: (was: MAHOUT-1221.path)

> SparseMatrix.viewRow is sometimes readonly
> --
>
> Key: MAHOUT-1221
> URL: https://issues.apache.org/jira/browse/MAHOUT-1221
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
>Reporter: Maysam Yabandeh
>Priority: Minor
>  Labels: patch
> Fix For: 0.7, 0.8
>
>
> The implementation returns a new vector if the row does not already exist, 
> but it does not add the new vector to the matrix, so later changes to that 
> vector are not reflected in the matrix.
> {code:java}
> if (res == null) {
>   res = new RandomAccessSparseVector(columnSize());
>   // now the row must be added by assignRow(row, res);
> }
> return res;
> {code}
> An example in which this bug manifests is the following:
> {code:title=QRDecomposition.java}
> x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k)));
> {code}
> where Matrix x is not updated if it is an instance of SparseMatrix.



[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAHOUT-1221:


Attachment: MAHOUT-1221.path

The patch to fix the reported bug

> SparseMatrix.viewRow is sometimes readonly
> --
>
> Key: MAHOUT-1221
> URL: https://issues.apache.org/jira/browse/MAHOUT-1221
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
>Reporter: Maysam Yabandeh
>Priority: Minor
>  Labels: patch
> Fix For: 0.7, 0.8
>
>
> The implementation returns a new vector if the row does not already exist, 
> but it does not add the new vector to the matrix, so later changes to that 
> vector are not reflected in the matrix.
> {code:java}
> if (res == null) {
>   res = new RandomAccessSparseVector(columnSize());
>   // now the row must be added by assignRow(row, res);
> }
> return res;
> {code}
> An example in which this bug manifests is the following:
> {code:title=QRDecomposition.java}
> x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k)));
> {code}
> where Matrix x is not updated if it is an instance of SparseMatrix.



[jira] [Updated] (MAHOUT-1220) seqdirectory brings empty files out

2013-05-19 Thread Summer Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Summer Lee updated MAHOUT-1220:
---

Description: 
I ran "mahout seqdirectory" on my input file.
--> command
mahout seqdirectory --input 
user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output 
/user/hdfs/mahout_test/output/final3/seqdirectory/

However, the result file "chunk-0" contains the following.

--> chunk-0
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text

I heard that chunk-0 files should contain numbers, like 
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
I thought something was wrong with my input file, so I tried several other 
input files, but the results are the same.
How can I fix this? 

  was:
I ran "mahout seqdirectory" on my input file.
--> command
/engn001/sbp/bigpack/mahout/bin/mahout seqdirectory --input 
user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output 
/user/hdfs/mahout_test/output/final3/seqdirectory/

However, the result file "chunk-0" contains the following.

--> chunk-0
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text

I heard that chunk-0 files should contain numbers, like 
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
I thought something was wrong with my input file, so I tried several other 
input files, but the results are the same.
How can I fix this? 


> seqdirectory brings empty files out
> ---
>
> Key: MAHOUT-1220
> URL: https://issues.apache.org/jira/browse/MAHOUT-1220
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
>Reporter: Summer Lee
> Fix For: 0.7
>
>
> I ran "mahout seqdirectory" on my input file.
> --> command
> mahout seqdirectory --input 
> user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output 
> /user/hdfs/mahout_test/output/final3/seqdirectory/
> However, the result file "chunk-0" contains the following.
> --> chunk-0
> SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
> I heard that chunk-0 files should contain numbers, like 
> SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
> I thought something was wrong with my input file, so I tried several other 
> input files, but the results are the same.
> How can I fix this? 



[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAHOUT-1221:


Status: Open  (was: Patch Available)

> SparseMatrix.viewRow is sometimes readonly
> --
>
> Key: MAHOUT-1221
> URL: https://issues.apache.org/jira/browse/MAHOUT-1221
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
>Reporter: Maysam Yabandeh
>Priority: Minor
>  Labels: patch
> Fix For: 0.8, 0.7
>
>
> The implementation returns a new vector if the row does not already exist, 
> but it does not add the new vector to the matrix, so later changes to that 
> vector are not reflected in the matrix.
> {code:java}
> if (res == null) {
>   res = new RandomAccessSparseVector(columnSize());
>   // now the row must be added by assignRow(row, res);
> }
> return res;
> {code}
> An example in which this bug manifests is the following:
> {code:title=QRDecomposition.java}
> x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k)));
> {code}
> where Matrix x is not updated if it is an instance of SparseMatrix.



[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAHOUT-1221:


Labels: patch  (was: )
Status: Patch Available  (was: Open)

> SparseMatrix.viewRow is sometimes readonly
> --
>
> Key: MAHOUT-1221
> URL: https://issues.apache.org/jira/browse/MAHOUT-1221
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
>Reporter: Maysam Yabandeh
>Priority: Minor
>  Labels: patch
> Fix For: 0.8, 0.7
>
>
> The implementation returns a new vector if the row does not already exist, 
> but it does not add the new vector to the matrix, so later changes to that 
> vector are not reflected in the matrix.
> {code:java}
> if (res == null) {
>   res = new RandomAccessSparseVector(columnSize());
>   // now the row must be added by assignRow(row, res);
> }
> return res;
> {code}
> An example in which this bug manifests is the following:
> {code:title=QRDecomposition.java}
> x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k)));
> {code}
> where Matrix x is not updated if it is an instance of SparseMatrix.



[jira] [Commented] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661764#comment-13661764
 ] 

Suneel Marthi commented on MAHOUT-1221:
---

Would you like to submit a patch?

> SparseMatrix.viewRow is sometimes readonly
> --
>
> Key: MAHOUT-1221
> URL: https://issues.apache.org/jira/browse/MAHOUT-1221
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
>Reporter: Maysam Yabandeh
>Priority: Minor
> Fix For: 0.7, 0.8
>
>
> The implementation returns a new vector if the row does not already exist, 
> but it does not add the new vector to the matrix, so later changes to that 
> vector are not reflected in the matrix.
> {code:java}
> if (res == null) {
>   res = new RandomAccessSparseVector(columnSize());
>   // now the row must be added by assignRow(row, res);
> }
> return res;
> {code}
> An example in which this bug manifests is the following:
> {code:title=QRDecomposition.java}
> x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k)));
> {code}
> where Matrix x is not updated if it is an instance of SparseMatrix.



[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAHOUT-1221:


Description: 
The implementation returns a new vector if the row does not already exist, 
but it does not add the new vector to the matrix, so later changes to that 
vector are not reflected in the matrix.
{code:java}
if (res == null) {
  res = new RandomAccessSparseVector(columnSize());
  // now the row must be added by assignRow(row, res);
}
return res;
{code}
An example in which this bug manifests is the following:
{code:title=QRDecomposition.java}
x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k)));
{code}
where Matrix x is not updated if it is an instance of SparseMatrix.

  was:
The implementation returns a new vector if the row does not already exist, 
but it does not add the new vector to the matrix, so later changes to that 
vector are not reflected in the matrix.
{code:java}
if (res == null) {
  res = new RandomAccessSparseVector(columnSize());
  // now the row must be added by assignRow(row, res);
}
return res;
{code}


> SparseMatrix.viewRow is sometimes readonly
> --
>
> Key: MAHOUT-1221
> URL: https://issues.apache.org/jira/browse/MAHOUT-1221
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
>Reporter: Maysam Yabandeh
>Priority: Minor
> Fix For: 0.7, 0.8
>
>
> The implementation returns a new vector if the row does not already exist, 
> but it does not add the new vector to the matrix, so later changes to that 
> vector are not reflected in the matrix.
> {code:java}
> if (res == null) {
>   res = new RandomAccessSparseVector(columnSize());
>   // now the row must be added by assignRow(row, res);
> }
> return res;
> {code}
> An example in which this bug manifests is the following:
> {code:title=QRDecomposition.java}
> x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k)));
> {code}
> where Matrix x is not updated if it is an instance of SparseMatrix.



[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAHOUT-1221:


Description: 
The implementation returns a new vector if the row does not already exist, 
but it does not add the new vector to the matrix, so later changes to that 
vector are not reflected in the matrix.
{code:java}
if (res == null) {
  res = new RandomAccessSparseVector(columnSize());
  // now the row must be added by assignRow(row, res);
}
return res;
{code}

  was:
The implementation returns a new vector if the row does not already exist, 
but it does not add the new vector to the matrix, so later changes to that 
vector are not reflected in the matrix.
if (res == null) {
  res = new RandomAccessSparseVector(columnSize());
  // now the row must be added by assignRow(row, res);
}
return res;


> SparseMatrix.viewRow is sometimes readonly
> --
>
> Key: MAHOUT-1221
> URL: https://issues.apache.org/jira/browse/MAHOUT-1221
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
>Reporter: Maysam Yabandeh
>Priority: Minor
> Fix For: 0.7, 0.8
>
>
> The implementation returns a new vector if the row does not already exist, 
> but it does not add the new vector to the matrix, so later changes to that 
> vector are not reflected in the matrix.
> {code:java}
> if (res == null) {
>   res = new RandomAccessSparseVector(columnSize());
>   // now the row must be added by assignRow(row, res);
> }
> return res;
> {code}



[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAHOUT-1221:


Description: 
The implementation returns a new vector if the row does not already exist, 
but it does not add the new vector to the matrix, so later changes to that 
vector are not reflected in the matrix.
if (res == null) {
  res = new RandomAccessSparseVector(columnSize());
  // now the row must be added by assignRow(row, res);
}
return res;

> SparseMatrix.viewRow is sometimes readonly
> --
>
> Key: MAHOUT-1221
> URL: https://issues.apache.org/jira/browse/MAHOUT-1221
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.7
>Reporter: Maysam Yabandeh
>Priority: Minor
> Fix For: 0.7, 0.8
>
>
> The implementation returns a new vector if the row does not already exist, 
> but it does not add the new vector to the matrix, so later changes to that 
> vector are not reflected in the matrix.
> if (res == null) {
>   res = new RandomAccessSparseVector(columnSize());
>   // now the row must be added by assignRow(row, res);
> }
> return res;



[jira] [Created] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly

2013-05-19 Thread Maysam Yabandeh (JIRA)
Maysam Yabandeh created MAHOUT-1221:
---

 Summary: SparseMatrix.viewRow is sometimes readonly
 Key: MAHOUT-1221
 URL: https://issues.apache.org/jira/browse/MAHOUT-1221
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.7
Reporter: Maysam Yabandeh
Priority: Minor
 Fix For: 0.8, 0.7






[jira] [Updated] (MAHOUT-1220) seqdirectory brings empty files out

2013-05-19 Thread Summer Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Summer Lee updated MAHOUT-1220:
---

Description: 
I ran "mahout seqdirectory" on my input file.
--> command
/engn001/sbp/bigpack/mahout/bin/mahout seqdirectory --input 
user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output 
/user/hdfs/mahout_test/output/final3/seqdirectory/

However, the result file "chunk-0" contains the following.

--> chunk-0
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text

I heard that chunk-0 files should contain numbers, like 
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
I thought something was wrong with my input file, so I tried several other 
input files, but the results are the same.
How can I fix this? 

  was:
I ran "mahout seqdirectory" on my input file, but the result file "chunk-0" 
contains the following.

--> chunk-0
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text

I heard that chunk-0 files should contain numbers, like 
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
I thought something was wrong with my input file, so I tried several other 
input files, but the results are the same.
How can I fix this? 


> seqdirectory brings empty files out
> ---
>
> Key: MAHOUT-1220
> URL: https://issues.apache.org/jira/browse/MAHOUT-1220
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
>Reporter: Summer Lee
> Fix For: 0.7
>
>
> I ran "mahout seqdirectory" on my input file.
> --> command
> /engn001/sbp/bigpack/mahout/bin/mahout seqdirectory --input 
> user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output 
> /user/hdfs/mahout_test/output/final3/seqdirectory/
> However, the result file "chunk-0" contains the following.
> --> chunk-0
> SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
> I heard that chunk-0 files should contain numbers, like 
> SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
> I thought something was wrong with my input file, so I tried several other 
> input files, but the results are the same.
> How can I fix this? 



[jira] [Updated] (MAHOUT-1220) seqdirectory brings empty files out

2013-05-19 Thread Summer Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Summer Lee updated MAHOUT-1220:
---

Description: 
I ran "mahout seqdirectory" on my input file, but the result file "chunk-0" 
contains the following.

--> chunk-0
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text

I heard that chunk-0 files should contain numbers, like 
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
I thought something was wrong with my input file, so I tried several other 
input files, but the results are the same.
How can I fix this? 

  was:
I ran "mahout seqdirectory" on my input file, but the result file "chunk-0" 
contains the following.

--> chunk-0
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text

I heard that chunk-0 files should contain numbers, like 
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
So I tried several other files, but the results are the same.
How can I fix this? 


> seqdirectory brings empty files out
> ---
>
> Key: MAHOUT-1220
> URL: https://issues.apache.org/jira/browse/MAHOUT-1220
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
>Reporter: Summer Lee
> Fix For: 0.7
>
>
> I ran "mahout seqdirectory" on my input file, but the result file "chunk-0" 
> contains the following.
> --> chunk-0
> SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
> I heard that chunk-0 files should contain numbers, like 
> SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
> I thought something was wrong with my input file, so I tried several other 
> input files, but the results are the same.
> How can I fix this? 



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-05-19 Thread Yiqun Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661751#comment-13661751
 ] 

Yiqun Hu commented on MAHOUT-1214:
--

I haven't had a chance to study the ball k-means or streaming k-means algorithms.
My comment is the same as Quinn's: if the performance and accuracy of SSVD are
seriously verified (I mean by comparing it with the Lanczos solver in different
situations), it will be a good idea.

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.7
> Environment: Mahout 0.7
> Reporter: Yiqun Hu
> Labels: clustering, improvement
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng et al.,
> NIPS 2002) in version 0.7 has two serious issues. These two implementation
> errors make it fail even on a trivially simple dataset. We have implemented a
> solution that resolves both issues and hope to contribute it back to the
> community.
> # Issue 1:
> The EigenVerificationJob in version 0.7 does not check the orthogonality of
> eigenvectors, which is necessary to obtain correct clustering results when
> K>1. We have an idea and an implementation that selects eigenvectors based on
> cosAngle/orthogonality.
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and a
> bad initialization will sometimes generate a wrong clustering result. In this
> case, the K selected eigenvectors actually provide a better way to initialize
> cluster centroids, because each selected eigenvector is a relaxed indicator of
> the memberships of one cluster. For every selected eigenvector, we use the
> data point whose component in that eigenvector has the maximum absolute value.
> We have already verified our improvement on a synthetic dataset: the improved
> version gets the optimal clustering result, while the current 0.7 version
> obtains the wrong result.
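
The two ideas in the description above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the reporter's patch and not Mahout code: the class SpectralInitSketch, its method names, the greedy selection strategy, and the tolerance value are all hypothetical. In particular, treating "cosAngle/orthogonality" as the pairwise cosine between candidate eigenvectors is one possible reading of the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names exist in Mahout.
final class SpectralInitSketch {

    /** Cosine of the angle between two vectors; a value near 0 means nearly orthogonal. */
    static double cosAngle(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Issue 1 (assumed reading): walk the candidates in order and keep up to k
     * eigenvectors whose |cosine| against every already kept one is <= tolerance.
     */
    static List<Integer> selectOrthogonal(double[][] eigenvectors, int k, double tolerance) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < eigenvectors.length && kept.size() < k; i++) {
            boolean orthogonalToKept = true;
            for (int j : kept) {
                if (Math.abs(cosAngle(eigenvectors[i], eigenvectors[j])) > tolerance) {
                    orthogonalToKept = false;
                    break;
                }
            }
            if (orthogonalToKept) {
                kept.add(i);
            }
        }
        return kept;
    }

    /**
     * Issue 2: index of the data point whose component in this eigenvector
     * has the largest absolute value (first index wins on ties).
     */
    static int argMaxAbs(double[] eigenvector) {
        int best = 0;
        for (int i = 1; i < eigenvector.length; i++) {
            if (Math.abs(eigenvector[i]) > Math.abs(eigenvector[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up candidate eigenvectors over four data points.
        double[][] eigenvectors = {
            {0.7, 0.7, 0.0, 0.0},
            {0.7, 0.7, 0.1, 0.0}, // nearly parallel to the first candidate
            {0.0, 0.1, 0.6, 0.8}
        };
        for (int idx : selectOrthogonal(eigenvectors, 2, 0.2)) {
            System.out.println("eigenvector " + idx + " -> seed point " + argMaxAbs(eigenvectors[idx]));
        }
    }
}
```

Each seed point index would then pick the row of the embedded data used as that cluster's initial centroid, in place of a random seed.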



[jira] [Updated] (MAHOUT-1220) seqdirectory brings empty files out

2013-05-19 Thread Summer Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Summer Lee updated MAHOUT-1220:
---

Description: 
I ran "mahout seqdirectory" on my input file, but the resulting file, "chunk-0",
contains only this:

--> chunk-0
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text

I heard that chunk-0 files should contain numbers, like
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
So I tried several other files, but the results are the same.
How can I fix this?

  was:
I ran seqdirectory on my input file, but the resulting file, "chunk-0", contains
only this:
--> chunk-0
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text

I tried several other files, but the results are the same.

How can I fix this?





[jira] [Created] (MAHOUT-1220) seqdirectory brings empty files out

2013-05-19 Thread Summer Lee (JIRA)
Summer Lee created MAHOUT-1220:
--

Summary: seqdirectory brings empty files out
Key: MAHOUT-1220
URL: https://issues.apache.org/jira/browse/MAHOUT-1220
Project: Mahout
Issue Type: Bug
Affects Versions: 0.7
Reporter: Summer Lee
Fix For: 0.7


I ran seqdirectory on my input file, but the resulting file, "chunk-0", contains
only this:
--> chunk-0
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text

I tried several other files, but the results are the same.

How can I fix this?



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-05-19 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661612#comment-13661612
 ] 

Ted Dunning commented on MAHOUT-1214:
-

{quote}
You mean use SSVD exclusively in place of Lanczos?
{quote}

Yes.  Exactly.




[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-05-19 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661587#comment-13661587
 ] 

Shannon Quinn commented on MAHOUT-1214:
---

Ted,

I'm not sure I follow. You mean use SSVD exclusively in place of Lanczos?

I'd love to assess performance and accuracy with ball or streaming k-means 
instead. That's an excellent idea.

