[jira] [Commented] (MAHOUT-1219) LSHSearcher not always faster than BruteSearcher
[ https://issues.apache.org/jira/browse/MAHOUT-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661810#comment-13661810 ] Suneel Marthi commented on MAHOUT-1219: --- I could not reproduce the error I had reported before. Let's resolve this and open a new JIRA if the issue recurs. > LSHSearcher not always faster than BruteSearcher > > > Key: MAHOUT-1219 > URL: https://issues.apache.org/jira/browse/MAHOUT-1219 > Project: Mahout > Issue Type: Test >Affects Versions: 0.8 >Reporter: Dan Filimon >Priority: Minor > > This is a known issue and the performance of LocalitySensitiveHashSearch > needs to be further investigated. > Currently, the one "benchmark" that does this, SearchQualityTest, is too > variable to be informative. > So, I'm removing LSHSearcher from SearchQualityTest. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-1217) Nearest neighbor searchers sometimes fail to remove points
[ https://issues.apache.org/jira/browse/MAHOUT-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661809#comment-13661809 ] Suneel Marthi commented on MAHOUT-1217: --- Dan, I tested your fix with both Fast Projection Search and Locality Sensitive Search and no longer see this error. > Nearest neighbor searchers sometimes fail to remove points > -- > > Key: MAHOUT-1217 > URL: https://issues.apache.org/jira/browse/MAHOUT-1217 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.8 >Reporter: Dan Filimon > > When updating a Centroid in StreamingKMeans, the Centroid needs to be removed > and its updated version added. > When removing points that are already in a searcher, sometimes the > searcher fails to return the closest point (the one being searched for), > causing a RuntimeException. > This has been observed for TF-IDF vectors with SquaredEuclideanDistance and > CosineDistance and FastProjectionSearch.
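The remove-then-add update described in the issue can be illustrated with a small exhaustive searcher. This is only a sketch under stated assumptions: `ToyBruteSearcher` and its methods are hypothetical names, not Mahout classes, and points are plain `double[]` arrays. Because the search here is exhaustive, a stored point is always its own nearest neighbor, so the removal cannot miss; an approximate searcher gives no such guarantee, which is the failure mode the issue reports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical exhaustive searcher; not Mahout's BruteSearch or FastProjectionSearch. */
final class ToyBruteSearcher {
  private final List<double[]> points = new ArrayList<>();

  void add(double[] point) {
    points.add(point.clone());
  }

  int size() {
    return points.size();
  }

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Index of the stored point closest to the query, or -1 if nothing is stored. */
  int nearestIndex(double[] query) {
    int best = -1;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < points.size(); i++) {
      double d = squaredDistance(points.get(i), query);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return best;
  }

  /** Removes the stored point equal to target; returns false if the closest point is not it. */
  boolean remove(double[] target) {
    int i = nearestIndex(target);
    if (i < 0 || !Arrays.equals(points.get(i), target)) {
      return false;
    }
    points.remove(i);
    return true;
  }

  /** Update as described in the issue: remove the old centroid, then add the new one. */
  void update(double[] oldCentroid, double[] newCentroid) {
    if (!remove(oldCentroid)) {
      throw new IllegalStateException("centroid to update was not found");
    }
    add(newCentroid);
  }
}
```

In this toy model, `update` throws only when the old centroid was never stored. With an approximate nearest-neighbor search in place of `nearestIndex`, the same check could fail for a point that is present.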
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661807#comment-13661807 ] Ted Dunning commented on MAHOUT-1214: - The accuracy should be quite good if you use a single power step. You can experiment with the algorithm using its R version [1]. See also Nathan Halko's dissertation and the arXiv paper on the subject [2]. The original JIRA issues [3,4] should be helpful as well. Attached to these JIRAs is a description [5] of an early version of the algorithm that was implemented. Dmitriy developed alternatives for some of the steps and implemented a power step to improve accuracy [6]. [1] https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.html [2] http://arxiv.org/abs/0909.4061 [3] https://issues.apache.org/jira/browse/MAHOUT-792 [4] https://issues.apache.org/jira/browse/MAHOUT-797 [5] https://issues.apache.org/jira/secure/attachment/12491074/sd-2.pdf [6] https://issues.apache.org/jira/secure/attachment/12493978/MAHOUT-797.pdf > Improve the accuracy of the Spectral KMeans Method > -- > > Key: MAHOUT-1214 > URL: https://issues.apache.org/jira/browse/MAHOUT-1214 > Project: Mahout > Issue Type: Improvement > Components: Clustering >Affects Versions: 0.7 > Environment: Mahout 0.7 >Reporter: Yiqun Hu > Labels: clustering, improvement > > The current implementation of the spectral KMeans algorithm (Andrew Ng et al., > NIPS 2002) in version 0.7 has two serious issues. These two implementation > errors make it fail even for a very obvious trivial dataset. We have > implemented a solution to resolve these two issues and hope to contribute it > back to the community. 
> # Issue 1: > The EigenVerificationJob in version 0.7 does not check the orthogonality of > eigenvectors, which is necessary to obtain the correct clustering results for > the case of K>1. We have an idea and an implementation that select eigenvectors based on > cosAngle/orthogonality. > # Issue 2: > The random seed initialization of the KMeans algorithm is not optimal, and > sometimes a bad initialization will generate a wrong clustering result. In this > case, the selected K eigenvectors actually provide a better way to initialize > cluster centroids because each selected eigenvector is a relaxed indicator of > the memberships of one cluster. For every selected eigenvector, we use the > data point whose eigen component achieves the maximum absolute value. > We have already verified our improvement on a synthetic dataset, and it shows > that the improved version gets the optimal clustering result while the current > 0.7 version obtains the wrong result.
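The seeding rule in Issue 2 (for each selected eigenvector, take the data point whose component has the largest absolute value) can be sketched as follows. This is an illustration only, not the reporter's implementation: `EigenSeedSketch`, `argMaxAbs`, and `seedIndices` are hypothetical names, eigenvectors are assumed to be plain `double[]` arrays indexed by data point, and the case where two eigenvectors pick the same point is not handled.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating eigenvector-based seeding; not part of Mahout. */
final class EigenSeedSketch {
  /** Returns the index of the entry with the largest absolute value (first one wins on ties). */
  static int argMaxAbs(double[] eigenvector) {
    int best = 0;
    for (int i = 1; i < eigenvector.length; i++) {
      if (Math.abs(eigenvector[i]) > Math.abs(eigenvector[best])) {
        best = i;
      }
    }
    return best;
  }

  /** Picks one seed data-point index per selected eigenvector. */
  static int[] seedIndices(double[][] eigenvectors) {
    int[] seeds = new int[eigenvectors.length];
    for (int k = 0; k < eigenvectors.length; k++) {
      seeds[k] = argMaxAbs(eigenvectors[k]);
    }
    return seeds;
  }

  public static void main(String[] args) {
    // Two made-up eigenvectors over four data points.
    double[][] eigenvectors = {
      {0.9, 0.8, 0.1, 0.0},
      {0.0, -0.1, 0.7, -0.95}
    };
    System.out.println(Arrays.toString(seedIndices(eigenvectors))); // prints [0, 3]
  }
}
```

The data points at the returned indices would then serve as the initial centroids in place of randomly chosen seeds.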
[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
[ https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suneel Marthi updated MAHOUT-1221: -- Resolution: Fixed Status: Resolved (was: Patch Available) Patch Committed > SparseMatrix.viewRow is sometimes readonly > -- > > Key: MAHOUT-1221 > URL: https://issues.apache.org/jira/browse/MAHOUT-1221 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.7 >Reporter: Maysam Yabandeh >Assignee: Suneel Marthi >Priority: Minor > Labels: patch > Fix For: 0.8, 0.7 > > Attachments: MAHOUT-1221.patch, MAHOUT-1221.patch > > > The implementation returns a new vector if the row does not already exist, but it > does not add the new vector to the matrix, so later changes will not be > reflected in the matrix. > {code:java} > if (res == null) { > res = new RandomAccessSparseVector(columnSize()); > // the new row must now be added via assignRow(row, res) > } > return res; > {code} > An example in which this bug manifests is the following: > {code:title=QRDecomposition.java} > x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k))); > {code} > where Matrix x is not updated if it is an instance of SparseMatrix.
[jira] [Assigned] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
[ https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suneel Marthi reassigned MAHOUT-1221: - Assignee: Suneel Marthi > SparseMatrix.viewRow is sometimes readonly > -- > > Key: MAHOUT-1221 > URL: https://issues.apache.org/jira/browse/MAHOUT-1221 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.7 >Reporter: Maysam Yabandeh >Assignee: Suneel Marthi >Priority: Minor > Labels: patch > Fix For: 0.7, 0.8 > > Attachments: MAHOUT-1221.patch, MAHOUT-1221.patch > > > The implementation returns a new vector if it already does not exist. But it > does not add the new vector to the matrix. So, the later changes will not be > reflected in the matrix. > {code:java} > if (res == null) { > res = newRandomAccessSparceVector(columnSize()); > //now the row must be added by assignRow(row, res); > } > return res; > {code} > An example in which this bug manifests is the following: > {code:title=QRDecomposition.java} > x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k))); > {code} > where Matrix x is not updated if it is an instance of SparseMatrix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
[ https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated MAHOUT-1221: Status: Patch Available (was: Open) The patch to fix the bug is attached > SparseMatrix.viewRow is sometimes readonly > -- > > Key: MAHOUT-1221 > URL: https://issues.apache.org/jira/browse/MAHOUT-1221 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.7 >Reporter: Maysam Yabandeh >Priority: Minor > Labels: patch > Fix For: 0.8, 0.7 > > Attachments: MAHOUT-1221.patch, MAHOUT-1221.patch > > > The implementation returns a new vector if it already does not exist. But it > does not add the new vector to the matrix. So, the later changes will not be > reflected in the matrix. > {code:java} > if (res == null) { > res = newRandomAccessSparceVector(columnSize()); > //now the row must be added by assignRow(row, res); > } > return res; > {code} > An example in which this bug manifests is the following: > {code:title=QRDecomposition.java} > x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k))); > {code} > where Matrix x is not updated if it is an instance of SparseMatrix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
[ https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated MAHOUT-1221: Attachment: MAHOUT-1221.patch MAHOUT-1221.patch fix the typo in patch name > SparseMatrix.viewRow is sometimes readonly > -- > > Key: MAHOUT-1221 > URL: https://issues.apache.org/jira/browse/MAHOUT-1221 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.7 >Reporter: Maysam Yabandeh >Priority: Minor > Labels: patch > Fix For: 0.7, 0.8 > > Attachments: MAHOUT-1221.patch, MAHOUT-1221.patch > > > The implementation returns a new vector if it already does not exist. But it > does not add the new vector to the matrix. So, the later changes will not be > reflected in the matrix. > {code:java} > if (res == null) { > res = newRandomAccessSparceVector(columnSize()); > //now the row must be added by assignRow(row, res); > } > return res; > {code} > An example in which this bug manifests is the following: > {code:title=QRDecomposition.java} > x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k))); > {code} > where Matrix x is not updated if it is an instance of SparseMatrix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
[ https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated MAHOUT-1221: Attachment: (was: MAHOUT-1221.path) > SparseMatrix.viewRow is sometimes readonly > -- > > Key: MAHOUT-1221 > URL: https://issues.apache.org/jira/browse/MAHOUT-1221 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.7 >Reporter: Maysam Yabandeh >Priority: Minor > Labels: patch > Fix For: 0.7, 0.8 > > > The implementation returns a new vector if it already does not exist. But it > does not add the new vector to the matrix. So, the later changes will not be > reflected in the matrix. > {code:java} > if (res == null) { > res = newRandomAccessSparceVector(columnSize()); > //now the row must be added by assignRow(row, res); > } > return res; > {code} > An example in which this bug manifests is the following: > {code:title=QRDecomposition.java} > x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k))); > {code} > where Matrix x is not updated if it is an instance of SparseMatrix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
[ https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated MAHOUT-1221: Attachment: MAHOUT-1221.path The patch to fix the reported bug > SparseMatrix.viewRow is sometimes readonly > -- > > Key: MAHOUT-1221 > URL: https://issues.apache.org/jira/browse/MAHOUT-1221 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.7 >Reporter: Maysam Yabandeh >Priority: Minor > Labels: patch > Fix For: 0.7, 0.8 > > > The implementation returns a new vector if it already does not exist. But it > does not add the new vector to the matrix. So, the later changes will not be > reflected in the matrix. > {code:java} > if (res == null) { > res = newRandomAccessSparceVector(columnSize()); > //now the row must be added by assignRow(row, res); > } > return res; > {code} > An example in which this bug manifests is the following: > {code:title=QRDecomposition.java} > x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k))); > {code} > where Matrix x is not updated if it is an instance of SparseMatrix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1220) seqdirectory brings empty files out
[ https://issues.apache.org/jira/browse/MAHOUT-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Summer Lee updated MAHOUT-1220: --- Description: I ran "mahout seqdirectory" on my input file --> command mahout seqdirectory --input user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output /user/hdfs/mahout_test/output/final3/seqdirectory/ but the result file, "chunk-0", contains only this. --> chunk-0 SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text I heard that chunk-0 files should contain numbers, like SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ... I thought something was wrong with my input file, so I tried other input files, but the results are the same. How can I fix this? was: I ran "mahout seqdirectory" on my input file --> command /engn001/sbp/bigpack/mahout/bin/mahout seqdirectory --input user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output /user/hdfs/mahout_test/output/final3/seqdirectory/ but the result file, "chunk-0", contains only this. --> chunk-0 SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text I heard that chunk-0 files should contain numbers, like SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ... I thought something was wrong with my input file, so I tried other input files, but the results are the same. How can I fix this? > seqdirectory brings empty files out > --- > > Key: MAHOUT-1220 > URL: https://issues.apache.org/jira/browse/MAHOUT-1220 > Project: Mahout > Issue Type: Bug >Affects Versions: 0.7 >Reporter: Summer Lee > Fix For: 0.7 > > > I ran "mahout seqdirectory" on my input file > --> command > mahout seqdirectory --input > user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output > /user/hdfs/mahout_test/output/final3/seqdirectory/ > but the result file, "chunk-0", contains only this. 
> --> chunk-0 > SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text > I heard that chunk-0 files should contain numbers, like > SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ... > I thought something was wrong with my input file, so I tried other > input files, but the results are the same. > How can I fix this?
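For context, the text shown for chunk-0 matches the readable part of a Hadoop SequenceFile header: the magic bytes `SEQ`, a version byte, then the key and value class names, each preceded by a length. The sketch below rebuilds only that header prefix by hand, without any Hadoop dependency, and prints its printable characters. The class `SeqHeaderSketch` and its methods are hypothetical, the version byte 6 and single-byte lengths are assumptions made for illustration, and the rest of a real header (compression flags, metadata, sync marker) is omitted. It does not establish whether records follow the header in the reporter's file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of a SequenceFile header prefix; does not use Hadoop. */
final class SeqHeaderSketch {
  /** Builds magic + version + two length-prefixed class names (names assumed shorter than 128 bytes). */
  static byte[] buildHeaderPrefix(String keyClass, String valueClass) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write('S');
    out.write('E');
    out.write('Q');
    out.write(6); // version byte, assumed here for illustration
    for (String name : new String[] {keyClass, valueClass}) {
      byte[] bytes = name.getBytes(StandardCharsets.UTF_8);
      out.write(bytes.length); // single-byte length, valid only for short names
      out.write(bytes, 0, bytes.length);
    }
    return out.toByteArray();
  }

  /** Keeps only printable ASCII, roughly what a text viewer would show legibly. */
  static String printable(byte[] data) {
    StringBuilder sb = new StringBuilder();
    for (byte b : data) {
      if (b >= 0x20 && b < 0x7f) {
        sb.append((char) b);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String text = "org.apache.hadoop.io.Text";
    System.out.println(printable(buildHeaderPrefix(text, text)));
    // prints SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
  }
}
```

Since the header is binary, viewing the file as text shows little beyond these class names either way; inspecting the actual key/value records requires a tool that understands the SequenceFile format.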
[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
[ https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated MAHOUT-1221: Status: Open (was: Patch Available) > SparseMatrix.viewRow is sometimes readonly > -- > > Key: MAHOUT-1221 > URL: https://issues.apache.org/jira/browse/MAHOUT-1221 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.7 >Reporter: Maysam Yabandeh >Priority: Minor > Labels: patch > Fix For: 0.8, 0.7 > > > The implementation returns a new vector if it already does not exist. But it > does not add the new vector to the matrix. So, the later changes will not be > reflected in the matrix. > {code:java} > if (res == null) { > res = newRandomAccessSparceVector(columnSize()); > //now the row must be added by assignRow(row, res); > } > return res; > {code} > An example in which this bug manifests is the following: > {code:title=QRDecomposition.java} > x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k))); > {code} > where Matrix x is not updated if it is an instance of SparseMatrix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
[ https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated MAHOUT-1221: Labels: patch (was: ) Status: Patch Available (was: Open) > SparseMatrix.viewRow is sometimes readonly > -- > > Key: MAHOUT-1221 > URL: https://issues.apache.org/jira/browse/MAHOUT-1221 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.7 >Reporter: Maysam Yabandeh >Priority: Minor > Labels: patch > Fix For: 0.8, 0.7 > > > The implementation returns a new vector if it already does not exist. But it > does not add the new vector to the matrix. So, the later changes will not be > reflected in the matrix. > {code:java} > if (res == null) { > res = newRandomAccessSparceVector(columnSize()); > //now the row must be added by assignRow(row, res); > } > return res; > {code} > An example in which this bug manifests is the following: > {code:title=QRDecomposition.java} > x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k))); > {code} > where Matrix x is not updated if it is an instance of SparseMatrix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
[ https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661764#comment-13661764 ] Suneel Marthi commented on MAHOUT-1221: --- Would you like to submit a patch? > SparseMatrix.viewRow is sometimes readonly > -- > > Key: MAHOUT-1221 > URL: https://issues.apache.org/jira/browse/MAHOUT-1221 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.7 >Reporter: Maysam Yabandeh >Priority: Minor > Fix For: 0.7, 0.8 > > > The implementation returns a new vector if it already does not exist. But it > does not add the new vector to the matrix. So, the later changes will not be > reflected in the matrix. > {code:java} > if (res == null) { > res = newRandomAccessSparceVector(columnSize()); > //now the row must be added by assignRow(row, res); > } > return res; > {code} > An example in which this bug manifests is the following: > {code:title=QRDecomposition.java} > x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k))); > {code} > where Matrix x is not updated if it is an instance of SparseMatrix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
[ https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated MAHOUT-1221: Description: The implementation returns a new vector if the row does not already exist, but it does not add the new vector to the matrix, so later changes will not be reflected in the matrix. {code:java} if (res == null) { res = new RandomAccessSparseVector(columnSize()); // the new row must now be added via assignRow(row, res) } return res; {code} An example in which this bug manifests is the following: {code:title=QRDecomposition.java} x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k))); {code} where Matrix x is not updated if it is an instance of SparseMatrix. was: The implementation returns a new vector if the row does not already exist, but it does not add the new vector to the matrix, so later changes will not be reflected in the matrix. {code:java} if (res == null) { res = new RandomAccessSparseVector(columnSize()); // the new row must now be added via assignRow(row, res) } return res; {code} > SparseMatrix.viewRow is sometimes readonly > -- > > Key: MAHOUT-1221 > URL: https://issues.apache.org/jira/browse/MAHOUT-1221 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.7 >Reporter: Maysam Yabandeh >Priority: Minor > Fix For: 0.7, 0.8 > > > The implementation returns a new vector if the row does not already exist, but it > does not add the new vector to the matrix, so later changes will not be > reflected in the matrix. > {code:java} > if (res == null) { > res = new RandomAccessSparseVector(columnSize()); > // the new row must now be added via assignRow(row, res) > } > return res; > {code} > An example in which this bug manifests is the following: > {code:title=QRDecomposition.java} > x.viewRow(k).assign(y.viewRow(k), Functions.plusMult(1 / r.get(k, k))); > {code} > where Matrix x is not updated if it is an instance of SparseMatrix. 
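The behavior described above can be reproduced in a toy model. This is a minimal sketch, not Mahout's code: `ToySparseMatrix` is a hypothetical class that stores rows as plain `double[]` arrays in a map, `viewRowDetached` mimics the reported behavior, and `viewRowAttached` shows one way to store the new row before returning it.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy row-keyed matrix; NOT Mahout's SparseMatrix, only a model of the reported behavior. */
final class ToySparseMatrix {
  private final Map<Integer, double[]> rows = new HashMap<>();
  private final int columns;

  ToySparseMatrix(int columns) {
    this.columns = columns;
  }

  /** Mirrors the reported bug: a missing row yields a fresh vector that is never stored. */
  double[] viewRowDetached(int row) {
    double[] res = rows.get(row);
    if (res == null) {
      res = new double[columns]; // not put into the map, so writes to it are lost
    }
    return res;
  }

  /** Stores the new row before returning it, so writes through the view reach the matrix. */
  double[] viewRowAttached(int row) {
    return rows.computeIfAbsent(row, r -> new double[columns]);
  }

  double get(int row, int column) {
    double[] res = rows.get(row);
    return res == null ? 0.0 : res[column];
  }

  public static void main(String[] args) {
    ToySparseMatrix m = new ToySparseMatrix(3);
    m.viewRowDetached(0)[1] = 5.0;
    System.out.println(m.get(0, 1)); // prints 0.0, the write was lost
    m.viewRowAttached(0)[1] = 5.0;
    System.out.println(m.get(0, 1)); // prints 5.0, the write is visible
  }
}
```

The QRDecomposition line quoted in the issue relies on the second behavior: it writes through the vector returned by `viewRow(k)` and expects the matrix to change.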
[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
[ https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated MAHOUT-1221: Description: The implementation returns a new vector if it already does not exist. But it does not add the new vector to the matrix. So, the later changes will not be reflected in the matrix. {code:java} if (res == null) { res = newRandomAccessSparceVector(columnSize()); //now the row must be added by assignRow(row, res); } return res; {code} was: The implementation returns a new vector if it already does not exist. But it does not add the new vector to the matrix. So, the later changes will not be reflected in the matrix. if (res == null) { res = newRandomAccessSparceVector(columnSize()); //now the row must be added by assignRow(row, res); } return res; > SparseMatrix.viewRow is sometimes readonly > -- > > Key: MAHOUT-1221 > URL: https://issues.apache.org/jira/browse/MAHOUT-1221 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.7 >Reporter: Maysam Yabandeh >Priority: Minor > Fix For: 0.7, 0.8 > > > The implementation returns a new vector if it already does not exist. But it > does not add the new vector to the matrix. So, the later changes will not be > reflected in the matrix. > {code:java} > if (res == null) { > res = newRandomAccessSparceVector(columnSize()); > //now the row must be added by assignRow(row, res); > } > return res; > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
[ https://issues.apache.org/jira/browse/MAHOUT-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated MAHOUT-1221: Description: The implementation returns a new vector if it already does not exist. But it does not add the new vector to the matrix. So, the later changes will not be reflected in the matrix. if (res == null) { res = newRandomAccessSparceVector(columnSize()); //now the row must be added by assignRow(row, res); } return res; > SparseMatrix.viewRow is sometimes readonly > -- > > Key: MAHOUT-1221 > URL: https://issues.apache.org/jira/browse/MAHOUT-1221 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.7 >Reporter: Maysam Yabandeh >Priority: Minor > Fix For: 0.7, 0.8 > > > The implementation returns a new vector if it already does not exist. But it > does not add the new vector to the matrix. So, the later changes will not be > reflected in the matrix. > if (res == null) { > res = newRandomAccessSparceVector(columnSize()); > //now the row must be added by assignRow(row, res); > } > return res; -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (MAHOUT-1221) SparseMatrix.viewRow is sometimes readonly
Maysam Yabandeh created MAHOUT-1221: --- Summary: SparseMatrix.viewRow is sometimes readonly Key: MAHOUT-1221 URL: https://issues.apache.org/jira/browse/MAHOUT-1221 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.7 Reporter: Maysam Yabandeh Priority: Minor Fix For: 0.8, 0.7 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1220) seqdirectory brings empty files out
[ https://issues.apache.org/jira/browse/MAHOUT-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Summer Lee updated MAHOUT-1220: --- Description: I put the input file on "mahout seqdirectory" --> command /engn001/sbp/bigpack/mahout/bin/mahout seqdirectory --input user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output /user/hdfs/mahout_test/output/final3/seqdirectory/ but the result file, "chunk-0" contains like this. --> chunk-0 SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text I heard that chunk-0 files should have number like SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ... I think my input file is something wrong, so I tried with other different input files but results are same. How can I fix this? was: I put the input file on "mahout seqdirectory" but the result file, "chunk-0" contains like this. --> chunk-0 SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text I heard that chunk-0 files should have number like SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ... I think my input file is something wrong, so I tried with other different input files but results are same. How can I fix this? > seqdirectory brings empty files out > --- > > Key: MAHOUT-1220 > URL: https://issues.apache.org/jira/browse/MAHOUT-1220 > Project: Mahout > Issue Type: Bug >Affects Versions: 0.7 >Reporter: Summer Lee > Fix For: 0.7 > > > I put the input file on "mahout seqdirectory" > --> command > /engn001/sbp/bigpack/mahout/bin/mahout seqdirectory --input > user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output > /user/hdfs/mahout_test/output/final3/seqdirectory/ > but the result file, "chunk-0" contains like this. > --> chunk-0 > SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text > I heard that chunk-0 files should have number like > SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ... > I think my input file is something wrong, so I tried with other different > input files but results are same. 
> How can I fix this? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1220) seqdirectory brings empty files out
[ https://issues.apache.org/jira/browse/MAHOUT-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Summer Lee updated MAHOUT-1220: --- Description: I put the input file on "mahout seqdirectory" but the result file, "chunk-0" contains like this. --> chunk-0 SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text I heard that chunk-0 files should have number like SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ... I think my input file is something wrong, so I tried with other different input files but results are same. How can I fix this? was: I put the input file on "mahout seqdirectory" but the result file, "chunk-0" contains like this. --> chunk-0 SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text I heard that chunk-0 files should have number like SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ... So I tried other different files but results are same. How can I fix this? > seqdirectory brings empty files out > --- > > Key: MAHOUT-1220 > URL: https://issues.apache.org/jira/browse/MAHOUT-1220 > Project: Mahout > Issue Type: Bug >Affects Versions: 0.7 >Reporter: Summer Lee > Fix For: 0.7 > > > I put the input file on "mahout seqdirectory" but the result file, "chunk-0" > contains like this. > --> chunk-0 > SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text > I heard that chunk-0 files should have number like > SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ... > I think my input file is something wrong, so I tried with other different > input files but results are same. > How can I fix this? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661751#comment-13661751 ] Yiqun Hu commented on MAHOUT-1214: -- I haven't had a chance to study the ball k-means or streaming k-means algorithms. My comment is the same as Quinn's: if the performance and accuracy of SSVD are verified seriously (I mean by comparing with the Lanczos solver in different situations), it will be a good idea. > Improve the accuracy of the Spectral KMeans Method > -- > > Key: MAHOUT-1214 > URL: https://issues.apache.org/jira/browse/MAHOUT-1214 > Project: Mahout > Issue Type: Improvement > Components: Clustering >Affects Versions: 0.7 > Environment: Mahout 0.7 >Reporter: Yiqun Hu > Labels: clustering, improvement > > The current implementation of the spectral KMeans algorithm (Andrew Ng et al., > NIPS 2002) in version 0.7 has two serious issues. These two incorrect > implementations make it fail even for a very obvious trivial dataset. We have > implemented a solution to resolve these two issues and hope to contribute > back to the community. > # Issue 1: > The EigenVerificationJob in version 0.7 does not check the orthogonality of > eigenvectors, which is necessary to obtain the correct clustering results for > the case of K>1; we have an idea and implementation to select based on > cosAngle/orthogonality; > # Issue 2: > The random seed initialization of the KMeans algorithm is not optimal and > sometimes a bad initialization will generate a wrong clustering result. In this > case, the K selected eigenvectors actually provide a better way to initialize > cluster centroids because each selected eigenvector is a relaxed indicator of > the memberships of one cluster. For every selected eigenvector, we use the > data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on a synthetic dataset and it shows > that the improved version gets the optimal clustering result while the current > 0.7 version obtains the wrong result.
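The two fixes proposed in the issue description can be illustrated with a small sketch. It is written in Python for brevity and is not the reporter's implementation or Mahout code: cos_angle, select_orthogonal, initial_centroid_indices and the tolerance tol are hypothetical names, the greedy selection rule is one assumed way to select "based on cosAngle/orthogonality", and each eigenvector is assumed to be a non-zero list with one component per data point.

```python
import math


def cos_angle(u, v):
    """Cosine of the angle between two non-zero vectors; 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_orthogonal(eigenvectors, k, tol=1e-6):
    """Greedily keep up to k vectors that are mutually orthogonal within tol (assumed rule)."""
    selected = []
    for vec in eigenvectors:
        if all(abs(cos_angle(vec, kept)) <= tol for kept in selected):
            selected.append(vec)
        if len(selected) == k:
            break
    return selected


def initial_centroid_indices(selected):
    """For each selected eigenvector, the index of the data point whose
    component has the largest absolute value (Issue 2 in the description)."""
    return [max(range(len(vec)), key=lambda i: abs(vec[i])) for vec in selected]


# Toy candidates over four data points, for illustration only.
v1 = [0.8, 0.6, 0.0, 0.0]
v2 = [0.6, 0.8, 0.0, 0.0]   # far from orthogonal to v1, so it is skipped
v3 = [0.0, 0.0, -0.9, 0.1]  # orthogonal to v1
chosen = select_orthogonal([v1, v2, v3], k=2)
seeds = initial_centroid_indices(chosen)
```

Here chosen holds v1 and v3, and seeds names data points 0 and 2 as the initial centroids, one per selected eigenvector, in place of a random seed.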
[jira] [Updated] (MAHOUT-1220) seqdirectory brings empty files out
[ https://issues.apache.org/jira/browse/MAHOUT-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Summer Lee updated MAHOUT-1220: --- Description: I put the input file on "mahout seqdirectory" but the result file, "chunk-0" contains like this. --> chunk-0 SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text I heard that chunk-0 files should have number like SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ... So I tried other different files but results are same. How can I fix this? was: I put the input file on seqdirectory but result file "chunk-0" contains like this. --> chunk-0 SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text I tried other different files but results are same. How can I fix this?
[jira] [Created] (MAHOUT-1220) seqdirectory brings empty files out
Summer Lee created MAHOUT-1220: -- Summary: seqdirectory brings empty files out Key: MAHOUT-1220 URL: https://issues.apache.org/jira/browse/MAHOUT-1220 Project: Mahout Issue Type: Bug Affects Versions: 0.7 Reporter: Summer Lee Fix For: 0.7 I put the input file on seqdirectory but result file "chunk-0" contains like this. --> chunk-0 SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text I tried other different files but results are same. How can I fix this?
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661612#comment-13661612 ] Ted Dunning commented on MAHOUT-1214: - {quote} You mean use SSVD exclusively in place of Lanczos? {quote} Yes. Exactly.
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661587#comment-13661587 ] Shannon Quinn commented on MAHOUT-1214: --- Ted, I'm not sure I follow. You mean use SSVD exclusively in place of Lanczos? I'd love to assess performance and accuracy with ball or streaming k-means instead. That's an excellent idea.