Test failure in TDigestTest
Got this error running tests; anyone know what causes this?

Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 40.063 sec <<< FAILURE! - in org.apache.mahout.math.stats.TDigestTest
testSequentialPoints(org.apache.mahout.math.stats.TDigestTest)  Time elapsed: 4.674 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0.5> but was:<0.49489>
[jira] [Comment Edited] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable
[ https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883887#comment-13883887 ]

Suneel Marthi edited comment on MAHOUT-1030 at 1/28/14 8:20 AM:

Patch committed.

> Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable
>
>              Key: MAHOUT-1030
>              URL: https://issues.apache.org/jira/browse/MAHOUT-1030
>          Project: Mahout
>       Issue Type: Bug
>       Components: Clustering, Integration
> Affects Versions: 0.7
>         Reporter: Jeff Eastman
>         Assignee: Andrew Musselman
>          Fix For: 0.9
>      Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, Mahout-1030.patch
>
> Looks like this won't make it into this build. Pretty widespread impact on code and tests, and I don't know which properties were implemented in the old version. I will create a JIRA and post my interim results.
>
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> That's a regression that evidently got in when the new ClusterClassificationDriver was introduced. It should be a pretty easy fix, and I will see if I can make the change before Paritosh cuts the release bits tonight.
>
> On 6/7/12 1:00 PM, Pat Ferrel wrote:
> It appears that in kmeans the clusteredPoints are now written as WeightedVectorWritable, whereas in Mahout 0.6 they were WeightedPropertyVectorWritable? This means that the distance from the centroid is no longer stored here? Why? I hope I'm wrong, because that is not a welcome change. How is one to order clustered docs by distance from cluster centroid? I'm sure I could calculate the distance, but that would mean looking up the centroid for the cluster id given in the above WeightedVectorWritable, which means iterating through all the clusters for each clustered doc. In my case the number of clusters could be fairly large. Am I missing something?

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
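As an aside on Pat's question in the quoted thread above: ordering clustered points by distance from their centroid reduces to a plain distance computation once the centroids are loaded into a map keyed by cluster id. The sketch below is illustrative only, using plain Java arrays and hypothetical names rather than Mahout's Vector and WeightedPropertyVectorWritable API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not Mahout API): recompute the point-to-centroid
// distance from a map of centroids loaded once, keyed by cluster id.
public class CentroidDistanceSketch {
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Hypothetical centroids, loaded once rather than per point.
        Map<Integer, double[]> centroids = new HashMap<>();
        centroids.put(0, new double[]{0.0, 0.0});
        centroids.put(1, new double[]{3.0, 4.0});

        double[] point = {3.0, 0.0};
        int clusterId = 1;
        double dist = euclideanDistance(point, centroids.get(clusterId));
        System.out.println("distance to centroid of cluster " + clusterId + ": " + dist); // 4.0
    }
}
```

With the centroids held in memory, each clustered point needs only one map lookup plus one distance computation, avoiding the per-point iteration over all clusters that Pat was worried about.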
[jira] [Commented] (MAHOUT-1310) 100% unit test pass for mahout during build on Windows
[ https://issues.apache.org/jira/browse/MAHOUT-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883893#comment-13883893 ]

Hudson commented on MAHOUT-1310:

SUCCESS: Integrated in Mahout-Quality #2448 (See [https://builds.apache.org/job/Mahout-Quality/2448/])
MAHOUT-1310: Changed method signatures to remove unused DistanceMeasure parameter. (smarthi: rev 1561975)
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansDriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansDriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/spectral/kmeans/SpectralKMeansDriver.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterCountReaderTest.java
* /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayFuzzyKMeans.java
* /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayKMeans.java
* /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
* /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
* /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
* /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterEvaluator.java
* /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/cdbw/TestCDbwEvaluator.java

> 100% unit test pass for mahout during build on Windows
>
>              Key: MAHOUT-1310
>              URL: https://issues.apache.org/jira/browse/MAHOUT-1310
>          Project: Mahout
>       Issue Type: Task
> Affects Versions: 0.7
>      Environment: Operating system: Windows Server
>         Reporter: Sergey Svinarchuk
>           Labels: patch
>      Attachments: patchfile.patch
>
> Mahout must build on Windows without exception and unit tests must pass 100%.
Re: Test failure in TDigestTest
Was this repeatable? If transient, then the test likely has a variable seed. This test is statistical and may have too tight a tolerance.

Sent from my iPhone

On Jan 28, 2014, at 0:00, Andrew Musselman andrew.mussel...@gmail.com wrote:

> Got this error running tests; anyone know what causes this?
>
> Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 40.063 sec <<< FAILURE! - in org.apache.mahout.math.stats.TDigestTest
> testSequentialPoints(org.apache.mahout.math.stats.TDigestTest)  Time elapsed: 4.674 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0.5> but was:<0.49489>
Re: Test failure in TDigestTest
These failures are not repeatable; I had seen this happen a few times. The tolerance margin for this statistical test is presently set at 0.005. I once had a test failure that read:

java.lang.AssertionError: expected:<0.5> but was:<0.50578>

Maybe change the fuzz factor for this test from the present 0.005 to 0.006?

On Tuesday, January 28, 2014 7:00 AM, Ted Dunning ted.dunn...@gmail.com wrote:

> Was this repeatable? If transient, then the test likely has a variable seed. This test is statistical and may have too tight a tolerance.
>
> Sent from my iPhone
>
> On Jan 28, 2014, at 0:00, Andrew Musselman andrew.mussel...@gmail.com wrote:
>
>> Got this error running tests; anyone know what causes this?
>>
>> Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 40.063 sec <<< FAILURE! - in org.apache.mahout.math.stats.TDigestTest
>> testSequentialPoints(org.apache.mahout.math.stats.TDigestTest)  Time elapsed: 4.674 sec  <<< FAILURE!
>> java.lang.AssertionError: expected:<0.5> but was:<0.49489>
[jira] [Commented] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable
[ https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884313#comment-13884313 ]

Pat Ferrel commented on MAHOUT-1030:

Ran it through KMeans, FuzzyKMeans, sequential and m/r, and all are producing distances in the right range. Looks closed to me.

> Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable
[jira] [Commented] (MAHOUT-742) Pagerank implementation in Map/Reduce
[ https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884332#comment-13884332 ]

Nilesh Chakraborty commented on MAHOUT-742:

Hey [~ssc], I've been working on an implementation of large-scale blockwise matrix-vector multiplication using a single MapReduce job. The current algorithms and implementations need two MapReduce jobs for blockwise multiplication (or any sparse mat-vec mult where the vector isn't stored in memory). I will be using it to implement PageRank. I'll benchmark my implementation against the state of the art in MapReduce-based PageRank, Pegasus (they've contributed the Pegasus code to https://github.com/intel-hadoop/HiBench). If my version turns out to be faster, I'll write the code for algorithms like SVD and the Lanczos algorithm (http://en.wikipedia.org/wiki/Lanczos_algorithm) too. Do you think these can make for a useful contribution to Mahout? I need to keep that in mind before I go forward with coding.

> Pagerank implementation in Map/Reduce
>
>              Key: MAHOUT-742
>              URL: https://issues.apache.org/jira/browse/MAHOUT-742
>          Project: Mahout
>       Issue Type: New Feature
>       Components: Graph
> Affects Versions: 0.6
>         Reporter: Christoph Nagel
>         Assignee: Sebastian Schelter
>          Fix For: 0.6
>      Attachments: MAHOUT-742.patch
>
> Hi, my name is Christoph Nagel. I'm a student at the Technical University of Berlin, participating in the course run by Isabel Drost and Sebastian Schelter. My task is to implement the PageRank algorithm for the case where the PageRank vector fits in memory. For the computation I used the naive algorithm shown in the book 'Mining of Massive Datasets' by Rajaraman and Ullman (http://www-scf.usc.edu/~csci572/2012Spring/UllmanMiningMassiveDataSets.pdf). Matrix and vector multiplication are done with Mahout methods. Most of the work is the transformation of the input graph, which has to consist of a nodes file and an edges file.
>
> Format of nodes file: node\n
> Format of edges file: startNode\tendNode\n
>
> Therefore I created the following classes:
> * LineIndexer: assigns each line an index
> * EdgesToIndex: indexes the nodes of the edges
> * EdgesIndexToTransitionMatrix: creates the transition matrix
> * Pagerank: computes PR from the transition matrix
> * JoinNodesWithPagerank: creates the joined output
> * PagerankExampleJob: does the complete job
>
> Each class has a test (except PagerankExampleJob) and I took the example from the book for evaluating.
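For readers unfamiliar with the algorithm Christoph references, the naive PageRank from 'Mining of Massive Datasets' can be sketched as an in-memory power iteration. This is an illustrative stand-in, not the attached MapReduce implementation; the class and method names below are made up for the example.

```java
// Minimal in-memory power-iteration sketch of the naive PageRank algorithm.
// The transition matrix is dense here purely for illustration.
public class PageRankSketch {
    // transition[i][j] = probability of moving from node j to node i;
    // each column must sum to 1 (a column-stochastic matrix).
    static double[] iterate(double[][] transition, double beta, int steps) {
        int n = transition.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n); // start from the uniform vector
        for (int s = 0; s < steps; s++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    sum += transition[i][j] * rank[j];
                }
                // taxation term guards against dead ends and spider traps
                next[i] = beta * sum + (1 - beta) / n;
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Toy graph: node 0 links to 1; node 1 links to 0 and 2; node 2 links to 0.
        double[][] m = {
            {0.0, 0.5, 1.0},
            {1.0, 0.0, 0.0},
            {0.0, 0.5, 0.0},
        };
        double[] pr = iterate(m, 0.85, 50);
        for (int i = 0; i < pr.length; i++) {
            System.out.printf("node %d: %.4f%n", i, pr[i]);
        }
    }
}
```

Christoph's pipeline builds the same transition matrix from the nodes and edges files (via EdgesIndexToTransitionMatrix) and runs the equivalent iteration with Mahout's matrix-vector operations instead of nested loops.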
Re: [jira] [Commented] (MAHOUT-742) Pagerank implementation in Map/Reduce
Hi Nilesh,

We've had an implementation of PageRank, which was removed a few releases ago. I don't think PageRank would be a valuable contribution, because I don't see a MapReduce-based implementation being able to compete with systems such as Apache Giraph that also run on a standard Hadoop cluster. If you wanted to work on a block-based version of our matrix multiplication code (which is used in some algorithms such as our existing Lanczos implementation, afaik), that would be a very valuable contribution, however.

@list Any other opinions on that?

Best,
Sebastian

On 01/28/2014 06:07 PM, Nilesh Chakraborty (JIRA) wrote:
> Hey [~ssc], I've been working on an implementation of large-scale blockwise matrix-vector multiplication using a single MapReduce job. The current algorithms and implementations need two MapReduce jobs for blockwise multiplication (or any sparse mat-vec mult where the vector isn't stored in memory). I will be using it to implement PageRank. I'll benchmark my implementation against the state of the art in MapReduce-based PageRank, Pegasus (they've contributed the Pegasus code to https://github.com/intel-hadoop/HiBench). If my version turns out to be faster, I'll write the code for algorithms like SVD and the Lanczos algorithm (http://en.wikipedia.org/wiki/Lanczos_algorithm) too. Do you think these can make for a useful contribution to Mahout? I need to keep that in mind before I go forward with coding.
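The blockwise matrix-vector multiplication Nilesh describes can be illustrated, minus the MapReduce machinery, as follows. Each (row block, column block) pair stands in for the unit of work a single map task would handle, and the summing of partial results stands in for the reduce phase. All names and structure here are assumptions for illustration, not Nilesh's actual code.

```java
// Hedged sketch of blockwise matrix-vector multiplication in plain Java.
public class BlockMatVecSketch {
    static double[] multiplyBlockwise(double[][] matrix, double[] vector, int blockSize) {
        int n = matrix.length;
        double[] result = new double[n];
        // Each (rowBlock, colBlock) pair is an independent unit of work;
        // in MapReduce each would be one map task's input split.
        for (int rb = 0; rb < n; rb += blockSize) {
            for (int cb = 0; cb < vector.length; cb += blockSize) {
                int rEnd = Math.min(rb + blockSize, n);
                int cEnd = Math.min(cb + blockSize, vector.length);
                for (int i = rb; i < rEnd; i++) {
                    double partial = 0.0;
                    for (int j = cb; j < cEnd; j++) {
                        partial += matrix[i][j] * vector[j];
                    }
                    result[i] += partial; // stand-in for the reducer-side sum of partials
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}};
        double[] v = {1, 1};
        double[] r = multiplyBlockwise(m, v, 1);
        System.out.println(r[0] + ", " + r[1]); // 3.0, 7.0
    }
}
```

The single-job claim in the thread hinges on combining these partial sums in one reduce phase instead of needing a second job to regroup them.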
[jira] [Comment Edited] (MAHOUT-742) Pagerank implementation in Map/Reduce
[ https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884332#comment-13884332 ]

Nilesh Chakraborty edited comment on MAHOUT-742 at 1/28/14 5:08 PM:

Hey [~ssc]! I've been working on an implementation of large-scale blockwise matrix-vector multiplication using a single MapReduce job. The current algorithms and implementations need two MapReduce jobs for blockwise multiplication (or any sparse mat-vec mult where the vector isn't stored in memory). I will be using it to implement PageRank. I'll benchmark my implementation against the state of the art in MapReduce-based PageRank, Pegasus (they've contributed the Pegasus code to https://github.com/intel-hadoop/HiBench). If my version turns out to be faster, I'll write the code for algorithms like SVD and the Lanczos algorithm (http://en.wikipedia.org/wiki/Lanczos_algorithm) too. Do you think these can make for a useful contribution to Mahout? I need to keep that in mind before I go forward with coding.
[jira] [Commented] (MAHOUT-742) Pagerank implementation in Map/Reduce
[ https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884338#comment-13884338 ]

Nilesh Chakraborty commented on MAHOUT-742:

Also, what do I need to know about Mahout's policy for using third-party linear algebra libraries like MTJ, Colt, etc.? Say I need to borrow a lot of functionality from one such library; do I need to rewrite the code so as to eliminate any dependencies on such libraries? What about Apache Commons? I'd also appreciate it if you could give me some pointers/resources where such guidelines are detailed.
[jira] [Commented] (MAHOUT-742) Pagerank implementation in Map/Reduce
[ https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884344#comment-13884344 ]

Sebastian Schelter commented on MAHOUT-742:

Mahout also contains a math library; first you should check whether it already contains what you need :)
Re: Test failure in TDigestTest
Yes. Increasing the threshold is a good idea.

On Tue, Jan 28, 2014 at 7:08 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

> These failures are not repeatable; I had seen this happen a few times. The tolerance margin for this statistical test is presently set at 0.005. I once had a test failure that read:
>
> java.lang.AssertionError: expected:<0.5> but was:<0.50578>
>
> Maybe change the fuzz factor for this test from the present 0.005 to 0.006?
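The fix under discussion, a fixed seed plus a wider tolerance, can be sketched like this. The code below is a hypothetical stand-alone example, not the actual TDigestTest: it approximates a quantile by sorting a seeded uniform sample and checks the median estimate against the proposed 0.006 tolerance.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of a repeatable statistical test: fixed seed + explicit tolerance.
public class QuantileToleranceSketch {
    // Estimate the value at quantile q by sorting a sample
    // (a crude stand-in for a t-digest's quantile estimate).
    static double estimateQuantile(double[] samples, double q) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.min(sorted.length - 1, Math.floor(q * sorted.length));
        return sorted[idx];
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed makes the test repeatable
        double[] samples = new double[100000];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = rng.nextDouble(); // uniform on [0, 1)
        }
        double est = estimateQuantile(samples, 0.5);
        double tolerance = 0.006; // the widened fuzz factor discussed above
        if (Math.abs(est - 0.5) > tolerance) {
            throw new AssertionError("expected:<0.5> but was:<" + est + ">");
        }
        System.out.println("median estimate within tolerance: " + est);
    }
}
```

With a variable seed, a 0.005 tolerance on a statistical estimate will occasionally fail by chance, which matches the transient failures reported in the thread; either fixing the seed or widening the tolerance makes the test stable.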
[jira] [Created] (MAHOUT-1411) Random test failures from TDigestTest
Suneel Marthi created MAHOUT-1411:
-------------------------------------

         Summary: Random test failures from TDigestTest
             Key: MAHOUT-1411
             URL: https://issues.apache.org/jira/browse/MAHOUT-1411
         Project: Mahout
      Issue Type: Bug
      Components: Math
Affects Versions: 0.8
        Reporter: Suneel Marthi
        Assignee: Suneel Marthi
        Priority: Minor
         Fix For: 0.9

Seeing random test failures like the below from TDigestTest. These errors are not repeatable.

{code}
testUniform(org.apache.mahout.math.stats.TDigestTest)  Time elapsed: 0.356 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0.5> but was:<0.50578>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:494)
	at org.junit.Assert.assertEquals(Assert.java:592)
	at org.apache.mahout.math.stats.TDigestTest.runTest(TDigestTest.java:373)
	at org.apache.mahout.math.stats.TDigestTest.testUniform(TDigestTest.java:79)

Results :

Failed tests:
  TDigestTest.testUniform:79->runTest:373 expected:<0.5> but was:<0.50578>
{code}

{code}
Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 40.063 sec <<< FAILURE! - in org.apache.mahout.math.stats.TDigestTest
testSequentialPoints(org.apache.mahout.math.stats.TDigestTest)  Time elapsed: 4.674 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0.5> but was:<0.49489>
{code}
[jira] [Resolved] (MAHOUT-1411) Random test failures from TDigestTest
[ https://issues.apache.org/jira/browse/MAHOUT-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi resolved MAHOUT-1411.
-----------------------------------
Resolution: Fixed

Increased the tolerance limit to 0.006 (from the present 0.005) to handle the random test failures.

> Random test failures from TDigestTest
[jira] [Commented] (MAHOUT-742) Pagerank implementation in Map/Reduce
[ https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884433#comment-13884433 ]

Nilesh Chakraborty commented on MAHOUT-742:

My bad, I didn't know that Mahout's org.apache.mahout.math.Matrix and friends were so full-featured. Thanks. Then it shouldn't be any problem. :-) Actually I had come across MAHOUT-879 (Remove all graph algorithms with the exception of PageRank) and was just checking with you whether large-scale sparse mat-vec mult and PageRank implementations in MapReduce are welcome.
[jira] [Comment Edited] (MAHOUT-742) Pagerank implementation in Map/Reduce
[ https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884433#comment-13884433 ]

Nilesh Chakraborty edited comment on MAHOUT-742 at 1/28/14 7:00 PM:

My bad, didn't know that Mahout org.apache.mahout.math.Matrix and her friends were so full-featured. Thanks. Then it shouldn't be any problem. :-) Actually I had come across MAHOUT-879 (Remove all graph algorithms with the exception of PageRank) and was just checking with you if large-scale sparse mat-vec mult and PageRank implementations in MapReduce are welcome.
[jira] [Commented] (MAHOUT-1411) Random test failures from TDigestTest
[ https://issues.apache.org/jira/browse/MAHOUT-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884473#comment-13884473 ]

Hudson commented on MAHOUT-1411:

SUCCESS: Integrated in Mahout-Quality #2449 (See [https://builds.apache.org/job/Mahout-Quality/2449/])
MAHOUT-1411: Random test failures from TDigestTest (smarthi: rev 1562146)
* /mahout/trunk/CHANGELOG
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/stats/TDigestTest.java

> Random test failures from TDigestTest
[jira] [Resolved] (MAHOUT-1021) Blank csv input file given to Canopy/Kmeans clustering
[ https://issues.apache.org/jira/browse/MAHOUT-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi resolved MAHOUT-1021.
Resolution: Won't Fix

This issue was reported against Mahout 0.6; the codebase has since changed, and the reported issue cannot be reproduced against the present trunk. Resolving this as 'Won't Fix'.

Blank csv input file given to Canopy/Kmeans clustering

Key: MAHOUT-1021
URL: https://issues.apache.org/jira/browse/MAHOUT-1021
Project: Mahout
Issue Type: Improvement
Components: Integration
Affects Versions: 0.6
Environment: Mahout 0.6 on Hadoop 0.2, testing on the HadoopOnAzure platform
Reporter: Nabarun Sengupta
Assignee: Suneel Marthi
Priority: Minor
Fix For: Backlog

Hi,

This is regarding a bug that we observed in Canopy clustering; we could reproduce the same in k-means. Given a blank CSV input file, the algorithm executes two jobs and throws an error during the third. When I executed a malformed CSV file containing decimals or characters, I received an error during the first job itself. The same validation should therefore be applied when the input file is blank, with the exception thrown during the first job. Following are the job execution details:

{code}
c:\Apps\dist\mahout\examples\bin>build-cluster-syntheticcontrol.cmd
Please select a number to choose the corresponding clustering algorithm
1. canopy clustering
2. kmeans clustering
3. fuzzykmeans clustering
4. dirichlet clustering
5. meanshift clustering
Enter your choice: 1
You chose 1 and we'll use canopy Clustering
DFS is healthy...
Uploading Synthetic control data to HDFS
Deleted hdfs://10.114.251.23:9000/user/milind/testdata
Successfully Uploaded Synthetic control data to HDFS
Running on hadoop, using HADOOP_HOME=c:\Apps\dist
c:\Apps\dist\bin\hadoop jar c:\Apps\dist\mahout\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver org.apache.mahout.clustering.syntheticcontrol.canopy.Job
05/17 10:46:11 WARN driver.MahoutDriver: No org.apache.mahout.clustering.syntheticcontrol.canopy.Job.props found on classpath, will use command-line arguments only
05/17 10:46:11 INFO canopy.Job: Running with default arguments
05/17 10:46:12 INFO common.HadoopUtil: Deleting output
05/17 10:46:12 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool
05/17 10:46:13 INFO input.FileInputFormat: Total input paths to process : 1
05/17 10:46:14 INFO mapred.JobClient: Running job: job_201205170655_0017
05/17 10:46:15 INFO mapred.JobClient:  map 0% reduce 0%
05/17 10:46:48 INFO mapred.JobClient:  map 100% reduce 0%
05/17 10:46:59 INFO mapred.JobClient: Job complete: job_201205170655_0017
05/17 10:46:59 INFO mapred.JobClient: Counters: 15
05/17 10:46:59 INFO mapred.JobClient:   Job Counters
05/17 10:46:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=29672
05/17 10:46:59 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
05/17 10:46:59 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
05/17 10:46:59 INFO mapred.JobClient:     Launched map tasks=1
05/17 10:46:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
05/17 10:46:59 INFO mapred.JobClient:   File Output Format Counters
05/17 10:46:59 INFO mapred.JobClient:     Bytes Written=90
05/17 10:46:59 INFO mapred.JobClient:   FileSystemCounters
05/17 10:46:59 INFO mapred.JobClient:     FILE_BYTES_READ=130
05/17 10:46:59 INFO mapred.JobClient:     HDFS_BYTES_READ=134
05/17 10:46:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=21557
05/17 10:46:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=90
05/17 10:46:59 INFO mapred.JobClient:   File Input Format Counters
05/17 10:46:59 INFO mapred.JobClient:     Bytes Read=0
05/17 10:46:59 INFO mapred.JobClient:   Map-Reduce Framework
05/17 10:46:59 INFO mapred.JobClient:     Map input records=0
05/17 10:46:59 INFO mapred.JobClient:     Spilled Records=0
05/17 10:46:59 INFO mapred.JobClient:     Map output records=0
05/17 10:46:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=134
05/17 10:46:59 INFO canopy.CanopyDriver: Build Clusters Input: output/data Out: output Measure: org.apache.mahout.common.dist sure@6eedf759 t1: 80.0 t2: 55.0
05/17 10:46:59 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool
05/17 10:46:59 INFO input.FileInputFormat: Total input paths to process : 1
05/17 10:47:00 INFO mapred.JobClient: Running job: job_201205170655_0018
05/17 10:47:01 INFO mapred.JobClient:  map 0% reduce 0%
05/17 10:47:33 INFO mapred.JobClient:  map 100% reduce 0%
05/17 10:47:51 INFO mapred.JobClient:  map 100% reduce 100%
05/17 10:48:02 INFO
{code}
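The validation the reporter asks for amounts to a fail-fast check on the input before any job is launched, so a blank file is rejected up front rather than failing partway through the third job. A hedged sketch of such a check (illustrative names only, not the actual Mahout driver API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative fail-fast input check: reject an empty or missing input
// file before any MapReduce job is submitted.
public class InputValidation {
    public static void validateNotEmpty(Path input) throws IOException {
        if (!Files.exists(input) || Files.size(input) == 0) {
            throw new IllegalStateException("Input file is empty or missing: " + input);
        }
    }

    public static void main(String[] args) throws IOException {
        Path input = Files.createTempFile("testdata", ".csv");   // created empty, like the blank CSV
        try {
            validateNotEmpty(input);
        } catch (IllegalStateException e) {
            // With this check in the driver, the run stops here instead of
            // launching two jobs and erroring during the third.
            System.out.println("Rejected before launching any job: " + e.getMessage());
        } finally {
            Files.delete(input);
        }
    }
}
```

In a real driver, a check like this (or one against the HDFS path via FileSystem) would run once at job-submission time; its cost is negligible compared to launching even one MapReduce job.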
Mahout 0.9 Release
Fixed the issues that were reported with the clustering code this past week, and upgraded the codebase to Lucene 4.6.1, which was released today.

Here's the URL for the 0.9 release in staging:
https://repository.apache.org/content/repositories/orgapachemahout-1004/org/apache/mahout/mahout-distribution/0.9/

The artifacts have been signed with the following key:
https://people.apache.org/keys/committer/smarthi.asc

Please:
a) Verify that you can unpack the release (tar or zip)
b) Verify that you are able to compile the distribution
c) Run through the unit tests: mvn clean test
d) Run the example scripts under $MAHOUT_HOME/examples/bin, running through all the different options in each script.

A minimum of 3 '+1' votes from the PMC is needed for the release to be finalized.
RE: Mahout 0.9 Release
a), b), c), d) all passed here. CosineDistance values of clustered points from cluster-reuters.sh -1 (kmeans) were within the range [0,1].

Date: Tue, 28 Jan 2014 16:45:42 -0800
From: suneel_mar...@yahoo.com
Subject: Mahout 0.9 Release
To: u...@mahout.apache.org; dev@mahout.apache.org
Re: Mahout 0.9 Release
Looks good. +1

On Tue, Jan 28, 2014 at 8:07 PM, Andrew Palumbo <ap@outlook.com> wrote:
> a), b), c), d) all passed here. CosineDistance of clustered points from cluster-reuters.sh -1 kmeans were within the range [0,1].