Test failure in TDigestTest

2014-01-28 Thread Andrew Musselman
Got this error running tests; anyone know what causes this?

Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 40.063 sec FAILURE! - in org.apache.mahout.math.stats.TDigestTest
testSequentialPoints(org.apache.mahout.math.stats.TDigestTest)  Time elapsed: 4.674 sec   FAILURE!
java.lang.AssertionError: expected:0.5 but was:0.49489


[jira] [Comment Edited] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

2014-01-28 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883887#comment-13883887
 ] 

Suneel Marthi edited comment on MAHOUT-1030 at 1/28/14 8:20 AM:


Patch committed.



 Regression: Clustered Points Should be WeightedPropertyVectorWritable not 
 WeightedVectorWritable
 

 Key: MAHOUT-1030
 URL: https://issues.apache.org/jira/browse/MAHOUT-1030
 Project: Mahout
  Issue Type: Bug
  Components: Clustering, Integration
Affects Versions: 0.7
Reporter: Jeff Eastman
Assignee: Andrew Musselman
 Fix For: 0.9

 Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, 
 MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, 
 Mahout-1030.patch


 Looks like this won't make it into this build. Pretty widespread impact on 
 code and tests and I don't know which properties were implemented in the old 
 version. I will create a JIRA and post my interim results.
 On 6/8/12 12:21 PM, Jeff Eastman wrote:
  That's a reversion that evidently got in when the new 
  ClusterClassificationDriver was introduced. It should be a pretty easy fix 
  and I will see if I can make the change before Paritosh cuts the release 
  bits tonight.
 
  On 6/7/12 1:00 PM, Pat Ferrel wrote:
  It appears that in kmeans the clusteredPoints are now written as 
  WeightedVectorWritable where in mahout 0.6 they were 
  WeightedPropertyVectorWritable? This means that the distance from the 
  centroid is no longer stored here? Why? I hope I'm wrong because that is 
  not a welcome change. How is one to order clustered docs by distance from 
  cluster centroid?
 
  I'm sure I could calculate the distance but that would mean looking up the 
  centroid for the cluster id given in the above WeightedVectorWritable, 
  which means iterating through all the clusters for each clustered doc. In 
  my case the number of clusters could be fairly large.
 
  Am I missing something?
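
A minimal sketch of the recomputation described above, assuming the centroids are loaded once into a map keyed by cluster id so that each clustered point costs one hash lookup instead of a scan over all clusters. The class and method names (`CentroidDistanceSketch`, `distancesToCentroids`), the plain `double[]` vectors and the Euclidean distance are illustrative assumptions, not Mahout API; whether all centroids fit in memory depends on the number of clusters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout.
final class CentroidDistanceSketch {

  // Euclidean distance between two equal-length vectors.
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // For each point, look up its centroid by cluster id and compute the distance.
  static List<Double> distancesToCentroids(Map<Integer, double[]> centroids,
                                           int[] clusterIds,
                                           double[][] points) {
    List<Double> result = new ArrayList<>();
    for (int i = 0; i < points.length; i++) {
      double[] centroid = centroids.get(clusterIds[i]);
      result.add(distance(points[i], centroid));
    }
    return result;
  }

  public static void main(String[] args) {
    Map<Integer, double[]> centroids = new HashMap<>();
    centroids.put(7, new double[] {0.0, 0.0});
    centroids.put(9, new double[] {10.0, 10.0});
    int[] ids = {7, 9};
    double[][] pts = {{3.0, 4.0}, {10.0, 12.0}};
    System.out.println(distancesToCentroids(centroids, ids, pts));
  }
}
```

Storing the distance alongside each clustered point, as WeightedPropertyVectorWritable did in 0.6 according to the report, avoids this second pass entirely.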
 
 
 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1310) 100% unit test pass for mahout during build on Windows

2014-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883893#comment-13883893
 ] 

Hudson commented on MAHOUT-1310:


SUCCESS: Integrated in Mahout-Quality #2448 (See 
[https://builds.apache.org/job/Mahout-Quality/2448/])
MAHOUT-1310: Changed method signatures to remove unused DistanceMeasure 
parameter. (smarthi: rev 1561975)
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansDriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansDriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/spectral/kmeans/SpectralKMeansDriver.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterCountReaderTest.java
* /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayFuzzyKMeans.java
* /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayKMeans.java
* /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
* /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
* /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
* /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterEvaluator.java
* /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/cdbw/TestCDbwEvaluator.java


 100% unit test pass for mahout during build on Windows
 --

 Key: MAHOUT-1310
 URL: https://issues.apache.org/jira/browse/MAHOUT-1310
 Project: Mahout
  Issue Type: Task
Affects Versions: 0.7
 Environment: Operating system: Windows Server
Reporter: Sergey Svinarchuk
  Labels: patch
 Attachments: patchfile.patch


 Mahout must build on Windows without exceptions, and 100% of the unit tests 
 must pass.





Re: Test failure in TDigestTest

2014-01-28 Thread Ted Dunning
Was this repeatable?

If it is transient, then the test likely has a variable seed.  This test is 
statistical and may have too tight a tolerance. 

Sent from my iPhone



Re: Test failure in TDigestTest

2014-01-28 Thread Suneel Marthi
These failures are not repeatable; I have seen this happen a few times. 
The tolerance margin for this statistical test is presently set at 0.005.

I once had a test failure that read:

 java.lang.AssertionError: expected:0.5 but was:0.50578

Maybe change the fuzz factor for this test from the present 0.005 to 0.006?
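
To make the numbers concrete, here is a minimal sketch, assuming a plain empirical median of uniform samples in place of the actual TDigest code. It checks the two observed values from this thread against both tolerances and shows that a fixed seed makes a sampled statistic repeatable. The names `ToleranceSketch`, `sampleMedian` and `withinTolerance` are hypothetical; the comparison mirrors the delta check of JUnit's `assertEquals(expected, actual, delta)`.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative only; this is not the code in TDigestTest.
final class ToleranceSketch {

  // Empirical median of n uniform samples drawn from a generator with the given seed.
  static double sampleMedian(long seed, int n) {
    Random random = new Random(seed);
    double[] samples = new double[n];
    for (int i = 0; i < n; i++) {
      samples[i] = random.nextDouble();
    }
    Arrays.sort(samples);
    return samples[n / 2];
  }

  // True when actual lies within tolerance of expected.
  static boolean withinTolerance(double expected, double actual, double tolerance) {
    return Math.abs(expected - actual) <= tolerance;
  }

  public static void main(String[] args) {
    // The two values reported in this thread, checked against both tolerances.
    for (double observed : new double[] {0.49489, 0.50578}) {
      System.out.println(observed
          + " within 0.005: " + withinTolerance(0.5, observed, 0.005)
          + ", within 0.006: " + withinTolerance(0.5, observed, 0.006));
    }
    // A fixed seed makes the sampled value identical on every run.
    System.out.println(sampleMedian(42L, 100_000) == sampleMedian(42L, 100_000));
  }
}
```

Both reported values miss 0.5 by more than 0.005 but less than 0.006, so the wider tolerance would have accepted them. A wider tolerance only makes a failure less likely, while a fixed seed makes the outcome deterministic.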




[jira] [Commented] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

2014-01-28 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884313#comment-13884313
 ] 

Pat Ferrel commented on MAHOUT-1030:


Ran it through KMeans, FuzzyKMeans, sequential and m/r, and all are producing 
distances in the right range.

Looks closed to me.



[jira] [Commented] (MAHOUT-742) Pagerank implementation in Map/Reduce

2014-01-28 Thread Nilesh Chakraborty (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884332#comment-13884332
 ] 

Nilesh Chakraborty commented on MAHOUT-742:
---

Hey [~ssc], I've been working on an implementation of large-scale blockwise 
matrix-vector multiplication using a single MapReduce job. The current 
algorithms and implementations need two MapReduce jobs for blockwise 
multiplication (or any sparse mat-vec mult where the vector isn't stored in 
memory). I will be using it to implement PageRank. I'll benchmark my 
implementation against the state-of-the-art in MapReduce-based PageRank - 
Pegasus (they've contributed the Pegasus code to 
https://github.com/intel-hadoop/HiBench).

If my version turns out to be faster, I'll be writing the code for algorithms 
like SVD and Lanczos algorithm (http://en.wikipedia.org/wiki/Lanczos_algorithm) 
too.

Do you think these can make for a useful contribution to Mahout? I need to keep 
that in mind before I go forward with coding.
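
For readers unfamiliar with the idea, a minimal in-memory sketch of blockwise matrix-vector multiplication follows. It only illustrates the block decomposition: each block of the matrix is multiplied by the matching block of the vector, and partial products that share a block row are summed. It is not the single-job MapReduce algorithm described in the comment, whose details are not given here. The class name `BlockMatVecSketch`, the dense `double[][]` layout and the `blockSize` parameter are assumptions.

```java
import java.util.Arrays;

// Illustrative only; dense arrays stand in for distributed blocks.
final class BlockMatVecSketch {

  // Multiplies matrix by vector one blockSize x blockSize block at a time.
  static double[] multiply(double[][] matrix, double[] vector, int blockSize) {
    int rows = matrix.length;
    int cols = vector.length;
    double[] result = new double[rows];
    for (int rowStart = 0; rowStart < rows; rowStart += blockSize) {
      int rowEnd = Math.min(rowStart + blockSize, rows);
      for (int colStart = 0; colStart < cols; colStart += blockSize) {
        int colEnd = Math.min(colStart + blockSize, cols);
        // One block: multiply a matrix block by the matching vector block,
        // then add the partial product into the rows of this block row.
        for (int i = rowStart; i < rowEnd; i++) {
          double partial = 0.0;
          for (int j = colStart; j < colEnd; j++) {
            partial += matrix[i][j] * vector[j];
          }
          result[i] += partial;
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    double[][] m = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double[] v = {1, 0, 2};
    System.out.println(Arrays.toString(multiply(m, v, 2)));
  }
}
```

In a distributed setting each block product can be computed independently, which is why only the matching slice of the vector needs to be available to a given task.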

 Pagerank implementation in Map/Reduce
 -

 Key: MAHOUT-742
 URL: https://issues.apache.org/jira/browse/MAHOUT-742
 Project: Mahout
  Issue Type: New Feature
  Components: Graph
Affects Versions: 0.6
Reporter: Christoph Nagel
Assignee: Sebastian Schelter
 Fix For: 0.6

 Attachments: MAHOUT-742.patch


 Hi,
 my name is Christoph Nagel. I'm a student at the Technical University of Berlin 
 and am participating in the course of Isabel Drost and Sebastian Schelter.
 My work is to implement the PageRank algorithm for the case where the PageRank 
 vector fits in memory.
 For the computation I used the naive algorithm shown in the book 'Mining of 
 Massive Datasets' by Rajaraman & Ullman 
 (http://www-scf.usc.edu/~csci572/2012Spring/UllmanMiningMassiveDataSets.pdf).
 Matrix and vector multiplication are done with Mahout methods.
 Most of the work is transforming the input graph, which has to consist of a 
 nodes file and an edges file.
 Format of nodes file: node\n
 Format of edges file: startNode\tendNode\n
 Therefore I created the following classes:
 * LineIndexer: assigns each line an index
 * EdgesToIndex: indexes the nodes of the edges
 * EdgesIndexToTransitionMatrix: creates the transition matrix
 * Pagerank: computes PR from transition matrix
 * JoinNodesWithPagerank: creates the joined output
 * PagerankExampleJob: does the complete job
 Each class has a test (except PagerankExampleJob), and I used the example from 
 the book for evaluation.
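
The pipeline above can be sketched in memory as follows, assuming the nodes have already been mapped to indices 0..n-1 (the job of the indexing classes) and that the graph has no dead ends. The sketch builds a column-stochastic transition matrix from an edge list and applies the plain power iteration v <- M v. The class `NaivePageRankSketch` and its methods are hypothetical and do not reproduce the classes in the attached patch.

```java
import java.util.Arrays;

// Illustrative only; not the code from MAHOUT-742.patch.
final class NaivePageRankSketch {

  // Builds a column-stochastic transition matrix: entry [to][from] is
  // 1 / outDegree(from) for every edge from -> to.
  static double[][] transitionMatrix(int nodeCount, int[][] edges) {
    int[] outDegree = new int[nodeCount];
    for (int[] edge : edges) {
      outDegree[edge[0]]++;
    }
    double[][] m = new double[nodeCount][nodeCount];
    for (int[] edge : edges) {
      m[edge[1]][edge[0]] += 1.0 / outDegree[edge[0]];
    }
    return m;
  }

  // Repeats v <- M v for a fixed number of iterations, starting from the uniform vector.
  static double[] pageRank(double[][] m, int iterations) {
    int n = m.length;
    double[] v = new double[n];
    Arrays.fill(v, 1.0 / n);
    for (int it = 0; it < iterations; it++) {
      double[] next = new double[n];
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          next[i] += m[i][j] * v[j];
        }
      }
      v = next;
    }
    return v;
  }

  public static void main(String[] args) {
    // Four nodes: 0 -> 1,2,3; 1 -> 0,3; 2 -> 0; 3 -> 1,2.
    int[][] edges = {{0, 1}, {0, 2}, {0, 3}, {1, 0}, {1, 3}, {2, 0}, {3, 1}, {3, 2}};
    double[] rank = pageRank(transitionMatrix(4, edges), 100);
    System.out.println(Arrays.toString(rank));
  }
}
```

For this four-node graph the iteration settles near 3/9 for node 0 and 2/9 for each of the other nodes. Graphs with dead ends or spider traps need the taxation (damping) variant, which this sketch omits.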





Re: [jira] [Commented] (MAHOUT-742) Pagerank implementation in Map/Reduce

2014-01-28 Thread Sebastian Schelter

Hi Nilesh,

We've had an implementation of PageRank, which was removed a few 
releases ago. I don't think PageRank would be a valuable contribution, 
because I don't see a MapReduce based implementation being able to 
compete with systems such as Apache Giraph that also run on a standard 
Hadoop cluster.


If you wanted to work on a block-based version of our matrix 
multiplication code (that is used in some algorithms such as our 
existing Lanczos implementation afaik) that would be a very valuable 
contribution, however.


@list Any other opinions on that?

Best,
Sebastian






[jira] [Comment Edited] (MAHOUT-742) Pagerank implementation in Map/Reduce

2014-01-28 Thread Nilesh Chakraborty (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884332#comment-13884332
 ] 

Nilesh Chakraborty edited comment on MAHOUT-742 at 1/28/14 5:08 PM:


Hey [~ssc]!

I've been working on an implementation of large-scale blockwise matrix-vector 
multiplication using a single MapReduce job. The current algorithms and 
implementations need two MapReduce jobs for blockwise multiplication (or any 
sparse mat-vec mult where the vector isn't stored in memory). I will be using 
it to implement PageRank. I'll benchmark my implementation against the 
state-of-the-art in MapReduce-based PageRank - Pegasus (they've contributed the 
Pegasus code to https://github.com/intel-hadoop/HiBench).

If my version turns out to be faster, I'll be writing the code for algorithms 
like SVD and Lanczos algorithm (http://en.wikipedia.org/wiki/Lanczos_algorithm) 
too.

Do you think these can make for a useful contribution to Mahout? I need to keep 
that in mind before I go forward with coding.


was (Author: nileshc):
Hey [~ssc], I've been working on an implementation of large-scale blockwise 
matrix-vector multiplication using a single MapReduce job. The current 
algorithms and implementations need two MapReduce jobs for blockwise 
multiplication (or any sparse mat-vec mult where the vector isn't stored in 
memory). I will be using it to implement PageRank. I'll benchmark my 
implementation against the state-of-the-art in MapReduce-based PageRank - 
Pegasus (they've contributed the Pegasus code to 
https://github.com/intel-hadoop/HiBench).

If my version turns out to be faster, I'll be writing the code for algorithms 
like SVD and Lanczos algorithm (http://en.wikipedia.org/wiki/Lanczos_algorithm) 
too.

Do you think these can make for a useful contribution to Mahout? I need to keep 
that in mind before I go forward with coding.



[jira] [Commented] (MAHOUT-742) Pagerank implementation in Map/Reduce

2014-01-28 Thread Nilesh Chakraborty (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884338#comment-13884338
 ] 

Nilesh Chakraborty commented on MAHOUT-742:
---

Also, what do I need to know about Mahout's policy for using 3rd party linear 
algebra libraries like MTJ, Colt etc.? Say I need to borrow a lot of 
functionality from one such library: do I need to rewrite the code so as to 
eliminate any dependencies on such libraries? What about Apache Commons? I'd 
also appreciate it if you could give me some pointers/resources where such 
guidelines are detailed.



[jira] [Commented] (MAHOUT-742) Pagerank implementation in Map/Reduce

2014-01-28 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884344#comment-13884344
 ] 

Sebastian Schelter commented on MAHOUT-742:
---

Mahout also contains a math library; you should first check whether it 
already contains what you need :)






Re: Test failure in TDigestTest

2014-01-28 Thread Ted Dunning
Yes.  Increasing the threshold is a good idea.





[jira] [Created] (MAHOUT-1411) Random test failures from TDigestTest

2014-01-28 Thread Suneel Marthi (JIRA)
Suneel Marthi created MAHOUT-1411:
-

 Summary: Random test failures from TDigestTest
 Key: MAHOUT-1411
 URL: https://issues.apache.org/jira/browse/MAHOUT-1411
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.8
Reporter: Suneel Marthi
Assignee: Suneel Marthi
Priority: Minor
 Fix For: 0.9


Seeing random test failures like below from TDigestTest. These errors are not 
repeatable. 

{Code}

testUniform(org.apache.mahout.math.stats.TDigestTest)  Time elapsed: 0.356 sec  FAILURE!
java.lang.AssertionError: expected:0.5 but was:0.50578
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:494)
at org.junit.Assert.assertEquals(Assert.java:592)
at org.apache.mahout.math.stats.TDigestTest.runTest(TDigestTest.java:373)
at org.apache.mahout.math.stats.TDigestTest.testUniform(TDigestTest.java:79)

Results :

Failed tests: 
  TDigestTest.testUniform:79-runTest:373 expected:0.5 but was:0.50578

{Code}

{Code}
Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 40.063 sec FAILURE! - in org.apache.mahout.math.stats.TDigestTest
testSequentialPoints(org.apache.mahout.math.stats.TDigestTest)  Time elapsed: 4.674 sec   FAILURE!
java.lang.AssertionError: expected:0.5 but was:0.49489
{Code}







[jira] [Resolved] (MAHOUT-1411) Random test failures from TDigestTest

2014-01-28 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi resolved MAHOUT-1411.
---

Resolution: Fixed

Increased the tolerance limit to 0.006 (from the present 0.005) to handle the 
random test failures.



[jira] [Commented] (MAHOUT-742) Pagerank implementation in Map/Reduce

2014-01-28 Thread Nilesh Chakraborty (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884433#comment-13884433
 ] 

Nilesh Chakraborty commented on MAHOUT-742:
---

My bad, didn't know a lot about Mahout org.apache.mahout.math.Matrix and her 
friends were so full-featured. Thanks. Then it shouldn't be any problem. :-)

Actually I had come across #MAHOUT-879 (Remove all graph algorithms with the 
exception of PageRank) and was just checking with you if large-scale sparse 
mat-vec mult and PageRank implementations in MapReduce are welcome.



[jira] [Comment Edited] (MAHOUT-742) Pagerank implementation in Map/Reduce

2014-01-28 Thread Nilesh Chakraborty (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884433#comment-13884433
 ] 

Nilesh Chakraborty edited comment on MAHOUT-742 at 1/28/14 7:00 PM:


My bad, didn't know that Mahout org.apache.mahout.math.Matrix and her friends 
were so full-featured. Thanks. Then it shouldn't be any problem. :-)

Actually I had come across #MAHOUT-879 (Remove all graph algorithms with the 
exception of PageRank) and was just checking with you if large-scale sparse 
mat-vec mult and PageRank implementations in MapReduce are welcome.


was (Author: nileshc):
My bad, didn't know a lot about Mahout org.apache.mahout.math.Matrix and her 
friends were so full-featured. Thanks. Then it shouldn't be any problem. :-)

Actually I had come across #MAHOUT-879 (Remove all graph algorithms with the 
exception of PageRank) and was just checking with you if large-scale sparse 
mat-vec mult and PageRank implementations in MapReduce are welcome.



[jira] [Commented] (MAHOUT-1411) Random test failures from TDigestTest

2014-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884473#comment-13884473
 ] 

Hudson commented on MAHOUT-1411:


SUCCESS: Integrated in Mahout-Quality #2449 (See 
[https://builds.apache.org/job/Mahout-Quality/2449/])
MAHOUT-1411: Random test failures from TDigestTest (smarthi: rev 1562146)
* /mahout/trunk/CHANGELOG
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/stats/TDigestTest.java


 Random test failures from TDigestTest
 -

 Key: MAHOUT-1411
 URL: https://issues.apache.org/jira/browse/MAHOUT-1411
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.8
Reporter: Suneel Marthi
Assignee: Suneel Marthi
Priority: Minor
 Fix For: 0.9


 Seeing random test failures like below from TDigestTest. These errors are not 
 repeatable. 
 {Code}
 testUniform(org.apache.mahout.math.stats.TDigestTest)  Time elapsed: 0.356 sec   FAILURE!
 java.lang.AssertionError: expected:0.5 but was:0.50578
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:494)
   at org.junit.Assert.assertEquals(Assert.java:592)
   at org.apache.mahout.math.stats.TDigestTest.runTest(TDigestTest.java:373)
   at org.apache.mahout.math.stats.TDigestTest.testUniform(TDigestTest.java:79)
 Results :
 Failed tests: 
   TDigestTest.testUniform:79-runTest:373 expected:0.5 but was:0.50578
 {Code}
 {Code}
 Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 40.063 sec FAILURE! - in org.apache.mahout.math.stats.TDigestTest
 testSequentialPoints(org.apache.mahout.math.stats.TDigestTest)  Time elapsed: 4.674 sec   FAILURE!
 java.lang.AssertionError: expected:0.5 but was:0.49489
 {Code}
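
Failures like these are typical of a test that draws fresh random data on every run and compares an estimate against a fixed tolerance. The sketch below illustrates the effect and one common remedy, a fixed seed. It assumes nothing about how TDigestTest itself was changed in rev 1562146: the helper median_error is hypothetical and uses a plain sorted-sample median rather than a t-digest.

```python
import random


def median_error(seed, n=10_000):
    """Absolute error of the sample median of n uniform(0, 1) draws vs. 0.5."""
    rng = random.Random(seed)  # a private generator, so the seed fully fixes the data
    samples = sorted(rng.random() for _ in range(n))
    return abs(samples[n // 2] - 0.5)


# Different seeds give different errors, so a tight tolerance can pass on
# one run and fail on the next when the seed is not controlled.
errors_by_seed = {seed: median_error(seed) for seed in range(5)}

# With a fixed seed the error is the same on every run, so a failure
# (or a pass) is repeatable.
repeatable = median_error(42) == median_error(42)
```

Fixing the seed makes a failure reproducible but does not make the tolerance correct; the tolerance still has to be wide enough for the estimator's real error.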





[jira] [Resolved] (MAHOUT-1021) Blank csv input file given to Canopy/Kmeans clustering

2014-01-28 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi resolved MAHOUT-1021.
---

Resolution: Won't Fix

This issue was reported against Mahout 0.6. The codebase has since changed, and 
the reported issue cannot be reproduced against the present trunk. Resolving 
this as 'Won't Fix'.

 Blank csv input file given to Canopy/Kmeans clustering
 --

 Key: MAHOUT-1021
 URL: https://issues.apache.org/jira/browse/MAHOUT-1021
 Project: Mahout
  Issue Type: Improvement
  Components: Integration
Affects Versions: 0.6
 Environment: Mahout 0.6 version on hadoop 0.2, Testing on 
 HadooponAzure platform
Reporter: Nabarun Sengupta
Assignee: Suneel Marthi
Priority: Minor
 Fix For: Backlog


 Hi,
 This is regarding a bug that we observed in Canopy clustering. We could 
 reproduce the same in Kmeans too. Given a blank csv input file, the algorithm 
 executes two jobs and then throws an error during the third job. When I tried 
 a malformed csv file containing decimals or characters, I received an error 
 during the first job itself. Therefore, I feel the same validation should be 
 done when the input file is blank, and an exception should be thrown during 
 the first job execution.
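
The validation asked for here amounts to a fail-fast check before any job is launched. A minimal sketch, assuming the input is available as an iterable of text lines; check_not_blank is a hypothetical helper for illustration, not a Mahout API:

```python
def check_not_blank(lines):
    """Return the number of non-blank rows, or raise if there are none."""
    rows = [line for line in lines if line.strip()]
    if not rows:
        # Fail before any clustering job starts instead of in a later job.
        raise ValueError("input file is blank: nothing to cluster")
    return len(rows)
```

Running such a check at the start of the driver would surface a blank file at the same point where a malformed file is already rejected.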
 Following are the job execution details:
 Apps\dist\mahout\examples\binbuild-cluster-syntheticcontrol.cmd
 ease select a number to choose the corresponding clustering algorithm
  canopy clustering
  kmeans clustering
  fuzzykmeans clustering
  dirichlet clustering
  meanshift clustering
 er your choice:1
 . You chose 1 and we'll use canopy Clustering
 S is healthy... 
 loading Synthetic control data to HDFS
 eted hdfs://10.114.251.23:9000/user/milind/testdata
 ccessfully Uploaded Synthetic control data to HDFS 
 nning on hadoop, using HADOOP_HOME=c:\Apps\dist
 Apps\dist\bin\hadoop jar c:\Apps\dist\mahout\mahout-examples-0.5-job.jar 
 org.apache.mahout.driver.MahoutDriver org.apache.mah
 ontrol.canopy.Job
 05/17 10:46:11 WARN driver.MahoutDriver: No 
 org.apache.mahout.clustering.syntheticcontrol.canopy.Job.props found on 
 classpath
 rguments only
 05/17 10:46:11 INFO canopy.Job: Running with default arguments
 05/17 10:46:12 INFO common.HadoopUtil: Deleting output
 05/17 10:46:12 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
 the arguments. Applications should implement Tool
 05/17 10:46:13 INFO input.FileInputFormat: Total input paths to process : 1
 05/17 10:46:14 INFO mapred.JobClient: Running job: job_201205170655_0017
 05/17 10:46:15 INFO mapred.JobClient:  map 0% reduce 0%
 05/17 10:46:48 INFO mapred.JobClient:  map 100% reduce 0%
 05/17 10:46:59 INFO mapred.JobClient: Job complete: job_201205170655_0017
 05/17 10:46:59 INFO mapred.JobClient: Counters: 15
 05/17 10:46:59 INFO mapred.JobClient:   Job Counters
 05/17 10:46:59 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=29672
 05/17 10:46:59 INFO mapred.JobClient: Total time spent by all reduces 
 waiting after reserving slots (ms)=0
 05/17 10:46:59 INFO mapred.JobClient: Total time spent by all maps 
 waiting after reserving slots (ms)=0
 05/17 10:46:59 INFO mapred.JobClient: Launched map tasks=1
 05/17 10:46:59 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
 05/17 10:46:59 INFO mapred.JobClient:   File Output Format Counters
 05/17 10:46:59 INFO mapred.JobClient: Bytes Written=90
 05/17 10:46:59 INFO mapred.JobClient:   FileSystemCounters
 05/17 10:46:59 INFO mapred.JobClient: FILE_BYTES_READ=130
 05/17 10:46:59 INFO mapred.JobClient: HDFS_BYTES_READ=134
 05/17 10:46:59 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21557
 05/17 10:46:59 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=90
 05/17 10:46:59 INFO mapred.JobClient:   File Input Format Counters
 05/17 10:46:59 INFO mapred.JobClient: Bytes Read=0
 05/17 10:46:59 INFO mapred.JobClient:   Map-Reduce Framework
 05/17 10:46:59 INFO mapred.JobClient: Map input records=0
 05/17 10:46:59 INFO mapred.JobClient: Spilled Records=0
 05/17 10:46:59 INFO mapred.JobClient: Map output records=0
 05/17 10:46:59 INFO mapred.JobClient: SPLIT_RAW_BYTES=134
 05/17 10:46:59 INFO canopy.CanopyDriver: Build Clusters Input: output/data 
 Out: output Measure: org.apache.mahout.common.dist
 sure@6eedf759 t1: 80.0 t2: 55.0
 05/17 10:46:59 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
 the arguments. Applications should implement Tool
 05/17 10:46:59 INFO input.FileInputFormat: Total input paths to process : 1
 05/17 10:47:00 INFO mapred.JobClient: Running job: job_201205170655_0018
 05/17 10:47:01 INFO mapred.JobClient:  map 0% reduce 0%
 05/17 10:47:33 INFO mapred.JobClient:  map 100% reduce 0%
 05/17 10:47:51 INFO mapred.JobClient:  map 100% reduce 100%
 05/17 10:48:02 INFO 

Mahout 0.9 Release

2014-01-28 Thread Suneel Marthi
Fixed the issues that were reported with the Clustering code this past week, 
and upgraded the codebase to Lucene 4.6.1, which was released today.

Here's the URL for the 0.9 release in staging:-
https://repository.apache.org/content/repositories/orgapachemahout-1004/org/apache/mahout/mahout-distribution/0.9/

The artifacts have been signed with the following key:
https://people.apache.org/keys/committer/smarthi.asc

Please:-
a) Verify that you can unpack the release (tar or zip)
b) Verify you are able to compile the distro
c) Run through the unit tests: mvn clean test
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run through 
all the different options in each script.

Need a minimum of 3 '+1' votes from PMC for the release to be finalized.

RE: Mahout 0.9 Release

2014-01-28 Thread Andrew Palumbo
a), b), c), d) all passed here. 

CosineDistance of clustered points from cluster-reuters.sh -1 kmeans were 
within the range [0,1].

 Date: Tue, 28 Jan 2014 16:45:42 -0800
 From: suneel_mar...@yahoo.com
 Subject: Mahout 0.9 Release
 To: u...@mahout.apache.org; dev@mahout.apache.org
 

Re: Mahout 0.9 Release

2014-01-28 Thread Andrew Musselman
Looks good.

+1


On Tue, Jan 28, 2014 at 8:07 PM, Andrew Palumbo ap@outlook.com wrote:

 a), b), c), d) all passed here.

 CosineDistance of clustered points from cluster-reuters.sh -1 kmeans were
 within the range [0,1].
