[jira] [Commented] (MAHOUT-1417) Random decision forest implementation fails in Hadoop 2

2014-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903074#comment-13903074
 ] 

Sean Owen commented on MAHOUT-1417:
---

Ah thanks for catching that test failure!

 Random decision forest implementation fails in Hadoop 2
 ---

 Key: MAHOUT-1417
 URL: https://issues.apache.org/jira/browse/MAHOUT-1417
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.7, 0.8, 0.9
 Environment: CDH 4.5.0.1 + Mahout 0.7+patches
Reporter: Sean Owen
  Labels: classifier, random-decision-forests, rdf
 Fix For: 1.0

 Attachments: MAHOUT-1417.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 We've observed two errors in the RDF implementation, one of which stops it 
 from working on Hadoop 2 (at least I think it is Hadoop 2 only), and one of 
 which just makes the workload quite imbalanced.
 A key piece of logic in PartialBuilder.java queries mapred.map.tasks to know 
 the total number of mappers. However this has never been guaranteed to be set 
 to the number of mappers; it is how a caller sets a default number of 
 mappers, which may be overridden by Hadoop, and which defaults to 1. 
 I suspect that this may have actually been set, in some or all cases, to the 
 number of mappers in Hadoop 1, but I am not sure. Certainly, sometimes it 
 will happen to be set to a value that equals the number of mappers used.
 But when it doesn't it causes the distribution of trees to mappers to be 
 quite wrong. For example, with 20 trees and 8 mappers in one example, I find 
 that mapred.map.tasks=1. Logging messages indicate that mapper 0 handles all 
 trees (0-19), mapper 1 handles non-existent 20-39, etc.
 The result is that most mappers do nothing and one does everything. This 
 results in empty part-m-x files. And, that in turn fails the job. (This 
 part I also suspect is new, or situation-specific, behavior in Hadoop 2. In 
 any event, this code should never have idle mappers and fixing that avoids 
 whatever is going on there.)
 There's a second, less serious issue in how trees are assigned to mappers. 
 When the number of trees is not a multiple of the number of mappers, the 
 remainder is assigned entirely to mapper 0. So with 20 trees and 8 mappers, 
 all mappers build 2 trees, but mapper 0 builds 6. This is unnecessarily 
 imbalanced.
 Patch coming once I can verify the fix, but current proposal is to:
 - Compute the number of maps ahead of time using TextInputFormat and set 
 mapred.map.tasks
 - Fix the method that computes trees per mapper to spread as evenly as 
 possible (i.e. all mappers build either N or N+1 trees)
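
The two assignment schemes described above can be sketched as follows. This is a minimal illustration in Python, not the code in PartialBuilder.java or in the attached patch; the function names and signatures are hypothetical. It reproduces the numbers from the report (20 trees, 8 mappers) and shows what "either N or N+1 trees" looks like.

```python
def trees_per_mapper_old(num_trees, num_maps, partition):
    """Scheme described in the report: the whole remainder goes to mapper 0.

    Hypothetical helper for illustration only.
    """
    base = num_trees // num_maps
    if partition == 0:
        return base + num_trees % num_maps
    return base


def trees_per_mapper_even(num_trees, num_maps, partition):
    """Proposed scheme: every mapper builds either N or N+1 trees."""
    base, remainder = divmod(num_trees, num_maps)
    # The first `remainder` mappers each take one extra tree.
    return base + (1 if partition < remainder else 0)


def first_tree_id(num_trees, num_maps, partition):
    """Id of the first tree built by `partition` under the even scheme."""
    base, remainder = divmod(num_trees, num_maps)
    return partition * base + min(partition, remainder)


if __name__ == "__main__":
    for p in range(8):
        print(p, trees_per_mapper_old(20, 8, p), trees_per_mapper_even(20, 8, p))
```

Note that either scheme only works if num_maps is the real number of mappers. If it is read as 1 while 8 mappers actually run, every mapper computes a share of 20 trees, which matches the "mapper 1 handles non-existent 20-39" symptom in the report.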





Re: Mahout 0.9 Release Notes - First Draft

2014-02-17 Thread Ted Dunning
On Tue, Feb 11, 2014 at 10:06 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Switched LDA implementation from using Dirichlet to Collapsed
 Variational Bayes (CVB)


This line should read:

Switched LDA implementation from using Gibbs sampling to Collapsed
Variational Bayes (CVB)


Otherwise, it looks pretty good.


Re: Mahout 0.9 Release Notes - First Draft

2014-02-17 Thread Ellen Friedman
Hi Suneel,

Thanks for the notes. I'm inquiring about the status of the notes and the
website update announcing 0.9. Ted has reviewed the release notes: were you
waiting for additional input, or are they ready to go on the website? Are
you the one who updates the site?

I've been asked to write a short blog on the release but wanted to wait
until the site is updated.

Thanks much
Ellen



On Tue, Feb 11, 2014 at 10:06 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Here's a draft of the release notes for Mahout 0.9. Please review.

 --


 The Apache Mahout PMC is pleased to announce the release of Mahout 0.9.
 Mahout's goal is to build scalable machine learning libraries focused
 primarily in the areas of collaborative filtering (recommenders),
 clustering and classification (known collectively as the 3Cs), as well
 as the necessary infrastructure to support those implementations
 including, but not limited to, math packages for statistics, linear
 algebra and others as well as Java primitive collections, local and
 distributed vector and matrix classes and a variety of integrative code
 to work with popular packages like Apache Hadoop, Apache Lucene, Apache
 HBase, Apache Cassandra and much more. The 0.9 release is mainly a
 cleanup release in preparation for an upcoming 1.0 release targeted for
 the first half of 2014, but there are a few significant new features,
 which are highlighted below.

 To get started with Apache Mahout 0.9, download the release artifacts and
 signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the
 central Maven repository.

 As with any release, we wish to thank all of the users and contributors
 to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for
 individual credits, as there are too many to list here.

 GETTING STARTED

 In the release package, the examples directory contains several working
 examples of the core functionality available in Mahout. These can be run
 via scripts in the examples/bin directory and will prompt you for more
 information to help you try things out. Most examples do not need a
 Hadoop cluster in order to run.

 RELEASE HIGHLIGHTS

 The highlights of the Apache Mahout 0.9 release include, but are not
 limited to the list below. For further information, see the included
 CHANGELOG[1] file.

 -  MAHOUT-1297: Scala DSL Bindings for Mahout Math Linear Algebra. See
 http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html
 -  MAHOUT-1288: Recommenders as a Search. See
 https://github.com/pferrel/solr-recommender
 -  MAHOUT-1364: Upgrade Mahout to Lucene 4.6.1
 -  MAHOUT-1361: Online Algorithm for computing accurate Quantiles using
 1-dimensional Clustering. See
 https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf
 for the details.
 -  MAHOUT-1265: MultiLayer Perceptron (MLP) classifier. This is an early
 implementation of MLP to solicit user feedback; it still needs to be
 integrated into Mahout's processing pipeline to work with Mahout's
 vectors.

 - Removed Deprecated algorithms as they have been either replaced by
 better performing algorithms or lacked user support and maintenance.

 - the usual bug fixes. See [2] for more information on the 0.9 release.

 A total of 113 separate JIRA issues were addressed in this release.

 The following algorithms that were marked deprecated in 0.8 have been
 removed in 0.9:

 - From Clustering:
   Switched LDA implementation from using Dirichlet to Collapsed
 Variational Bayes (CVB)

   Meanshift

   MinHash - removed due to poor performance,  lack of support and lack of
 usage

 - From Classification (both are sequential implementations)

   Winnow - lack of actual usage and support

   Perceptron - lack of actual usage and support

 - Collaborative Filtering
 SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone
 and org.apache.mahout.cf.taste.impl.recommender.slopeone
 Distributed pseudo recommender in
 org.apache.mahout.cf.taste.hadoop.pseudo
 TreeClusteringRecommender in
 org.apache.mahout.cf.taste.impl.recommender

 - Mahout Math
 Hadoop entropy stuff in org.apache.mahout.math.stats.entropy

 CONTRIBUTING

 Mahout is always looking for contributions focused on the 3Cs. If you are
 interested in contributing, please see our contribution page
 http://mahout.apache.org/developers/how-to-contribute.html or contact us
 via email at dev@mahout.apache.org.

 As the project moves towards a 1.0 release, the community will be focused
 on key algorithms that are proven to scale in production and have seen
 wide-spread adoption.

 [1]
 http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?view=markup&pathrev=1563661
 [2]
 https://issues.apache.org/jira/browse/MAHOUT-1411?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.9%22


Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-17 Thread peng
If SSVD is not designed for such eigenvector problems, then I would vote 
for retaining the Lanczos algorithm.
However, I would like to see the opposite case: I have tested both 
algorithms on the symmetric case, and SSVD is much faster and more accurate 
than its competitor.


Yours Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

In PageRank I'm afraid I have no other option than eigenvector
\lambda, but not singular vectors u & v :) The PageRank in Mahout was
removed along with the other graph-based algorithms.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

SSVD is very probably better than Lanczos for any large decomposition.
  That said, it does SVD, not eigen decomposition which means that the
question of symmetrical matrices or positive definiteness doesn't much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:


Just asking for possible replacement of our Lanczos-based PageRank
implementation. - Peng





Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-17 Thread Ted Dunning
For the symmetric case, SVD is eigen decomposition.
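
To spell out the step behind that remark (a short derivation added for illustration, using standard linear algebra rather than anything specific to Mahout's SSVD): a real symmetric matrix A has an orthogonal eigendecomposition, and an SVD can be read off from it by moving the signs of the eigenvalues into the right singular vectors. Here sign(0) is taken as +1, and the singular values still need to be sorted by magnitude.

```latex
A = Q \Lambda Q^{\top}, \qquad Q^{\top} Q = I, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)

A = U \Sigma V^{\top} \quad \text{with} \quad
U = Q, \qquad \Sigma = |\Lambda| = \operatorname{diag}(|\lambda_1|, \dots, |\lambda_n|), \qquad
V = Q \,\operatorname{sign}(\Lambda)
```

When A is also positive semidefinite, every eigenvalue is non-negative, so U = V = Q and Sigma = Lambda, and the two decompositions coincide. For a non-symmetric matrix no such identity holds in general.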









Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-17 Thread peng

Agreed, and this is the case where the Lanczos algorithm is obsolete.
My point is: if SSVD is unable to find the eigenvectors of an asymmetric 
matrix (this is a common formulation of PageRank, some random 
walks, and many other things), then we still have to rely on a 
large-scale Lanczos algorithm.
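
For context on the asymmetric case, the dominant eigenvector of a PageRank transition matrix can also be approximated by plain power iteration. The sketch below is a minimal pure-Python illustration of that technique on a toy graph. It is neither Lanczos nor SSVD, it is not taken from Mahout, and the function name, parameters and example graph are all hypothetical. It assumes every link target also appears as a key of the input dict.

```python
def pagerank_power_iteration(links, damping=0.85, tol=1e-12, max_iter=1000):
    """Approximate PageRank by power iteration.

    links: dict mapping each node to the list of nodes it links to.
    Returns a dict mapping each node to its rank; ranks sum to 1.
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in links[v]:
                new_rank[w] += damping * rank[v] / len(links[v])
        # L1 distance between successive iterates is the stopping criterion.
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy, non-symmetric link structure: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_power_iteration(graph)
```

Each iteration is one sparse matrix-vector product, which is why this approach is commonly used for stationary distributions when only the leading eigenvector is needed.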










Build failed in Jenkins: mahout-nightly » Mahout Release Package #1500

2014-02-17 Thread Apache Jenkins Server
See 
https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-distribution/1500/

--
[INFO] 
[INFO] 
[INFO] Building Mahout Release Package 1.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ mahout-distribution 
---
[INFO] Deleting 
https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-distribution/ws/target
[INFO] 
[INFO] --- build-helper-maven-plugin:1.8:remove-project-artifact 
(remove-old-mahout-artifacts) @ mahout-distribution ---
[INFO] 
/home/jenkins/jenkins-slave/maven-repositories/0/org/apache/mahout/mahout-distribution
 removed.
[INFO] 
[INFO] --- maven-assembly-plugin:2.4:single (bin-assembly) @ 
mahout-distribution ---
[INFO] Reading assembly descriptor: src/main/assembly/bin.xml


Build failed in Jenkins: mahout-nightly #1500

2014-02-17 Thread Apache Jenkins Server
See https://builds.apache.org/job/mahout-nightly/1500/

--
[...truncated 1977 lines...]
[INFO] Copying jackson-mapper-asl-1.9.12.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jackson-mapper-asl-1.9.12.jar
[INFO] Copying lucene-benchmark-4.6.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-benchmark-4.6.1.jar
[INFO] Copying junit-4.11.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/junit-4.11.jar
[INFO] Copying lucene-sandbox-4.6.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-sandbox-4.6.1.jar
[INFO] Copying slf4j-api-1.7.5.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/slf4j-api-1.7.5.jar
[INFO] Copying randomizedtesting-runner-2.0.15.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/randomizedtesting-runner-2.0.15.jar
[INFO] Copying jcl-over-slf4j-1.7.5.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jcl-over-slf4j-1.7.5.jar
[INFO] Copying mahout-math-1.0-SNAPSHOT-tests.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/mahout-math-1.0-SNAPSHOT-tests.jar
[INFO] Copying jaxb-api-2.2.2.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jaxb-api-2.2.2.jar
[INFO] Copying jettison-1.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jettison-1.1.jar
[INFO] Copying lucene-spatial-4.6.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-spatial-4.6.1.jar
[INFO] Copying commons-httpclient-3.0.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-httpclient-3.0.1.jar
[INFO] Copying commons-compress-1.4.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-compress-1.4.1.jar
[INFO] Copying hamcrest-core-1.3.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/hamcrest-core-1.3.jar
[INFO] Copying commons-lang-2.4.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-lang-2.4.jar
[INFO] Copying lucene-queries-4.6.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-queries-4.6.1.jar
[INFO] Copying commons-codec-1.4.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-codec-1.4.jar
[INFO] Copying solr-commons-csv-3.5.0.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/solr-commons-csv-3.5.0.jar
[INFO] Copying commons-collections-3.2.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-collections-3.2.1.jar
[INFO] Copying xercesImpl-2.9.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/xercesImpl-2.9.1.jar
[INFO] Copying mahout-core-1.0-SNAPSHOT.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/mahout-core-1.0-SNAPSHOT.jar
[INFO] Copying xstream-1.4.4.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/xstream-1.4.4.jar
[INFO] Copying mahout-integration-1.0-SNAPSHOT.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/mahout-integration-1.0-SNAPSHOT.jar
[INFO] Copying guava-16.0.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/guava-16.0.jar
[INFO] Copying slf4j-log4j12-1.7.5.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/slf4j-log4j12-1.7.5.jar
[INFO] Copying cglib-nodep-2.2.2.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/cglib-nodep-2.2.2.jar
[INFO] Copying commons-configuration-1.6.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-configuration-1.6.jar
[INFO] Copying commons-io-2.4.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-io-2.4.jar
[INFO] Copying jackson-xc-1.7.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jackson-xc-1.7.1.jar
[INFO] Copying jersey-json-1.8.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jersey-json-1.8.jar
[INFO] Copying commons-math-2.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-math-2.1.jar
[INFO] Copying jaxb-impl-2.2.3-1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jaxb-impl-2.2.3-1.jar
[INFO] Copying jersey-core-1.8.jar to 

Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-17 Thread Dmitriy Lyubimov
I bet PageRank in MLlib in Spark finds the stationary distribution much faster.
On Feb 17, 2014 1:33 PM, peng pc...@uowmail.edu.au wrote:

 Agreed, and this is the case where Lanczos algorithm is obsolete.
 My point is: if SSVD is unable to find the eigenvector of asymmetric
 matrix (this is a common formulation of PageRank, and some random walks,
 and many other things), then we still have to rely on large-scale Lanczos
 algorithm.
