[jira] [Commented] (MAHOUT-1417) Random decision forest implementation fails in Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903074#comment-13903074 ]

Sean Owen commented on MAHOUT-1417:
---

Ah thanks for catching that test failure!

Random decision forest implementation fails in Hadoop 2
---

Key: MAHOUT-1417
URL: https://issues.apache.org/jira/browse/MAHOUT-1417
Project: Mahout
Issue Type: Bug
Components: Classification
Affects Versions: 0.7, 0.8, 0.9
Environment: CDH 4.5.0.1 + Mahout 0.7+patches
Reporter: Sean Owen
Labels: classifier, random-decision-forests, rdf
Fix For: 1.0
Attachments: MAHOUT-1417.patch
Original Estimate: 24h
Remaining Estimate: 24h

We've observed two errors in the RDF implementation, one of which stops it from working on Hadoop 2 (at least I think it is Hadoop 2 only), and one of which just makes the workload quite imbalanced.

A key piece of logic in PartialBuilder.java queries mapred.map.tasks to know the total number of mappers. However, this has never been guaranteed to be set to the number of mappers; it is how a caller sets a default number of mappers, which may be overridden by Hadoop, and which defaults to 1. I suspect that this may have actually been set, in some or all cases, to the number of mappers in Hadoop 1, but I am not sure. Certainly, sometimes it will happen to be set to a value that equals the number of mappers used. But when it doesn't, it causes the distribution of trees to mappers to be quite wrong.

In one example with 20 trees and 8 mappers, I find that mapred.map.tasks=1. Logging messages indicate that mapper 0 handles all trees (0-19), mapper 1 handles non-existent trees 20-39, etc. The result is that most mappers do nothing and one does everything. This results in empty part-m-x files, and that in turn fails the job. (This part I also suspect is new, or situation-specific, behavior in Hadoop 2. In any event, this code should never have idle mappers, and fixing that avoids whatever is going on there.)

There's a second, less serious issue in how trees are assigned to mappers. When the number of trees is not a multiple of the number of mappers, the remainder is assigned entirely to mapper 0. So with 20 trees and 8 mappers, all mappers build 2 trees, but mapper 0 builds 6. This is unnecessarily imbalanced.

Patch coming once I can verify the fix, but the current proposal is to:
- Compute the number of maps ahead of time using TextInputFormat and set mapred.map.tasks
- Fix the method that computes trees per mapper to spread them as evenly as possible (i.e. all mappers build either N or N+1 trees)

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
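The tree-assignment arithmetic described in the report can be sketched as below. This is only an illustration in Python, not the code in PartialBuilder.java or the attached patch: the function names are hypothetical, and `stale_ranges` is merely a model of the logged behavior (mapper 0 gets 0-19, mapper 1 gets 20-39) when the configured map count differs from the real one.

```python
def trees_remainder_to_first(num_trees, num_maps):
    """Allocation described in the report: every mapper gets the floor
    share, and mapper 0 additionally takes the whole remainder."""
    base, rem = divmod(num_trees, num_maps)
    return [base + rem if p == 0 else base for p in range(num_maps)]


def trees_balanced(num_trees, num_maps):
    """Proposed allocation: each mapper builds either N or N+1 trees."""
    base, rem = divmod(num_trees, num_maps)
    return [base + 1 if p < rem else base for p in range(num_maps)]


def stale_ranges(num_trees, configured_maps, actual_maps):
    """Model of the failure: tree-id ranges are derived from the
    configured map count, but actual_maps mappers really run.
    Ranges are half-open [first, last)."""
    per_mapper = num_trees // configured_maps
    return [(p * per_mapper, (p + 1) * per_mapper) for p in range(actual_maps)]


if __name__ == "__main__":
    print(trees_remainder_to_first(20, 8))  # [6, 2, 2, 2, 2, 2, 2, 2]
    print(trees_balanced(20, 8))            # [3, 3, 3, 3, 2, 2, 2, 2]
    # With mapred.map.tasks=1 but 8 real mappers, only the first range
    # contains trees that exist (ids 0..19).
    print(stale_ranges(20, 1, 8)[:2])       # [(0, 20), (20, 40)]
```

With the balanced version the largest and smallest per-mapper workloads differ by at most one tree, which is the property the proposal asks for.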
Re: Mahout 0.9 Release Notes - First Draft
On Tue, Feb 11, 2014 at 10:06 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Switched LDA implementation from using Dirichlet to Collapsed Variational Bayes (CVB)

This line should read:

Switched LDA implementation from using Gibbs sampling to Collapsed Variational Bayes (CVB)

Otherwise, it looks pretty good.
Re: Mahout 0.9 Release Notes - First Draft
Hi Suneel,

Thanks for the notes. I'm inquiring about the status of the notes and the update to the website to announce 0.9: Ted has reviewed the release notes. Were you waiting for additional input, or are they ready to go on the website? Are you the one who updates the site? I've been asked to write a short blog on the release but wanted to wait until the site is updated.

Thanks much
Ellen

On Tue, Feb 11, 2014 at 10:06 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Here's a draft of the Release Notes for Mahout 0.9. Please review the same.

--

The Apache Mahout PMC is pleased to announce the release of Mahout 0.9. Mahout's goal is to build scalable machine learning libraries focused primarily in the areas of collaborative filtering (recommenders), clustering and classification (known collectively as the 3Cs), as well as the necessary infrastructure to support those implementations including, but not limited to, math packages for statistics, linear algebra and others as well as Java primitive collections, local and distributed vector and matrix classes and a variety of integrative code to work with popular packages like Apache Hadoop, Apache Lucene, Apache HBase, Apache Cassandra and much more. The 0.9 release is mainly a clean-up release in preparation for an upcoming 1.0 release targeted for the first half of 2014, but there are a few significant new features, which are highlighted below.

To get started with Apache Mahout 0.9, download the release artifacts and signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the central Maven repository.

As with any release, we wish to thank all of the users and contributors to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for individual credits, as there are too many to list here.

GETTING STARTED

In the release package, the examples directory contains several working examples of the core functionality available in Mahout. These can be run via scripts in the examples/bin directory and will prompt you for more information to help you try things out. Most examples do not need a Hadoop cluster in order to run.

RELEASE HIGHLIGHTS

The highlights of the Apache Mahout 0.9 release include, but are not limited to, the list below. For further information, see the included CHANGELOG [1] file.

- MAHOUT-1297: Scala DSL Bindings for Mahout Math Linear Algebra. See http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html
- MAHOUT-1288: Recommenders as a Search. See https://github.com/pferrel/solr-recommender
- MAHOUT-1364: Upgrade Mahout to Lucene 4.6.1
- MAHOUT-1361: Online Algorithm for computing accurate Quantiles using 1-dimensional Clustering. See https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf for the details.
- MAHOUT-1265: MultiLayer Perceptron (MLP) classifier. This is an early implementation of MLP to solicit user feedback; it still needs to be integrated into Mahout's processing pipeline to work with Mahout's vectors.
- Removed Deprecated algorithms as they have been either replaced by better performing algorithms or lacked user support and maintenance.
- The usual bug fixes. See [2] for more information on the 0.9 release.

A total of 113 separate JIRA issues were addressed in this release.

The following algorithms that were marked deprecated in 0.8 have been removed in 0.9:

- From Clustering:
  - Switched LDA implementation from using Dirichlet to Collapsed Variational Bayes (CVB)
  - Meanshift
  - MinHash - removed due to poor performance, lack of support and lack of usage
- From Classification (both are sequential implementations):
  - Winnow - lack of actual usage and support
  - Perceptron - lack of actual usage and support
- Collaborative Filtering:
  - SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone
  - Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
  - TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender
- Mahout Math:
  - Hadoop entropy stuff in org.apache.mahout.math.stats.entropy

CONTRIBUTING

Mahout is always looking for contributions focused on the 3Cs. If you are interested in contributing, please see our contribution page http://mahout.apache.org/developers/how-to-contribute.html or contact us via email at dev@mahout.apache.org.

As the project moves towards a 1.0 release, the community will be focused on key algorithms that are proven to scale in production and have seen widespread adoption.

[1] http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?view=markup&pathrev=1563661
[2] https://issues.apache.org/jira/browse/MAHOUT-1411?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.9%22

On Monday, December 23, 2013 7:41 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

On Sun, Dec 22, 2013 at
Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?
If SSVD is not designed for such eigenvector problems, then I would vote for retaining the Lanczos algorithm. However, I would like to see the opposite case: I have tested both algorithms on the symmetric case, and SSVD is much faster and more accurate than its competitor.

Yours
Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

In PageRank I'm afraid I have no other option than eigenvector \lambda, but not singular vector u v:) The PageRank in Mahout was removed with other graph-based algorithms.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

SSVD is very probably better than Lanczos for any large decomposition. That said, it does SVD, not eigen decomposition, which means that the question of symmetrical matrices or positive definiteness doesn't much matter. Do you really need eigen-decomposition?

On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:

Just asking for possible replacement of our Lanczos-based PageRank implementation. - Peng
Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?
For the symmetric case, SVD is eigen decomposition.

On Mon, Feb 17, 2014 at 1:12 PM, peng pc...@uowmail.edu.au wrote:

If SSVD is not designed for such eigenvector problems, then I would vote for retaining the Lanczos algorithm. However, I would like to see the opposite case: I have tested both algorithms on the symmetric case, and SSVD is much faster and more accurate than its competitor.

Yours
Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

In PageRank I'm afraid I have no other option than eigenvector \lambda, but not singular vector u v:) The PageRank in Mahout was removed with other graph-based algorithms.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

SSVD is very probably better than Lanczos for any large decomposition. That said, it does SVD, not eigen decomposition, which means that the question of symmetrical matrices or positive definiteness doesn't much matter. Do you really need eigen-decomposition?

On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:

Just asking for possible replacement of our Lanczos-based PageRank implementation. - Peng
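The relationship behind "for the symmetric case, SVD is eigen decomposition" can be written out as follows. This is standard linear algebra rather than anything specific to Mahout's SSVD; the symbols Q, \Lambda, U, \Sigma, V are introduced here only for illustration.

```latex
% Real symmetric A: orthogonal eigendecomposition
A = Q \Lambda Q^{T}, \qquad Q^{T} Q = I, \qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)

% One valid SVD of the same matrix (up to ordering of the singular values)
A = U \Sigma V^{T}, \qquad U = Q, \qquad \Sigma = |\Lambda|, \qquad V = Q \,\operatorname{sign}(\Lambda)

% Hence \sigma_i = |\lambda_i|; the two decompositions coincide exactly
% when A is also positive semidefinite (all \lambda_i \ge 0).

% General (non-symmetric) A: singular vectors are eigenvectors of
% A^{T} A and A A^{T}, not of A itself
A^{T} A \, v_i = \sigma_i^{2} v_i, \qquad A A^{T} u_i = \sigma_i^{2} u_i
```

The last line is why an SVD alone does not hand back the eigenvectors of a non-symmetric matrix such as a PageRank transition matrix.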
Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?
Agreed, and this is the case where the Lanczos algorithm is obsolete. My point is: if SSVD is unable to find the eigenvector of an asymmetric matrix (this is a common formulation of PageRank, and some random walks, and many other things), then we still have to rely on a large-scale Lanczos algorithm.

On Mon 17 Feb 2014 04:25:16 PM EST, Ted Dunning wrote:

For the symmetric case, SVD is eigen decomposition.

On Mon, Feb 17, 2014 at 1:12 PM, peng pc...@uowmail.edu.au wrote:

If SSVD is not designed for such eigenvector problems, then I would vote for retaining the Lanczos algorithm. However, I would like to see the opposite case: I have tested both algorithms on the symmetric case, and SSVD is much faster and more accurate than its competitor.

Yours
Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

In PageRank I'm afraid I have no other option than eigenvector \lambda, but not singular vector u v:) The PageRank in Mahout was removed with other graph-based algorithms.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

SSVD is very probably better than Lanczos for any large decomposition. That said, it does SVD, not eigen decomposition, which means that the question of symmetrical matrices or positive definiteness doesn't much matter. Do you really need eigen-decomposition?

On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:

Just asking for possible replacement of our Lanczos-based PageRank implementation. - Peng
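For context, the dominant eigenvector of a non-symmetric PageRank transition matrix can also be approximated without either Lanczos or SSVD, using plain power iteration. The sketch below is a minimal single-machine illustration, not Mahout code and not the removed Mahout PageRank; the function name, the damping factor of 0.85, and the dict-of-lists graph format are all assumptions made for the example, and every link target is assumed to appear as a key.

```python
def pagerank_power_iteration(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate the PageRank vector by power iteration.

    links maps each node to the list of nodes it links to; every target
    must also be a key. Returns a dict node -> rank that sums to 1.
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Teleportation term shared by all nodes
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = links[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new[w] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes
                share = damping * rank[v] / n
                for w in nodes:
                    new[w] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for node, value in sorted(pagerank_power_iteration(graph).items()):
        print(node, round(value, 4))
```

Power iteration only recovers the single dominant eigenvector, which is all PageRank needs; it is not a substitute for a general eigensolver.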
Build failed in Jenkins: mahout-nightly » Mahout Release Package #1500
See https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-distribution/1500/

--
[INFO]
[INFO]
[INFO] Building Mahout Release Package 1.0-SNAPSHOT
[INFO]
[INFO]
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ mahout-distribution ---
[INFO] Deleting https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-distribution/ws/target
[INFO]
[INFO] --- build-helper-maven-plugin:1.8:remove-project-artifact (remove-old-mahout-artifacts) @ mahout-distribution ---
[INFO] /home/jenkins/jenkins-slave/maven-repositories/0/org/apache/mahout/mahout-distribution removed.
[INFO]
[INFO] --- maven-assembly-plugin:2.4:single (bin-assembly) @ mahout-distribution ---
[INFO] Reading assembly descriptor: src/main/assembly/bin.xml
Build failed in Jenkins: mahout-nightly #1500
See https://builds.apache.org/job/mahout-nightly/1500/

--
[...truncated 1977 lines...]
[INFO] Copying jackson-mapper-asl-1.9.12.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jackson-mapper-asl-1.9.12.jar
[INFO] Copying lucene-benchmark-4.6.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-benchmark-4.6.1.jar
[INFO] Copying junit-4.11.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/junit-4.11.jar
[INFO] Copying lucene-sandbox-4.6.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-sandbox-4.6.1.jar
[INFO] Copying slf4j-api-1.7.5.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/slf4j-api-1.7.5.jar
[INFO] Copying randomizedtesting-runner-2.0.15.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/randomizedtesting-runner-2.0.15.jar
[INFO] Copying jcl-over-slf4j-1.7.5.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jcl-over-slf4j-1.7.5.jar
[INFO] Copying mahout-math-1.0-SNAPSHOT-tests.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/mahout-math-1.0-SNAPSHOT-tests.jar
[INFO] Copying jaxb-api-2.2.2.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jaxb-api-2.2.2.jar
[INFO] Copying jettison-1.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jettison-1.1.jar
[INFO] Copying lucene-spatial-4.6.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-spatial-4.6.1.jar
[INFO] Copying commons-httpclient-3.0.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-httpclient-3.0.1.jar
[INFO] Copying commons-compress-1.4.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-compress-1.4.1.jar
[INFO] Copying hamcrest-core-1.3.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/hamcrest-core-1.3.jar
[INFO] Copying commons-lang-2.4.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-lang-2.4.jar
[INFO] Copying lucene-queries-4.6.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-queries-4.6.1.jar
[INFO] Copying commons-codec-1.4.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-codec-1.4.jar
[INFO] Copying solr-commons-csv-3.5.0.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/solr-commons-csv-3.5.0.jar
[INFO] Copying commons-collections-3.2.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-collections-3.2.1.jar
[INFO] Copying xercesImpl-2.9.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/xercesImpl-2.9.1.jar
[INFO] Copying mahout-core-1.0-SNAPSHOT.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/mahout-core-1.0-SNAPSHOT.jar
[INFO] Copying xstream-1.4.4.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/xstream-1.4.4.jar
[INFO] Copying mahout-integration-1.0-SNAPSHOT.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/mahout-integration-1.0-SNAPSHOT.jar
[INFO] Copying guava-16.0.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/guava-16.0.jar
[INFO] Copying slf4j-log4j12-1.7.5.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/slf4j-log4j12-1.7.5.jar
[INFO] Copying cglib-nodep-2.2.2.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/cglib-nodep-2.2.2.jar
[INFO] Copying commons-configuration-1.6.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-configuration-1.6.jar
[INFO] Copying commons-io-2.4.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-io-2.4.jar
[INFO] Copying jackson-xc-1.7.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jackson-xc-1.7.1.jar
[INFO] Copying jersey-json-1.8.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jersey-json-1.8.jar
[INFO] Copying commons-math-2.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-math-2.1.jar
[INFO] Copying jaxb-impl-2.2.3-1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jaxb-impl-2.2.3-1.jar
[INFO] Copying jersey-core-1.8.jar to
Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?
I bet PageRank in MLlib in Spark finds the stationary distribution much faster.

On Feb 17, 2014 1:33 PM, peng pc...@uowmail.edu.au wrote:

Agreed, and this is the case where the Lanczos algorithm is obsolete. My point is: if SSVD is unable to find the eigenvector of an asymmetric matrix (this is a common formulation of PageRank, and some random walks, and many other things), then we still have to rely on a large-scale Lanczos algorithm.

On Mon 17 Feb 2014 04:25:16 PM EST, Ted Dunning wrote:

For the symmetric case, SVD is eigen decomposition.

On Mon, Feb 17, 2014 at 1:12 PM, peng pc...@uowmail.edu.au wrote:

If SSVD is not designed for such eigenvector problems, then I would vote for retaining the Lanczos algorithm. However, I would like to see the opposite case: I have tested both algorithms on the symmetric case, and SSVD is much faster and more accurate than its competitor.

Yours
Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

In PageRank I'm afraid I have no other option than eigenvector \lambda, but not singular vector u v:) The PageRank in Mahout was removed with other graph-based algorithms.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

SSVD is very probably better than Lanczos for any large decomposition. That said, it does SVD, not eigen decomposition, which means that the question of symmetrical matrices or positive definiteness doesn't much matter. Do you really need eigen-decomposition?

On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:

Just asking for possible replacement of our Lanczos-based PageRank implementation. - Peng