Re: Does SSVD support eigendecomposition of non-symmetric, non-positive-semidefinite matrices better than Lanczos?
Really? I guess PageRank in Mahout was removed due to the inherent network bottleneck of MapReduce. But I didn't know MLlib has the implementation. Is the MLlib implementation based on Lanczos or SSVD? Just curious...

On 17/02/2014 11:11 PM, Dmitriy Lyubimov wrote:
> I bet PageRank in MLlib on Spark finds the stationary distribution much faster.
>
> On Feb 17, 2014 1:33 PM, peng <pc...@uowmail.edu.au> wrote:
>> Agreed, and this is the case where the Lanczos algorithm is obsolete. My point is: if SSVD is unable to find the eigenvector of an asymmetric matrix (this is a common formulation of PageRank, some random walks, and many other things), then we still have to rely on the large-scale Lanczos algorithm.
>>
>> On Mon 17 Feb 2014 04:25:16 PM EST, Ted Dunning wrote:
>>> For the symmetric case, SVD is eigendecomposition.
>>>
>>> On Mon, Feb 17, 2014 at 1:12 PM, peng <pc...@uowmail.edu.au> wrote:
>>>> If SSVD is not designed for such eigenvector problems, then I would vote for retaining the Lanczos algorithm. In the opposite (symmetric) case, however, I have tested both algorithms, and SSVD is much faster and more accurate than its competitor. Yours, Peng
>>>>
>>>> On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:
>>>>> In PageRank I'm afraid I have no option other than the eigenvector for eigenvalue \lambda, not the singular vectors u, v :) The PageRank in Mahout was removed along with the other graph-based algorithms.
>>>>>
>>>>> On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:
>>>>>> SSVD is very probably better than Lanczos for any large decomposition. That said, it does SVD, not eigendecomposition, which means that the question of symmetric matrices or positive definiteness doesn't much matter. Do you really need eigendecomposition?
>>>>>>
>>>>>> On Tue, Feb 11, 2014 at 2:55 PM, peng <pc...@uowmail.edu.au> wrote:
>>>>>>> Just asking about a possible replacement for our Lanczos-based PageRank implementation. - Peng
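Ted's remark that SVD coincides with eigendecomposition for symmetric matrices can be checked numerically. The following is a minimal NumPy sketch (illustrative only, not Mahout code): for any symmetric matrix the singular values equal the absolute values of the eigenvalues, and for a symmetric positive-semidefinite matrix they coincide exactly, which is why sign information is lost in the general asymmetric PageRank setting discussed in this thread.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B + B.T  # symmetric, but not necessarily positive semi-definite

eigvals = np.linalg.eigvalsh(A)                 # real eigenvalues of a symmetric matrix
singvals = np.linalg.svd(A, compute_uv=False)   # singular values of the same matrix

# For symmetric A, the singular values are the absolute eigenvalues:
assert np.allclose(np.sort(singvals), np.sort(np.abs(eigvals)))

# For a symmetric positive semi-definite matrix they coincide exactly:
P = B @ B.T
assert np.allclose(np.sort(np.linalg.svd(P, compute_uv=False)),
                   np.sort(np.linalg.eigvalsh(P)))
```

This is exactly the sense in which SSVD "does SVD, not eigendecomposition": for an asymmetric matrix the left/right singular vectors are generally not eigenvectors, so SSVD cannot directly replace a Lanczos-style eigensolver there.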
Re: Does SSVD support eigendecomposition of non-symmetric, non-positive-semidefinite matrices better than Lanczos?
You can also use Giraph for a superfast PageRank implementation. Giraph even runs on standard Hadoop clusters. PageRank is usually computed by power iteration, which is much simpler than Lanczos or SSVD and only gives the eigenvector associated with the largest eigenvalue.

On 18.02.2014 09:33, Peng Cheng <pc...@uowmail.edu.au> wrote:
> Really? I guess PageRank in Mahout was removed due to the inherent network bottleneck of MapReduce. But I didn't know MLlib has the implementation. Is the MLlib implementation based on Lanczos or SSVD? Just curious...
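Sebastian's point, that PageRank only needs the dominant eigenvector and that power iteration suffices, can be sketched in a few lines. This is a plain NumPy toy, not the Giraph or MLlib implementation; the 4-page link graph and the damping factor are made up for illustration:

```python
import numpy as np

# Toy 4-page link graph: links[j] lists the pages that page j links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = 4

# Column-stochastic transition matrix M: M[i, j] = 1/outdegree(j) if j links to i.
M = np.zeros((n, n))
for j, outs in links.items():
    for i in outs:
        M[i, j] = 1.0 / len(outs)

d = 0.85                 # damping factor (the usual choice)
v = np.full(n, 1.0 / n)  # start from the uniform distribution
for _ in range(100):     # power iteration on the damped ("Google") matrix
    v_next = d * (M @ v) + (1.0 - d) / n
    if np.abs(v_next - v).sum() < 1e-12:
        v = v_next
        break
    v = v_next

print(v)  # stationary distribution; entries sum to 1
```

Because the damped matrix is column-stochastic, the dominant eigenvalue is 1 and the iteration converges to its eigenvector, the PageRank vector, with no Krylov machinery needed.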
[jira] [Created] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel
Isabel Drost-Fromm created MAHOUT-1418:
--------------------------------------

     Summary: Removal of write access to anything but CMS for username isabel
         Key: MAHOUT-1418
         URL: https://issues.apache.org/jira/browse/MAHOUT-1418
     Project: Mahout
  Issue Type: Task
    Reporter: Isabel Drost-Fromm
    Assignee: Isabel Drost-Fromm
    Priority: Trivial

Hi, please remove write access for user name isabel, effective mid-March. For background, check the Mahout board report of October last year*. Don't worry, I'm not planning to go completely silent and offline by then. However, I know from several years of doing Berlin Buzzwords that being completely sleep-deprived is not a good state in which to commit to Subversion, except that when sleep-deprived I usually don't remember this insight. So this is my safety net, forcing me to go through the regular "submit a patch in JIRA, get it reviewed and committed" cycle (except for documentation changes).

* http://www.apache.org/foundation/records/minutes/2013/board_minutes_2013_10_16.txt

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel
[ https://issues.apache.org/jira/browse/MAHOUT-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903884#comment-13903884 ]

Isabel Drost-Fromm commented on MAHOUT-1418:
--------------------------------------------

Created INFRA issue due mid March to execute the permission change.
[jira] [Commented] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel
[ https://issues.apache.org/jira/browse/MAHOUT-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904200#comment-13904200 ]

Suneel Marthi commented on MAHOUT-1418:
---------------------------------------

Are u serious u really need to do this ?
Re: Mahout 0.9 Release Notes - First Draft
Could someone please point me to the URL for adding Mahout release notes?

On Monday, February 17, 2014 3:27 PM, Ellen Friedman <b.ellen.fried...@gmail.com> wrote:
> Hi Suneel, thanks for the notes. I'm inquiring about the status of the notes and the website update announcing 0.9: Ted has reviewed the release notes; were you waiting for additional input, or are they ready to go on the website? Are you the one who updates the site? I've been asked to write a short blog post on the release but wanted to wait until the site is updated. Thanks much, Ellen
>
> On Tue, Feb 11, 2014 at 10:06 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>> Here's a draft of the Release Notes for Mahout 0.9; please review.
>> ------------------------------------------------------------------
>> The Apache Mahout PMC is pleased to announce the release of Mahout 0.9.
>>
>> Mahout's goal is to build scalable machine learning libraries focused primarily on the areas of collaborative filtering (recommenders), clustering and classification (known collectively as the 3Cs), as well as the necessary infrastructure to support those implementations, including, but not limited to, math packages for statistics, linear algebra and more, Java primitive collections, local and distributed vector and matrix classes, and a variety of integration code for working with popular packages like Apache Hadoop, Apache Lucene, Apache HBase, Apache Cassandra and much more.
>>
>> The 0.9 release is mainly a clean-up release in preparation for an upcoming 1.0 release targeted for the first half of 2014, but there are a few significant new features, which are highlighted below.
>>
>> To get started with Apache Mahout 0.9, download the release artifacts and signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the central Maven repository. As with any release, we wish to thank all of the users of and contributors to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for individual credits, as there are too many to list here.
>>
>> GETTING STARTED
>>
>> In the release package, the examples directory contains several working examples of the core functionality available in Mahout. These can be run via scripts in the examples/bin directory and will prompt you for more information to help you try things out. Most examples do not need a Hadoop cluster in order to run.
>>
>> RELEASE HIGHLIGHTS
>>
>> The highlights of the Apache Mahout 0.9 release include, but are not limited to, the list below. For further information, see the included CHANGELOG [1] file.
>>
>> - MAHOUT-1297: Scala DSL bindings for Mahout Math linear algebra. See http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html
>> - MAHOUT-1288: Recommenders as a search. See https://github.com/pferrel/solr-recommender
>> - MAHOUT-1364: Upgrade Mahout to Lucene 4.6.1
>> - MAHOUT-1361: Online algorithm for computing accurate quantiles using 1-dimensional clustering. See https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf for details.
>> - MAHOUT-1265: Multilayer Perceptron (MLP) classifier. This is an early implementation of MLP to solicit user feedback; it still needs to be integrated into Mahout's processing pipeline to work with Mahout's vectors.
>> - Removed deprecated algorithms, as they have either been replaced by better-performing algorithms or lacked user support and maintenance.
>> - The usual bug fixes. See [2] for more information on the 0.9 release. A total of 113 separate JIRA issues were addressed in this release.
>>
>> The following algorithms that were marked deprecated in 0.8 have been removed in 0.9:
>>
>> - From Clustering:
>>   - LDA: switched the implementation from Dirichlet to Collapsed Variational Bayes (CVB)
>>   - Meanshift
>>   - MinHash: removed due to poor performance, lack of support and lack of usage
>> - From Classification (both are sequential implementations):
>>   - Winnow: lack of actual usage and support
>>   - Perceptron: lack of actual usage and support
>> - Collaborative Filtering:
>>   - SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone
>>   - Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
>>   - TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender
>> - Mahout Math:
>>   - Hadoop entropy code in org.apache.mahout.math.stats.entropy
>>
>> CONTRIBUTING
>>
>> Mahout is always looking for contributions focused on the 3Cs. If you are interested in contributing, please see our contribution page, http://mahout.apache.org/developers/how-to-contribute.html, or contact us via email at dev@mahout.apache.org. As the project moves towards a 1.0 release, the community will focus on key algorithms that are proven to scale in production and have seen widespread adoption.
>>
>> [1] http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?view=markup&pathrev=1563661
>> [2]
Apache Mahout 0.9 released
The Apache Mahout PMC is pleased to announce the release of Mahout 0.9.

Mahout's goal is to build scalable machine learning libraries focused primarily on the areas of collaborative filtering (recommenders), clustering and classification (known collectively as the 3Cs), as well as the necessary infrastructure to support those implementations, including, but not limited to, math packages for statistics, linear algebra and more, Java primitive collections, local and distributed vector and matrix classes, and a variety of integration code for working with popular packages like Apache Hadoop, Apache Lucene, Apache HBase, Apache Cassandra and much more.

The 0.9 release is mainly a clean-up release in preparation for an upcoming 1.0 release targeted for the first half of 2014, but there are a few significant new features, which are highlighted below.

To get started with Apache Mahout 0.9, download the release artifacts and signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the central Maven repository. As with any release, we wish to thank all of the users of and contributors to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for individual credits, as there are too many to list here.

GETTING STARTED

In the release package, the examples directory contains several working examples of the core functionality available in Mahout. These can be run via scripts in the examples/bin directory and will prompt you for more information to help you try things out. Most examples do not need a Hadoop cluster in order to run.

RELEASE HIGHLIGHTS

The highlights of the Apache Mahout 0.9 release include, but are not limited to, the list below. For further information, see the included CHANGELOG [1] file.

- MAHOUT-1245: A new and improved Mahout website based on Apache CMS
- MAHOUT-1265: Multilayer Perceptron (MLP) classifier. This is an early implementation of MLP to solicit user feedback; it still needs to be integrated into Mahout's processing pipeline to work with Mahout's vectors.
- MAHOUT-1297: Scala DSL bindings for Mahout Math linear algebra. See http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html
- MAHOUT-1288: Recommenders as a search. See https://github.com/pferrel/solr-recommender
- MAHOUT-1300: Support for easy functional Matrix views and derivatives
- MAHOUT-1343: JSON output format for ClusterDumper
- MAHOUT-1345: Enable randomised testing for all Mahout modules using Carrot RandomizedRunner
- MAHOUT-1361: Online algorithm for computing accurate quantiles using 1-dimensional clustering. See https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf for details.
- MAHOUT-1364: Upgrade Mahout to Lucene 4.6.1
- Removed deprecated algorithms, as they have either been replaced by better-performing algorithms or lacked user support and maintenance.
- The usual bug fixes. See [2] for more information on the 0.9 release. A total of 113 separate JIRA issues were addressed in this release.

The following algorithms that were marked deprecated in 0.8 have been removed in 0.9:

- From Clustering:
  - LDA: switched the implementation from Gibbs Sampling to Collapsed Variational Bayes (CVB)
  - Meanshift
  - MinHash: removed due to poor performance, lack of support and lack of usage
- From Classification (both are sequential implementations):
  - Winnow: lack of actual usage and support
  - Perceptron: lack of actual usage and support
- Collaborative Filtering:
  - SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone
  - Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
  - TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender
- Mahout Math:
  - Hadoop entropy code in org.apache.mahout.math.stats.entropy

CONTRIBUTING

Mahout is always looking for contributions focused on the 3Cs. If you are interested in contributing, please see our contribution page, http://mahout.apache.org/developers/how-to-contribute.html, or contact us via email at dev@mahout.apache.org. As the project moves towards a 1.0 release, the community will focus on key algorithms that are proven to scale in production and have seen widespread adoption.

[1] http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?view=markup&pathrev=1563661
[2] https://issues.apache.org/jira/browse/MAHOUT-1411?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.9%22
Re: Mahout 0.9 Release Notes - First Draft
The release notes are ready (identical to the announcement text); I'm just not sure where on the website they should go. If someone could point me to a location, I will go ahead and update it.

On Monday, February 17, 2014 3:27 PM, Ellen Friedman <b.ellen.fried...@gmail.com> wrote:
> Hi Suneel, thanks for the notes. I'm inquiring about the status of the notes and the website update announcing 0.9: Ted has reviewed the release notes; were you waiting for additional input, or are they ready to go on the website? Are you the one who updates the site? I've been asked to write a short blog post on the release but wanted to wait until the site is updated. Thanks much, Ellen
[jira] [Issue Comment Deleted] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel
[ https://issues.apache.org/jira/browse/MAHOUT-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi updated MAHOUT-1418:
----------------------------------

Comment: was deleted (was: Created INFRA issue due mid March to execute the permission change.)
[jira] [Issue Comment Deleted] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel
[ https://issues.apache.org/jira/browse/MAHOUT-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi updated MAHOUT-1418:
----------------------------------

Comment: was deleted (was: Are u serious u really need to do this ?)
[jira] [Created] (MAHOUT-1419) Random decision forest is excessively slow on numeric features
Sean Owen created MAHOUT-1419:
-----------------------------

          Summary: Random decision forest is excessively slow on numeric features
              Key: MAHOUT-1419
              URL: https://issues.apache.org/jira/browse/MAHOUT-1419
          Project: Mahout
       Issue Type: Bug
       Components: Classification
 Affects Versions: 0.9, 0.8, 0.7
         Reporter: Sean Owen

Follow-up to MAHOUT-1417. There's a customer running this and observing it take an unreasonably long time on about 2 GB of data: around 24 hours, when other random-decision-forest M/R implementations take 9 minutes. The difference is big enough to probably be considered a defect. MAHOUT-1417 got that down to about 5 hours; I am trying to improve it further.

One key issue seems to be how splits are evaluated over numeric features. A split is tried for every distinct numeric value of the feature in the whole data set. Since these are floating-point values, they could be (and in the customer's case are) all distinct. 200K rows means 200K splits to evaluate every time a node is built on the feature.

A better approach is to sample percentiles of the feature and evaluate only those as splits. Doing that really efficiently would require a substantial rewrite. However, there are some modest changes possible which get some of the benefit, and they appear to make it run about 3x faster. That is, on a data set that exhibits this problem, meaning one whose numeric features are generally distinct, which is not exotic. There are comparable but different problems with the handling of categorical features, but that's for a different patch.

I have a patch, but it changes behavior to some extent, since it evaluates only a sample of splits instead of every single possible one. In particular, it makes the output of OptIgSplit no longer match DefaultIgSplit. I think the point is that "optimized" may mean making different choices of split here, which could yield differing trees, so that test probably has to go.

(Along the way I found a number of micro-optimizations in this part of the code that added up to maybe a 3% speedup, and fixed an NPE too.) I will propose a patch shortly with all of this for thoughts.
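The percentile-sampling idea Sean describes, evaluating only sampled quantiles of a numeric feature as split thresholds instead of every distinct value, can be sketched as follows. This is a standalone NumPy illustration, not the actual OptIgSplit patch; the function name and the candidate budget of 16 are arbitrary choices for the example:

```python
import numpy as np

def split_candidates(values, max_candidates=16):
    """Return at most max_candidates split thresholds for a numeric feature.

    Instead of trying every distinct value (200K rows -> 200K candidate
    splits), sample evenly spaced percentiles and deduplicate them.
    """
    values = np.asarray(values, dtype=float)
    distinct = np.unique(values)
    if len(distinct) <= max_candidates:
        return distinct  # few distinct values: just try them all
    # Interior percentiles only; splitting at the min/max is useless.
    qs = np.linspace(0.0, 100.0, max_candidates + 2)[1:-1]
    return np.unique(np.percentile(values, qs))

# 200,000 essentially distinct floating-point values, but only ~16
# thresholds to evaluate per node instead of 200,000:
feature = np.random.default_rng(1).standard_normal(200_000)
cands = split_candidates(feature)
print(len(cands))
```

The trade-off is the one noted in the issue: the chosen split may differ slightly from the exhaustive optimum, so trees (and any test comparing against DefaultIgSplit's output) can legitimately differ.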
Re: Does SSVD support eigendecomposition of non-symmetric, non-positive-semidefinite matrices better than Lanczos?
Thanks a lot, Sebastian, Ted and Dmitriy. I'll try Giraph for a performance benchmark. You are right: power iteration is just the simplest form of Lanczos, so it shouldn't be in scope here.

On Tue 18 Feb 2014 03:59:57 AM EST, Sebastian Schelter wrote:
> You can also use Giraph for a superfast PageRank implementation. Giraph even runs on standard Hadoop clusters. PageRank is usually computed by power iteration, which is much simpler than Lanczos or SSVD and only gives the eigenvector associated with the largest eigenvalue.
Build failed in Jenkins: mahout-nightly » Mahout Release Package #1501
See https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-distribution/1501/
--
[INFO]
[INFO]
[INFO] Building Mahout Release Package 1.0-SNAPSHOT
[INFO]
[INFO]
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ mahout-distribution ---
[INFO] Deleting https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-distribution/ws/target
[INFO]
[INFO] --- build-helper-maven-plugin:1.8:remove-project-artifact (remove-old-mahout-artifacts) @ mahout-distribution ---
[INFO] /home/jenkins/jenkins-slave/maven-repositories/0/org/apache/mahout/mahout-distribution removed.
[INFO]
[INFO] --- maven-assembly-plugin:2.4:single (bin-assembly) @ mahout-distribution ---
[INFO] Reading assembly descriptor: src/main/assembly/bin.xml
Build failed in Jenkins: mahout-nightly #1501
See https://builds.apache.org/job/mahout-nightly/1501/

[...truncated 1978 lines...]
[INFO] Copying dependency jars (jackson-mapper-asl-1.9.12, lucene-benchmark-4.6.1, junit-4.11, slf4j-api-1.7.5, mahout-math-1.0-SNAPSHOT-tests, mahout-core-1.0-SNAPSHOT, mahout-integration-1.0-SNAPSHOT, guava-16.0, and others) to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/
[jira] [Commented] (MAHOUT-1419) Random decision forest is excessively slow on numeric features
[ https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905056#comment-13905056 ]

Ted Dunning commented on MAHOUT-1419:
-------------------------------------

With t-digest in the OnlineSummarizer now, it is quite possible to have very accurate and quick quantile estimates. That should allow very quick picking of splits as well.

The basic idea would be to keep an OnlineSummarizer for each numerical variable. When a split is needed, pick a random number in [0,1), and the split point is that quantile. If you want 10 random splits, do this 10 times. If you want structured points, like every percentile from 20% to 80%, that is just as simple. It seems like this would be as simple as changing a loop from looping over distinct values to looping over quantile values.

> Random decision forest is excessively slow on numeric features
> --------------------------------------------------------------
>
>                 Key: MAHOUT-1419
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1419
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.7, 0.8, 0.9
>            Reporter: Sean Owen
>
> Follow-up to MAHOUT-1417. There's a customer running this and observing it take an unreasonably long time on about 2GB of data -- like, 24 hours when other RDF M/R implementations take 9 minutes. The difference is big enough to probably be considered a defect. MAHOUT-1417 got that down to about 5 hours. I am trying to further improve it.
> One key issue seems to be how splits are evaluated over numeric features. A split is tried for every distinct numeric value of the feature in the whole data set. Since these are floating-point values, they could be (and in the customer's case are) all distinct. 200K rows means 200K splits to evaluate every time a node is built on the feature. A better approach is to sample percentiles of the feature and evaluate only those as splits. Really doing that efficiently would require a lot of rewrite.
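A minimal Java sketch of the quantile-based split picking Ted describes above. An exact quantile over a sorted sample stands in for the OnlineSummarizer/t-digest that Mahout would use in production, and the class and method names here are hypothetical, not Mahout APIs:

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: keep a quantile summary per numeric feature; draw split points
// from quantiles instead of trying every distinct value. An exact
// sorted-sample quantile stands in for OnlineSummarizer / t-digest.
class QuantileSplitPicker {
  private final double[] sorted;  // sorted sample of one numeric feature

  QuantileSplitPicker(double[] values) {
    sorted = values.clone();
    Arrays.sort(sorted);
  }

  // q in [0,1): the q-th quantile of the observed values
  double quantile(double q) {
    int i = (int) (q * sorted.length);
    return sorted[Math.min(i, sorted.length - 1)];
  }

  // "pick a random number in [0,1) and the split point is that quantile"
  double[] randomSplits(int n, Random rng) {
    double[] splits = new double[n];
    for (int i = 0; i < n; i++) {
      splits[i] = quantile(rng.nextDouble());
    }
    return splits;
  }

  // structured points, e.g. every percentile from 20% to 80%
  double[] percentileSplits(int fromPct, int toPct) {
    double[] splits = new double[toPct - fromPct + 1];
    for (int p = fromPct; p <= toPct; p++) {
      splits[p - fromPct] = quantile(p / 100.0);
    }
    return splits;
  }
}
```

Either way, the number of candidate splits becomes a fixed constant rather than growing with the number of distinct values in the data set.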
> However, there are some modest changes possible that capture some of the benefit, and they appear to make it run about 3x faster. That is, on a data set that exhibits this problem, meaning one whose numeric features are generally all distinct -- which is not exotic. There are comparable but different problems with the handling of categorical features, but that's for a different patch.
> I have a patch, but it changes behavior to some extent, since it evaluates only a sample of splits instead of every single possible one. In particular, it makes the output of OptIgSplit no longer match DefaultIgSplit. Although I think the point is that "optimized" may mean making different choices of split here, which could yield different trees. So that test probably has to go.
> (Along the way I found a number of micro-optimizations in this part of the code that added up to maybe a 3% speedup, and fixed an NPE too.) I will propose a patch shortly with all of this for thoughts.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
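The change Sean describes, evaluating only a sample of candidate splits instead of one per distinct value, can be sketched as follows. This is an illustration of the idea under assumed names, not Mahout's actual OptIgSplit internals:

```java
import java.util.Arrays;

// Sketch: evaluating an information-gain split at every distinct value
// of a numeric feature costs O(#distinct) evaluations per node. Capping
// the candidates at a fixed count, evenly spaced through the sorted
// values, bounds that cost regardless of data size.
class SampledSplits {

  // The slow path: one candidate split per distinct value.
  static double[] allCandidates(double[] values) {
    return Arrays.stream(values).sorted().distinct().toArray();
  }

  // The sampled path: at most maxSplits candidates, spaced evenly
  // through the sorted values (approximate percentile positions).
  static double[] sampledCandidates(double[] values, int maxSplits) {
    double[] s = values.clone();
    Arrays.sort(s);
    if (s.length <= maxSplits) {
      return s;
    }
    double[] out = new double[maxSplits];
    for (int i = 0; i < maxSplits; i++) {
      // position candidate i at the (i+1)/(maxSplits+1) fraction of the data
      out[i] = s[(int) ((long) (i + 1) * s.length / (maxSplits + 1))];
    }
    return out;
  }
}
```

On 200K all-distinct rows this replaces 200K split evaluations per node with a handful, which is the source of the speedup; the trade-off, as noted above, is that the chosen split (and hence the tree) may differ from the exhaustive version.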