[ANNOUNCE] Apache Mahout 0.13.0 Release
The Apache Mahout PMC is pleased to announce the release of Mahout 0.13.0. Mahout's goal is to create an environment for quickly creating machine-learning applications that scale and run on the highest-performance parallel computation engines available. Mahout comprises an interactive environment and library that support generalized scalable linear algebra and include many modern machine-learning algorithms.

This release ships some major changes from 0.12.2, including computation on GPUs and a simplified framework for building new algorithms.

To get started with Apache Mahout 0.13.0, download the release artifacts and signatures from http://www.apache.org/dist/mahout/0.13.0/. Many thanks to the contributors and committers who were part of this release.

RELEASE HIGHLIGHTS

Mahout-Samsara has implementations for these generalized concepts:

* In-core matrices backed by ViennaCL [3], providing in some cases speedups of an order of magnitude
* A JavaCPP bridge to native/GPU operations in ViennaCL
* Distributed GPU Matrix-Matrix and Matrix-Vector multiplication on Spark
* Distributed OpenMP Matrix-Matrix and Matrix-Vector multiplication on Spark
* Sparse and dense matrix GPU-backed support
* Fault tolerance: the new solvers fall back to their Mahout JVM counterparts in the case of failure on GPU or OpenMP
* A new scikit-learn-like framework for algorithms, with the goal of creating a consistent API for various machine-learning algorithms and an orderly package structure for grouping regression, classification, clustering, and pre-processing algorithms together
* New DRM wrappers in Spark Bindings, making it more convenient to create DRMs from MLLib RDDs and DataFrames
* MahoutConversions adds Scala-like compatibility to Vectors, introducing methods such as toArray() and toMap()

Mahout has historically focused on highly scalable algorithms, and since moving on from MapReduce-based jobs, Mahout now includes some Mahout-Samsara based implementations:

* Distributed and in-core Stochastic Singular Value Decomposition (SSVD)
* Distributed Principal Component Analysis (PCA)
* Distributed and in-core QR Reduction (QR)
* Distributed Alternating Least Squares (ALS)
* Collaborative Filtering: Item and Row Similarity based on cooccurrence and supporting multimodal user actions
* Distributed Naive Bayes Training and Classification

STATS

A total of 62 separate JIRA issues are addressed in this release [1].

GETTING STARTED

Download the release artifacts and signatures at https://mahout.apache.org/general/downloads.html. The examples directory contains several working examples of the core functionality available in Mahout. These can be run via scripts in the examples/bin directory. Most examples do not need a Hadoop cluster in order to run.
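The fault-tolerance highlight above amounts to a try-native, fall-back-to-JVM control flow: attempt the GPU/OpenMP-backed solver, and use the pure-JVM implementation if the native path fails. A minimal Python sketch of that pattern follows; the solver names and the simulated failure are hypothetical stand-ins (Mahout's actual solvers are Scala/Java backed by ViennaCL), shown only to illustrate the design.

```python
# Sketch of the fallback pattern described above: try the native
# (GPU/OpenMP) solver first, fall back to the JVM counterpart on failure.
# Names are hypothetical; only the control flow mirrors the release note.

def solve_native(a, b):
    # Stand-in for a ViennaCL-backed matrix multiply; here we simulate
    # a native failure so the fallback path is exercised.
    raise RuntimeError("no GPU / OpenMP backend available")

def solve_jvm(a, b):
    # Stand-in for the always-available JVM matrix multiply.
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def solve_with_fallback(a, b):
    try:
        return solve_native(a, b)
    except RuntimeError:
        return solve_jvm(a, b)

result = solve_with_fallback([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# result == [[19, 22], [43, 50]]
```

The same shape applies per operation: because the probe-and-fallback happens at each supported operation (see KNOWN ISSUES below), a failed native call degrades to the JVM path rather than aborting the job.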
FUTURE PLANS

0.13.1

As the project moves towards a 0.13.1 release, we are working on the following:

* Further native integration for increased speedups
* JCuda backing for in-core matrices and CUDA solvers
* Enumeration across multiple GPUs per JVM instance
* GPU/OpenMP acceleration for linear solvers
* Further integration with other libraries such as MLLib and SparkML
* Scala 2.11 support
* Spark 2.x support
* Incorporation of more statistical operations
* Runtime probing of available hardware and caching of the most appropriate solver

Post-0.13.1

We already see the need for work in these areas:

* Support for native iterative solvers
* A more robust algorithm library
* Smarter probing and optimization of multiplications

ACKNOWLEDGMENTS

Many thanks to Karl Rupp of the ViennaCL [3] project for his help pushing the bindings through over many email exchanges; we greatly appreciate his involvement. Many thanks as well to Samuel Audet of the JavaCPP [4] project for his help with the team's usage of JavaCPP to produce the JNI layer for the native GPU and OpenMP bindings, which was also key to this major release.

CONTRIBUTING

If you are interested in contributing, please see our How to Contribute [2] page or contact us via email at d...@mahout.apache.org.

CREDITS

As with every release, we wish to thank all of the users of and contributors to Mahout. Please see the JIRA Release Notes [1] for individual credits, as there are too many to list here.

KNOWN ISSUES:

* The classify-wikipedia.sh example has an outdated link to the data files.
  A workaround is to change the download section of the script to: `curl https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p002336425p003046511.bz2 -o ${WORK_DIR}/wikixml/enwiki-latest-pages-articles.xml.bz2`
* Currently, GPU acceleration for supported operations is limited to a single JVM instance
* Occasional segfaults with certain GPU models and computations
* On older GPUs, some tests fail when building ViennaCL due to card limitations
* Currently, automatic probing of the system's hardware happens at each supported operation, adding some overhead

[1]
Re: Lambda and Kappa CCO
Ted thinks this can be done with DBs alone. What I proposed was in DBs like Solr/Elasticsearch and a persistent event cache (HBase, Cassandra, etc.) but in-memory models for faster indicator calculations, leading to mutable model updates in ES/Solr.

One primary reason for kappa over lambda is items with short life spans or rapidly changing catalogs, things like news. The other point for online learning is the volume of data that must be stored and re-processed. Kappa only deals with small incremental changes, so its resource cost will be much smaller than lambda's, especially for slowly changing models, where most updates will be no-ops.

In any case, in kappa there would be no explicit matrix or vector multiply. If we do in-memory data structures, I doubt they would be Mahout ones.

On Apr 9, 2017, at 3:43 PM, Trevor Grant wrote:

Specifically, I hacked together a Lambda Streaming CCO with Spark and Flink for a demo for my upcoming FlinkForward talk. Will post code once I finish it / strip all my creds out.

In short, the lack of serialization in Mahout in-core vectors/matrices makes handing them off / dealing with them somewhat tedious.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*

On Sun, Apr 9, 2017 at 5:39 PM, Andrew Palumbo wrote:

> Pat-
>
> What can we do from the Mahout side? Would we need any new data
> structures? Trevor and I were just discussing some of the troubles of
> near-real-time matrix streaming.
> --
> *From:* Pat Ferrel
> *Sent:* Monday, March 27, 2017 2:42:55 PM
> *To:* Ted Dunning; user@mahout.apache.org
> *Cc:* Trevor Grant; Ted Dunning; s...@apache.org
> *Subject:* Re: Lambda and Kappa CCO
>
> Agreed. Downsampling was ignored in several places, and with it a great
> deal of input is a no-op. Without downsampling too many things need to
> change.
>
> Also, everything is dependent on this rather vague sentence: "determine
> if the new interaction element cross-occurs with A and if so calculate the
> llr score", which needs a lot more explanation. Whether to use Mahout
> in-memory objects or to reimplement some in high-speed data structures is a
> big question.
>
> The good thing I noticed in writing this is that model update and real
> time can be arbitrarily far apart; the system degrades gracefully. So
> during high load it may fall behind, but as long as user behavior is
> up-to-date and persisted (it will be) we are still in pretty good shape.
>
> On Mar 26, 2017, at 6:26 PM, Ted Dunning wrote:
>
> I think that this analysis omits the fact that one user interaction causes
> many cooccurrences to change.
>
> This becomes feasible if you include the effect of down-sampling, but that
> has to be in the algorithm.
>
> From: Pat Ferrel
> Sent: Saturday, March 25, 2017 12:01:00 PM
> To: Trevor Grant; user@mahout.apache.org
> Cc: Ted Dunning; s...@apache.org
> Subject: Lambda and Kappa CCO
>
> This is an overview of, and a proposal for, turning the multi-modal Correlated
> Cross-Occurrence (CCO) recommender from a Lambda-style learner into an online,
> streaming, incrementally updated Kappa-style learner.
>
> # The CCO Recommender: Lambda-style
>
> We have largely solved the problems of calculating the multi-modal
> Correlated Cross-Occurrence models and serving recommendations in real time
> from real-time user behavior. The model sits in Lucene (Elasticsearch or
> Solr) in a scalable way, and the typical query producing personalized
> recommendations from real-time user behavior completes with 25 ms latency.
>
> # CCO Algorithm
>
> A = rows are users, columns are items they have "converted" on (purchase,
> read, watch). A represents the conversion event, the interaction that you
> want to recommend.
> B = rows are users, columns are items that the user has shown some
> preference for, but not necessarily the same items as A. B represents a
> different interaction than A. B might be a preference for some category,
> brand, genre, or just a detailed item page view; or all of these in B, C, D,
> etc.
> h_a = a particular user's history of A-type interactions, a vector of
> items that our user converted on.
> h_b = a particular user's history of B-type interactions, a vector of
> items that our user had B-type interactions with.
>
> CCO says:
>
> [A'A]h_a + [A'B]h_b + [A'C]h_c = r, where r is the weighted items from A
> that represent personalized recommendations for our particular user.
>
> The innovation here is that A, B, C, ... represent multi-modal data:
> interactions of all types, on item-sets of arbitrary types. In other
> words, we can look at virtually any action or possible indicator of user
> preference or taste. We strengthen the above raw
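The "llr score" discussed in the thread is Dunning's log-likelihood ratio over a 2x2 cooccurrence contingency table, and the scoring formula above is plain matrix algebra. A minimal Python sketch of both pieces, using the A, B, h_a, h_b notation from the post; the tiny dense matrices are illustrative only, and in the real model the raw counts in A'A and A'B are replaced by 0/1 indicators that pass an LLR threshold before indexing into Solr/Elasticsearch.

```python
import math

# Log-likelihood ratio (G^2) for a 2x2 contingency table, as used in
# cooccurrence analysis:
#   k11 = users who interacted with both items, k12 = item 1 only,
#   k21 = item 2 only,                          k22 = neither.
def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    total = sum(counts)
    return xlogx(total) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# [A'A]h_a + [A'B]h_b with tiny dense (users x items) matrices.
def transpose_times(x, y):
    # X'Y: cooccurrence counts between X-items (rows) and Y-items (cols).
    return [[sum(x[u][i] * y[u][j] for u in range(len(x)))
             for j in range(len(y[0]))] for i in range(len(x[0]))]

def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

A = [[1, 0],          # two users, two conversion items
     [1, 1]]
B = [[0, 1],          # same users, two secondary-interaction items
     [1, 1]]
h_a = [1, 0]          # our user converted on A-item 0
h_b = [0, 1]          # and had a B-interaction with B-item 1

AtA = transpose_times(A, A)
AtB = transpose_times(A, B)
r = [s + t for s, t in zip(mat_vec(AtA, h_a), mat_vec(AtB, h_b))]
# r weights the A-items (conversion candidates) for this user.
```

Note that `llr` returns 0 for an independent table (e.g. all four cells equal), which is what makes it usable as a cross-occurrence filter: only item pairs with anomalously high cooccurrence survive into the indexed model.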