[ANNOUNCE] Apache Mahout 0.13.0 Release

2017-04-17 Thread Andrew Musselman
The Apache Mahout PMC is pleased to announce the release of Mahout 0.13.0.
Mahout's goal is to create an environment for quickly creating
machine-learning applications that scale and run on the highest-performance
parallel computation engines available. Mahout comprises an interactive
environment and library that support generalized scalable linear algebra
and include many modern machine-learning algorithms. This release ships
major changes since 0.12.2, including computation on GPUs and a simplified
framework for building new algorithms.

To get started with Apache Mahout 0.13.0, download the release artifacts
and signatures from http://www.apache.org/dist/mahout/0.13.0/.

Many thanks to the contributors and committers who were part of this
release.


RELEASE HIGHLIGHTS

Mahout-Samsara has implementations for these generalized concepts:

* In-core matrices backed by ViennaCL [3], providing speedups of up to an
order of magnitude in some cases
* A JavaCPP bridge to native/GPU operations in ViennaCL
* Distributed GPU Matrix-Matrix and Matrix-Vector multiplication on Spark
* Distributed OpenMP Matrix-Matrix and Matrix-Vector multiplication on Spark
* GPU-backed support for both sparse and dense matrices
* Fault tolerance: the new solvers fall back to their Mahout JVM
counterparts when a GPU or OpenMP operation fails
* A new scikit-learn-like framework for algorithms, with the goal of
creating a consistent API across machine-learning algorithms and an
orderly package structure grouping regression, classification,
clustering, and pre-processing algorithms together
* New DRM wrappers in Spark Bindings, making it more convenient to create
DRMs from MLLib RDDs and DataFrames (see the sketch after this list)
* MahoutConversions adds Scala-like compatibility to Vectors, introducing
methods such as toArray() and toMap()
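
As a taste of the Samsara environment, here is a minimal sketch of
creating a distributed row matrix (DRM) and running a distributed
multiply. It assumes the Mahout Spark shell (bin/mahout spark-shell),
where the standard imports and the implicit distributed context are
already in scope:

  // In-core matrix via the Samsara Scala DSL.
  val inCoreA = dense((1, 2, 3), (3, 4, 5), (5, 6, 7))

  // Distribute it as a DRM across the cluster.
  val drmA = drmParallelize(inCoreA, numPartitions = 2)

  // Distributed transpose-times-self; this dispatches to the
  // ViennaCL/OpenMP solvers when available and falls back to the
  // JVM implementation otherwise.
  val drmAtA = drmA.t %*% drmA

  // Pull the result back in-core to inspect it.
  val inCoreAtA = drmAtA.collect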

Mahout has historically focused on highly scalable algorithms; having
moved on from MapReduce-based jobs, it now includes these
Mahout-Samsara-based implementations:

* Distributed and in-core Stochastic Singular Value Decomposition (SSVD;
see the example after this list)
* Distributed Principal Component Analysis (PCA)
* Distributed and in-core QR Reduction (QR)
* Distributed Alternating Least Squares (ALS)
* Collaborative Filtering: Item and Row Similarity based on cooccurrence
and supporting multimodal user actions
* Distributed Naive Bayes Training and Classification
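
For example, the distributed SSVD is a one-line call over a DRM. A minimal
sketch, reusing drmA from the sketch above and assuming the Samsara
decompositions package (org.apache.mahout.math.decompositions._) is
imported:

  // Rank-10 stochastic SVD: p is the oversampling parameter and q the
  // number of power iterations (parameter names per the Samsara API).
  val (drmU, drmV, s) = dssvd(drmA, k = 10, p = 15, q = 1)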


STATS

A total of 62 separate JIRA issues are addressed in this release [1].


GETTING STARTED

Download the release artifacts and signatures at
https://mahout.apache.org/general/downloads.html. The examples directory
contains several working examples of the core functionality available in
Mahout. These can be run via scripts in the examples/bin directory. Most
examples do not need a Hadoop cluster in order to run.


FUTURE PLANS

0.13.1

As the project moves towards a 0.13.1 release, we are working on the
following:

* Further Native Integration for increased speedups
* JCuda backing for In-core Matrices and CUDA solvers
* Enumeration across multiple GPUs per JVM instance on a given machine
* GPU/OpenMP Acceleration for linear solvers
* Further integration with other libraries such as MLLib and SparkML
* Scala 2.11 Support
* Spark 2.x Support
* Incorporate more statistical operations
* Runtime probing of available hardware, with caching of the correct or
most optimal solver choice

Post-0.13.1

We already see the need for work in these areas:

* Support for native iterative solvers
* A more robust algorithm library
* Smarter probing and optimization of multiplications

ACKNOWLEDGMENTS

Many thanks to Karl Rupp of the ViennaCL [3] project for his help pushing
the bindings through over many email exchanges; we greatly appreciate his
involvement. Many thanks as well to Samuel Audet of the JavaCPP [4]
project for his help with the team's use of JavaCPP to build the JNI layer
behind the native GPU and OpenMP bindings, which was also key to this
major release.


CONTRIBUTING

If you are interested in contributing, please see our How to Contribute [2]
page or contact us via email at d...@mahout.apache.org.


CREDITS

As with every release, we wish to thank all of the users and contributors
to Mahout. Please see the JIRA Release Notes [1] for individual credits,
as there are too many to list here.


KNOWN ISSUES:

* The classify-wikipedia.sh example has an outdated link to the data
files. A workaround is to change the download section of the script to:

  curl https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p002336425p003046511.bz2 \
    -o ${WORK_DIR}/wikixml/enwiki-latest-pages-articles.xml.bz2

* Currently GPU acceleration for supported operations is limited to a
single JVM instance
* Occasional segfault with certain GPU models and computations
* On older GPUs some tests fail when building ViennaCL due to card
limitations
* Currently automatic probing of a system’s hardware happens at each
supported operation, adding some overhead


[1]

Re: Lambda and Kappa CCO

2017-04-17 Thread Pat Ferrel
Ted thinks this can be done with DBs alone. What I proposed keeps the model
in DBs like Solr/Elasticsearch and a persistent event cache (HBase,
Cassandra, etc.), but uses in-memory models for faster indicator
calculations, leading to mutable model updates in ES/Solr. One primary
reason for kappa over lambda is items with short life spans or rapidly
changing catalogs, things like news.

The other point for online learning is the volume of data that must be
stored and re-processed. Kappa only deals with small incremental changes,
so its resource cost will be much smaller than lambda's, especially for
slowly changing models, where most updates will be no-ops.

In any case, in kappa there would be no explicit matrix or vector
multiply. If we use in-memory data structures, I doubt they would be
Mahout ones.




On Apr 9, 2017, at 3:43 PM, Trevor Grant  wrote:

Specifically, I hacked together a Lambda Streaming CCO with Spark and Flink
for a demo for my upcoming FlinkForward talk.  Will post code once I finish
it / strip all my creds out. In short, the lack of serialization in Mahout
in-core vectors/matrices makes handing them off somewhat tedious.
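
One common workaround, sketched here under the assumption that
org.apache.mahout.math.VectorWritable (from the mahout-hdfs module) is on
the classpath, is to shuttle in-core vectors through their Hadoop Writable
wrapper:

  import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
    DataInputStream, DataOutputStream}
  import org.apache.mahout.math.{DenseVector, VectorWritable}

  // Serialize a Mahout in-core vector by hand via its Writable wrapper.
  val v = new DenseVector(Array(1.0, 2.0, 3.0))
  val out = new ByteArrayOutputStream()
  new VectorWritable(v).write(new DataOutputStream(out))

  // Deserialize on the receiving side.
  val in = new DataInputStream(new ByteArrayInputStream(out.toByteArray))
  val readBack = new VectorWritable()
  readBack.readFields(in)
  val v2 = readBack.get()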


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Sun, Apr 9, 2017 at 5:39 PM, Andrew Palumbo  wrote:

> Pat-
> 
> What can we do from the Mahout side? Would we need any new data
> structures? Trevor and I were just discussing some of the troubles of
> near-real-time matrix streaming.
> --
> *From:* Pat Ferrel 
> *Sent:* Monday, March 27, 2017 2:42:55 PM
> *To:* Ted Dunning; user@mahout.apache.org
> *Cc:* Trevor Grant; Ted Dunning; s...@apache.org
> *Subject:* Re: Lambda and Kappa CCO
> 
> Agreed. Downsampling was ignored in several places, and with it a great
> deal of input becomes a no-op. Without downsampling, too many things need
> to change.
> 
> Also, everything depends on this rather vague sentence: “- determine if
> the new interaction element cross-occurs with A and if so calculate the
> llr score”, which needs a lot more explanation. Whether to use Mahout
> in-memory objects or reimplement some in high-speed data structures is a
> big question.
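> 
> For reference, the llr score in question is the standard log-likelihood
> ratio over a 2x2 co-occurrence contingency table. A minimal Scala sketch
> mirroring Mahout's LogLikelihood.logLikelihoodRatio (the helper names
> here are illustrative):
> 
>   // Unnormalized entropy helpers over raw counts.
>   def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)
>   def entropy(counts: Long*): Double =
>     xLogX(counts.sum) - counts.map(xLogX).sum
> 
>   // k11: both events co-occur; k12/k21: one event only; k22: neither.
>   def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
>     val rowEntropy = entropy(k11 + k12, k21 + k22)
>     val colEntropy = entropy(k11 + k21, k12 + k22)
>     val matEntropy = entropy(k11, k12, k21, k22)
>     math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
>   }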
> 
> The good thing I noticed in writing this is that model updates and
> real-time serving can be arbitrarily far apart, so the system degrades
> gracefully. During high load it may fall behind, but as long as user
> behavior is up-to-date and persisted (it will be) we are still in pretty
> good shape.
> 
> 
> On Mar 26, 2017, at 6:26 PM, Ted Dunning  wrote:
> 
> 
> I think that this analysis omits the fact that one user interaction causes
> many cooccurrences to change.
> 
> This becomes feasible if you include the effect of down-sampling, but that
> has to be in the algorithm.
> 
> 
> From: Pat Ferrel 
> Sent: Saturday, March 25, 2017 12:01:00 PM
> To: Trevor Grant; user@mahout.apache.org
> Cc: Ted Dunning; s...@apache.org
> Subject: Lambda and Kappa CCO
> 
> This is an overview and proposal for turning the multi-modal Correlated
> Cross-Occurrence (CCO) recommender from a Lambda-style learner into an
> online, incrementally updated, Kappa-style streaming learner.
> 
> # The CCO Recommender: Lambda-style
> 
> We have largely solved the problems of calculating the multi-modal
> Correlated Cross-Occurrence models and serving recommendations in real
> time from real-time user behavior. The model sits in Lucene (Elasticsearch
> or Solr) in a scalable way, and the typical query, which produces
> personalized recommendations from real-time user behavior, completes with
> 25ms latency.
> 
> # CCO Algorithm
> 
> A = rows are users, columns are items they have “converted” on (purchase,
> read, watch). A represents the conversion event, the interaction that you
> want to recommend.
> B = rows are users, columns are items that the user has shown some
> preference for, but not necessarily the same items as in A. B represents
> a different interaction than A. B might be a preference for some category,
> brand, genre, or just a detailed item page view, or all of these in B, C,
> D, etc.
> h_a = a particular user’s history of A-type interactions, a vector of
> items that our user converted on.
> h_b = a particular user’s history of B-type interactions, a vector of
> items that our user had B-type interactions with.
> 
> CCO says:
> 
> [A’A]h_a + [A’B]h_b + [A’C]h_c = r, where r is the vector of weighted
> items from A that represents personalized recommendations for our
> particular user.
> 
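> To make the algebra concrete, here is a minimal in-core sketch in the
> Samsara Scala DSL (the matrix contents are toy values; the real pipeline
> also applies LLR filtering and downsampling before scoring):
> 
>   // Toy interaction matrices: 3 users x 3 items each.
>   val A = dense((1, 0, 1), (0, 1, 0), (1, 1, 0)) // conversions
>   val B = dense((0, 1, 1), (1, 0, 0), (0, 1, 1)) // secondary action
> 
>   // Indicator matrices from the CCO formula.
>   val AtA = A.t %*% A
>   val AtB = A.t %*% B
> 
>   // One user's histories over A's and B's item spaces.
>   val hA = dvec(1, 0, 0)
>   val hB = dvec(0, 1, 0)
> 
>   // Weighted recommendation scores over the items of A.
>   val r = AtA %*% hA + AtB %*% hB
> 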
> The innovation here is that A, B, C, … represent multi-modal data:
> interactions of all types, on item-sets of arbitrary types. In other
> words, we can look at virtually any action or possible indicator of user
> preference or taste. We strengthen the above raw