Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT) Page: Collaborative Filtering with AlS-WR (https://cwiki.apache.org/confluence/display/MAHOUT/Collaborative+Filtering+with+AlS-WR)
Edited by Sebastian Schelter: --------------------------------------------------------------------- h3. Problem setting In Collaborative Filtering we want to predict which items that a user has not yet seen might be highly preferred by him. We assume that users give explicit ratings on a fixed scale (say from 1 to 5). We collect all these ratings in a matrix A. Only a minority of the cells of A is filled. In order to produce recommendations we need to find a way to predict the rating of a user to an item he has not yet seen. This corresponds to filling the blanks in A. We will demonstrate the approach on a toy example. *rating matrix A (users x items)* | 5.00 | 5.00 | 2.00 | \- | | 2.00 | \- | 3.00 | 5.00 | | \- | 5.00 | \- | 3.00 | | 3.00 | \- | \- | 5.00 | h3. The latent factor approach A very successful approach to this problem are the so called _latent factor models_. These model the users and items as points in a k-dimensional feature space. An unknown rating can than simply be estimated by taking the dot product between the corresponding user and item feature vectors. Mathematically spoken, we decompose *A* into two other matrices *U* and *M* whose combination is a good approximation of *A* *user feature matrix U (users x features)* |1.12|1.49|0.48| |1.31|-0.52|0.59| |1.13|0.67|-0.52| |1.39|0.05|0.45| *item feature matrix M (items x features)* |1.81|1.62|0.74| |2.66|1.71|-1.08| |1.73|-0.23|0.78| |3.16|-0.24|0.90| We can than compute an approximation *A_k* of A which has all cells filled. Each previously empty cell now contains a predicted rating. *prediction matrix A_k (users x items)* *A_k = UM'* |4.78|4.98|1.97|3.61| |1.98|1.97|2.85|4.81| |2.75|4.71|1.40|2.94| |2.94|3.32|2.75|4.79| h3. Finding the decomposition Mahout uses _Alternating Least Squares with Weighted Lambda-Regularization_ to find the decomposition. Please refer to [Large-scale Parallel Collaborative Filtering for the Netflix Prize|http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf] (PDF) for details of the algorithm. h3. Taking this approach into production In order to successfully apply this algorithm, a regularization parameter _lambda_ and the number of iterations to run have to be determined. Currently you have to figure these out manually using holdout tests. Mahout includes a toy example in [examples/bin/factorize-movielens-1M.sh|http://svn.apache.org/repos/asf/mahout/trunk/examples/bin/factorize-movielens-1M.sh] that shows how to do a simple holdout test on the Movielens 1 million ratings dataset. It also shows how to compute top-N recommendations from the resulting factorization: h3. Hints Please be aware that ALS-WR is an iterative algorithm. Iterative algorithms currently show poor performance on Hadoop caused by the enormous overhead of scheduling and checkpointing each single iteration. Change your notification preferences: https://cwiki.apache.org/confluence/users/viewnotifications.action
