Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Collaborative Filtering with AlS-WR 
(https://cwiki.apache.org/confluence/display/MAHOUT/Collaborative+Filtering+with+AlS-WR)

Added by Sebastian Schelter:
---------------------------------------------------------------------
Collaborative Filtering using Alternating Least Squares

h2. Algorithm

h3. Problem setting

In Collaborative Filtering we want to predict which items that a user has not 
yet seen might be highly preferred by him. We assume that users give explicit 
ratings of fixed scale (say from 1 to 5). We collect all these ratings in a 
matrix A. Only a minority of the cells of A is filled. In order to produce 
recommendations we need to find a way to predict the rating of a user to an 
item he has not yet seen. This corresponds to filling the blanks in A. We will 
demonstrate the approach on a toy example.


*rating matrix A (users x items)*

| 5.00 | 5.00 | 2.00 | \- |
| 2.00 | \- | 3.00 | 5.00 |
| \- | 5.00 | \- | 3.00 |
| 3.00 | \- | \- | 5.00 |


h3. The latent factor approach

A very successfull approach to this problem are the so called _latent factor 
models_. These model the users and items as points a k-dimensional feature 
space. An unknown rating can than simply be estimated by the dot product 
between the corresponding user and item feature vectors.

Mathematically spoken, we decompose *A* into two other matrices *U* and *M* 
whose combination is a good approximation of *A* 

*user feature matrix U (users x features)*

|1.12|1.49|0.48|
|1.31|-0.52|0.59|
|1.13|0.67|-0.52|
|1.39|0.05|0.45|

*item feature matrix M (items x features)*

|1.81|1.62|0.74|
|2.66|1.71|-1.08|
|1.73|-0.23|0.78|
|3.16|-0.24|0.90|

We can than compute an approximation *A_k* of A which has all cells filled. 
Each previously empty cell now contains a predicted rating.

*prediction matrix A_k (users x items)*

*A_k = UM'*

|4.78|4.98|1.97|3.61|
|1.98|1.97|2.85|4.81|
|2.75|4.71|1.40|2.94|
|2.94|3.32|2.75|4.79|


h3. Finding the decomposition

Mahout uses _Alternating Least Squares with Weighted Lambda-Regularization_ to 
find the decomposition. Please refer to [Large-scale Parallel Collaborative 
Filtering for the Netflix 
Prize|http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf]
 (PDF) for details of the algorithm.

h3. Taking this approach into production

In order to successfully applies this algorithm, a regularization parameter 
_lambda_ and the number of iterations to run have to be determined. Currently 
users have to figure these out manually using holdout tests.

Mahout includes a toy example in 
[examples/bin/factorize-movielens-1M.sh|http://svn.apache.org/repos/asf/mahout/trunk/examples/bin/factorize-movielens-1M.sh]
 that shows how to do a simple holdout test on the Movielens 1 million ratings 
dataset. It also shows how to compute top-N recommendations from the resulting 
factorization:

h3. Hints

Please be aware that ALS-WR is an iterative algorithm. Iterative algorithms 
currently show poor performance on Hadoop caused by the enormous overhead of 
scheduling and checkpointing each single iteration.


Change your notification preferences: 
https://cwiki.apache.org/confluence/users/viewnotifications.action

Reply via email to