[ https://issues.apache.org/jira/browse/FLINK-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360390#comment-14360390 ]
ASF GitHub Bot commented on FLINK-1695: --------------------------------------- Github user vasia commented on a diff in the pull request: https://github.com/apache/flink/pull/479#discussion_r26387939 --- Diff: docs/ml/alternating_least_squares.md --- @@ -0,0 +1,157 @@ +--- +mathjax: include +title: Alternating Least Squares +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +* This will be replaced by the TOC +{:toc} + +## Description + +The alternating least squares (ALS) algorithm factorizes a given matrix $R$ into two factors $U$ and $V$ such that $R \approx U^TV$. +The unknown row dimension is given as a parameter to the algorithm and is called latent factors. +Since matrix factorization can be used in the context of recommendation, the matrices $U$ and $V$ can be called user and item matrix, respectively. +The $i$th column of the user matrix is denoted by $u_i$ and the $i$th column of the item matrix is $v_i$. +The matrix $R$ can be called the ratings matrix with $$(R)_{i,j} = r_{i,j}$$. + +In order to find the user and item matrix, the following problem is solved: + +$$\arg\min_{U,V} \sum_{\{i,j\mid r_{i,j} \not= 0\}} \left(r_{i,j} - u_{i}^Tv_{j}\right)^2 + +\lambda \left(\sum_{i} n_{u_i} \left\lVert u_i \right\rVert^2 + \sum_{j} n_{v_j} \left\lVert v_j \right\rVert^2 \right)$$ + +with $\lambda$ being the regularization factor, $$n_{u_i}$$ being the number of items the user $i$ has rated and $$n_{v_j}$$ being the number of times the item $j$ has been rated. +This regularization scheme to avoid overfitting is called weighted-$\lambda$-regularization. +Details can be found in the work of [Zhou et al.](http://dx.doi.org/10.1007/978-3-540-68880-8_32). + +By fixing one of the matrices $U$ or $V$, we obtain a quadratic form which can be solved directly. +The solution of the modified problem is guaranteed to monotonically decrease the overall cost function. +By applying this step alternately to the matrices $U$ and $V$, we can iteratively improve the matrix factorization. + +The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where $i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$. + + +## Parameters + +The alternating least squares implementation can be controlled by the following parameters: + + <table class="table table-bordered"> + <thead> + <tr> + <th class="text-left" style="width: 20%">Parameters</th> + <th class="text-center">Description</th> + </tr> + </thead> + + <tbody> + <tr> + <td><strong>NumFactors</strong></td> + <td> + <p> + The number of latent factors to use for the underlying model. + It is equivalent to the dimension of the calculated user and item vectors. + (Default value: <strong>10</strong>) + </p> + </td> + </tr> + <tr> + <td><strong>Lambda</strong></td> + <td> + <p> + Regularization factor. Tune this value in order to avoid overfitting or poor performance due to strong generalization. + (Default value: <strong>1</strong>) + </p> + </td> + </tr> + <tr> + <td><strong>Iterations</strong></td> + <td> + <p> + The maximum number of iterations. + (Default value: <strong>10</strong>) + </p> + </td> + </tr> + <tr> + <td><strong>Blocks</strong></td> + <td> + <p> + The number of blocks into which the user and item matrix a grouped. --- End diff -- *are grouped > Create machine learning library > ------------------------------- > > Key: FLINK-1695 > URL: https://issues.apache.org/jira/browse/FLINK-1695 > Project: Flink > Issue Type: Improvement > Reporter: Till Rohrmann > Assignee: Till Rohrmann > > Create the infrastructure for Flink's machine learning library. This includes > the creation of the module structure and the implementation of basic types > such as vectors and matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332)