Well, in LSI it is ok to do that, as a missing entry means that the
document contains zero occurrences of a given term, which is totally fine.

In Collaborative Filtering with explicit feedback, a missing rating is
not automatically a rating of zero; it is simply unknown what rating the
user would give.
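
To make that concrete, here is a minimal sketch (NumPy, variable names
are my own) of how an explicit-feedback ALS step fits one user's factor
vector against only the ratings that user has actually given, instead of
filling in the unknowns with zeros:

import numpy as np

def update_user_explicit(M, rated_items, ratings, lam=0.05):
    # M: item-feature matrix (n_items x k); rated_items: indices of the
    # items this user rated; ratings: the corresponding known ratings
    M_u = M[rated_items]                  # keep only the observed items
    k = M.shape[1]
    A = M_u.T @ M_u + lam * np.eye(k)     # normal equations, regularized
    b = M_u.T @ ratings
    return np.linalg.solve(A, b)          # least-squares user vector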

For implicit data (number of interactions), a missing entry is indeed
zero, but in most cases you might not have the same confidence in that
observation as if you had observed an interaction. Koren's ALS paper on
implicit feedback discusses this and introduces confidence weights to
handle it, putting more weight on minimizing the loss over observed
interactions.
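
A rough sketch of that confidence weighting for a single user's update
(NumPy again; alpha and lam are just assumed example values, not taken
from the paper or from Mahout):

import numpy as np

def update_user_implicit(Y, r_u, alpha=40.0, lam=0.1):
    # Y: item-factor matrix (n_items x k); r_u: this user's interaction counts
    p_u = (r_u > 0).astype(float)         # preference: interacted or not
    c_u = 1.0 + alpha * r_u               # confidence grows with interactions
    k = Y.shape[1]
    A = Y.T @ (c_u[:, None] * Y) + lam * np.eye(k)   # Y' C_u Y + lam I
    b = Y.T @ (c_u * p_u)                             # Y' C_u p_u
    return np.linalg.solve(A, b)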

In matrix factorization for CF, the factorization usually has to
minimize the regularized loss over the known entries only. If all
unknown entries were simply considered zero, I'd assume that the
resulting factorization would not generalize very well to unseen data.
Some researchers refer to matrix factorization for CF as matrix
completion, which IMHO better describes the problem.
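
As a small illustration (NumPy, names assumed), the objective only sums
the squared error at the observed entries:

import numpy as np

def regularized_loss(R, known_mask, U, M, lam=0.05):
    # R: rating matrix; known_mask: 1 where a rating is known, 0 elsewhere
    E = known_mask * (R - U @ M.T)        # errors only where ratings exist
    return (E ** 2).sum() + lam * ((U ** 2).sum() + (M ** 2).sum())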

On 25.03.2013 12:14, Ted Dunning wrote:
> Well, actually, you can.
> 
> LSI does exactly that.
> 
> What the effect is of doing this is not clear to me.  Do you know what
> happens if you assume missing values are 0?
> 
> On Mon, Mar 25, 2013 at 12:10 PM, Sebastian Schelter <s...@apache.org> wrote:
> 
>> I think one crucial point is missing from this discussion. In
>> Collaborative Filtering the matrix to decompose is only partially
>> defined. A missing entry cannot simply be substituted by 0, it is
>> unknown. Therefore, you cannot use standard SVD techniques as you would
>> use for LSI for example.
>>
>> On 25.03.2013 11:46, Sean Owen wrote:
>>> Points from across several e-mails --
>>>
>>> The initial item-feature matrix can be just random unit vectors too. I
>>> have slightly better results with that.
>>>
>>> You are finding the least-squares solution of A = U M' for U given A
>>> and M. Yes you can derive that analytically as the zero of the
>>> derivative of the error function.
>>>
>>> With m users and n items, and k features where k = n, I suppose
>>> you don't need any iterations at all since there is a trivial
>>> solution: U = A, M = I(n) (the nxn identity matrix). You would not
>>> find this on the first iteration, however, if you followed the
>>> algorithm, because you would be starting from some random starting
>>> point. But if you initialized M to the identity matrix, yes you'd find
>>> the exact solution immediately.
>>>
>>> Yes it is an iterative algorithm and you have to define a convergence
>>> criterion. I use average absolute difference in (U M') entries from
>>> one iteration to the next. (Well, a sample.) It's possible that you
>>> reach your criterion in 1 iteration, or not. It depends on the
>>> criterion. Usually when you restart ALS on updated data, you use the
>>> previous M matrix as a starting point. If the change in data is small,
>>> one iteration should usually leave you still "converged" actually.
>>> But, from random starting point -- not typical.
>>>
>>> ALS is similar to SVD only in broad terms. The SVD is not always used
>>> to make a low-rank factorization, and its outputs carry more
>>> guarantees -- they are orthonormal bases because it has factored out
>>> scaling factors into the third matrix Sigma. I think of ALS as more
>>> simplistic and therefore possibly faster. With k features I believe
>>> (?) the SVD is necessarily a k-iteration process at least, whereas ALS
>>> might be of use after 1 iteration. The SVD is not a "shortcut" for
>>> ALS. If you go to the trouble of a full SVD, you can certainly use
>>> that factorization as-is though.
>>>
>>> You need regularization.
>>>
>>>
>>> It should be pointed out that the "ALS" often spoken of here is not
>>> one that factors the input matrix A. There's a variant, which I have
>>> had good results with, for 'implicit' feedback. There, you are
>>> actually factoring the matrix P = (1 : A != 0, 0 : A == 0), and using
>>> values in A as weights in the loss function. You're reconstructing
>>> "interacted or not" and using input value as a confidence measure.
>>> This "works" for ratings although the interpretation in this variant
>>> doesn't line up with the nature of ratings. It works quite nicely for
>>> things like clicks, etc.
>>>
>>> (Much more can be said on this point.)
>>>
>>>
>>>
>>> On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner <cont...@dhuebner.com>
>> wrote:
>>>> It's quite hard for me to get the mathematical concepts of the ALS
>>>> recommenders. It would be great if someone could help me to figure out
>>>> the details. This is my current status:
>>>>
>>>> 1. The item-feature (M) matrix is initialized using the average ratings
>>>> and random values (explicit case)
>>>>
>>>> 2. The user-feature (U) matrix is solved using the partial derivative of
>>>> the error function with respect to u_i (the columns or row vectors of U)
>>>>
>>>> Suppose we use as many features as there are items, and the error
>>>> function does not use any regularization. Would U be solved within the
>>>> first iteration? If not, I do not understand why more than one iteration
>>>> is needed.
>>>> Furthermore, I believe I have understood that using fewer features than
>>>> items, and also applying regularization, does not allow U to be solved
>>>> in a way that meets the stopping criterion after only one iteration.
>>>> Thus, iteration is required to gradually converge to the stopping
>>>> criterion.
>>>>
>>>> I hope I have pointed out my problems clearly enough.
>>>>
>>
>>
> 
