Re: alternating least squares

2013-01-09 Thread Koobas
On Wed, Jan 9, 2013 at 12:40 AM, Sean Owen sro...@gmail.com wrote:

 I think the model you're referring to can use explicit or implicit
 feedback. It's using the values -- however they are derived -- as
 weights in the loss function rather than values to be approximated
 directly. So you still use P even with implicit feedback.

 Of course you can also use ALS to factor R directly if you wanted, also.

Yes, I see it now.
It is weighted regression, whether explicit or implicit data.
Thank you so much.
I think I finally got the picture.




Re: alternating least squares

2013-01-09 Thread Koobas
Drilling just a bit more.
If I just use simple Tikhonov regularization,
I set both lambdas to identity, and iterate like this (MATLAB):

rank = 50;
X = rand(size(A,1), rank);           % random initial user factors
for i=1:6,
  Y = inv(X'*X + eye(rank)) * X'*A;  % update item factors (rank x n)
  X = A*Y' * inv(Y*Y' + eye(rank));  % update user factors (m x rank)
end

Now, can I use weighted regularization and preserve the matrix notation?
Because it seems to me that I have to go one row of X (one column of Y) at a
time.
Is that really so, or am I missing something?




Re: alternating least squares

2013-01-09 Thread Sean Owen
Yes, Y' * Y becomes Y' * C * Y, where C is a diagonal matrix of
weights. See the preso I sent over for where it shows up.
See the paper I referenced there for how you still compute this
sparsely -- the way you've written it here will be dense and won't
scale.
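
For concreteness, here is a rough Octave sketch of one user's update with
Hu/Koren-style confidence weights c = 1 + alpha * r. The sizes, alpha and
lambda below are illustrative assumptions, not Mahout's code, and Y is taken
as n x k (items by features) so that Y' * C * Y reads literally as above:

n = 1000; k = 50; alpha = 40; lambda = 0.1;      % illustrative values
Y   = rand(n, k);                                % fixed item factors
r_u = sparse(1, n); r_u([3 17 512]) = [2 1 5];   % one user's interactions
YtY = Y' * Y;                                    % k x k, computed once per sweep
idx = find(r_u);                                 % items this user touched
Yu  = Y(idx, :);                                 % rows of the observed items
cu  = 1 + alpha * full(r_u(idx))';               % confidences, |idx| x 1
% Y'*Cu*Y = Y'*Y + Yu'*diag(cu - 1)*Yu, since Cu - I is zero away from the
% observed items, so each user only touches the rows they interacted with.
M   = YtY + Yu' * diag(cu - 1) * Yu + lambda * eye(k);
x_u = M \ (Yu' * cu);                            % new k x 1 user factor

The dense k x k term Y'*Y is shared by every user, which keeps the per-user
cost at roughly O(n_u * k^2 + k^3) instead of touching all n items.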



Re: alternating least squares

2013-01-08 Thread Sebastian Schelter
Hi Koobas,

We have two classes that implement the solutions described in the
ALS-related papers:

For explicit feedback data [1] we have AlternatingLeastSquaresSolver,
for implicit feedback data [2] we have
ImplicitFeedbackAlternatingLeastSquaresSolver. Both can be found in the
org.apache.mahout.math.als package.

Internally they use Mahout's QRDecomposition to solve the linear systems
associated with ALS.

Best,
Sebastian

[1]
http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf
[2] http://research.yahoo.com/pub/2433


On 08.01.2013 21:53, Koobas wrote:
 Perhaps somebody can shed some light on the topic.
 What algorithm is used to solve the least squares problem
 when computing low-rank approximation using Alternating Least Squares?
 



Re: alternating least squares

2013-01-08 Thread Koobas
I am trying to wrap my head around it.
From the Mahout documentation it looks like it's a standard (dense) QR with
Householder reflectors.
But the user-item matrix is usually extremely sparse.
So, is the sparsity somehow taken into account,
or are the sparse right-hand-side vectors packed in dense storage and hit
with Householder?
The underlying question is the computational complexity, i.e., the number of
floating point operations involved.




Re: alternating least squares

2013-01-08 Thread Ted Dunning
Can you refer to which documentation you are looking at?

ALS is more like a block gradient descent SVD than like a QR.  There are
relationships in each step, but I don't think that they are identical.

Others can comment more authoritatively.



Re: alternating least squares

2013-01-08 Thread Sebastian Schelter
Koobas,

ALS doesn't apply QR decomposition to the user-item-matrix. It
factorizes this matrix into two smaller matrices which contain the user
and item features.

Each user-feature vector is modeled as a linear combination of the
feature vectors of the items which the user rated (and vice versa for
the item-feature vectors).

This factorization is iteratively refined. In each iteration, ALS first
fixes the item-feature vectors and solves a least-squares problem for
each user and then fixes the user-feature vectors and solves a
least-squares problem for each item.
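
For illustration, one such per-user solve for explicit feedback might look
roughly like the Octave sketch below, with a small Tikhonov term added as the
papers do. The sizes, lambda and toy ratings are assumptions for the example;
this is a sketch, not Mahout's code:

n = 1000; k = 50; lambda = 0.05;                 % illustrative values
Y   = rand(n, k);                                % fixed item-feature vectors
r_u = sparse(1, n); r_u([3 17 512]) = [4 2 5];   % one user's ratings
idx = find(r_u);                                 % items this user rated
Yu  = Y(idx, :);                                 % feature rows of rated items
% Regularized normal equations for this user's feature vector:
x_u = (Yu' * Yu + lambda * eye(k)) \ (Yu' * full(r_u(idx))');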

I suggest that you read the papers I referred to in my last mail, which
explain the algorithm in detail.

Best,
Sebastian




Re: alternating least squares

2013-01-08 Thread Mohit Singh
   Pseudo code for ALS in Octave (here n is the number of latent features,
   i.e. the rank, not the item count):

   Users = rand(m, n);
   for i = 1:iterations
      Items = A \ Users;    % least-squares solve for the item factors, Users fixed
      Users = A' \ Items;   % least-squares solve for the user factors, Items fixed
   endfor

   Start with a random guess of the matrix Users (m x n), then compute the
   matrix Items, which is a least-squares solution given Users, then
   recompute Users, which is a least-squares solution given Items, and
   iterate until convergence.




-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Re: alternating least squares

2013-01-08 Thread Sean Owen
That's right, though 80% of the algorithm is in how you efficiently do
A \ Users, and in doing so in the presence of an overfitting
(regularization) term. You can definitely do it sparsely when the loss
function is set up cleverly.

I tried to explain it here, from slide 6:
http://www.slideshare.net/srowen/big-practical-recommendations-with-alternating-least-squares



Re: alternating least squares

2013-01-08 Thread Ted Dunning
This particular part of the algorithm can be seen as similar to a least
squares problem that might normally be solved by QR.  I don't think that
the updates are quite the same, however.

On Tue, Jan 8, 2013 at 3:10 PM, Sebastian Schelter s...@apache.org wrote:

 This factorization is iteratively refined. In each iteration, ALS first
 fixes the item-feature vectors and solves a least-squares problem for
 each user and then fixes the user-feature vectors and solves a
 least-squares problem for each item.



Re: alternating least squares

2013-01-08 Thread Sean Owen
There's definitely a QR decomposition in there for me, since solving A
= X Y' for X gives X = A Y (Y' * Y)^-1, and you need some means to
compute the inverse of that (small) matrix.



Re: alternating least squares

2013-01-08 Thread Koobas
On Tue, Jan 8, 2013 at 6:41 PM, Sean Owen sro...@gmail.com wrote:

 There's definitely a QR decomposition in there for me since solving A
 = X Y' for X  is  X = A Y (Y' * Y)^-1  and you need some means to
 compute the inverse of that (small) matrix.


Sean,
I think I got it.
1) A Y is a handful of sparse matrix-vector products,
2) Y' Y is a dense matrix-matrix product of a flat matrix and a tall matrix,
producing a small square matrix,
3) inverting that matrix is not a big deal, since it is small.
Great!
Thanks!
It just was not immediately obvious to me at first look.

Now, the transition from ratings to 1s and 0s,
is this simply to handle implicit feedback,
or is this for some other reason?





Re: alternating least squares

2013-01-08 Thread Ted Dunning
But is it actually QR of Y?

On Tue, Jan 8, 2013 at 3:41 PM, Sean Owen sro...@gmail.com wrote:

 There's definitely a QR decomposition in there for me since solving A
 = X Y' for X  is  X = A Y (Y' * Y)^-1  and you need some means to
 compute the inverse of that (small) matrix.



Re: alternating least squares

2013-01-08 Thread Koobas
On Tue, Jan 8, 2013 at 7:18 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 But is it actually QR of Y?


Ted,
This is my understanding:
In the process of solving the least squares problem,
you end up inverting a small square matrix, i.e., forming (Y' * Y)^-1.
How that is done is largely irrelevant.
Since the matrix is square, one could do an LU factorization, a.k.a. Gaussian
elimination.
However, since we are talking about solving a roughly 100x100 problem,
one might as well do it with a QR factorization, which, unlike LU, is stable
no matter what.
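
For what it's worth, a small Octave sketch of that step (sizes and lambda are
illustrative): in the plain Tikhonov setting the k x k matrix is the same for
every user, so it can be factored once per sweep, with Cholesky or QR, and
reused for all the right-hand sides instead of forming the inverse explicitly.

n = 1000; k = 100; lambda = 0.05;     % illustrative sizes and weight
Y = rand(n, k);                       % fixed item factors, n x k
M = Y' * Y + lambda * eye(k);         % small k x k matrix, shared by all users
R = chol(M);                          % upper-triangular factor, R' * R = M
b = Y' * rand(n, 1);                  % one user's right-hand side (toy data)
x_u = R \ (R' \ b);                   % solves M * x_u = b without inv()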




Re: alternating least squares

2013-01-08 Thread Ted Dunning
Great.



Re: alternating least squares

2013-01-08 Thread Koobas
Okay, I got a little bit further in my understanding.
The matrix of ratings R is replaced with the binary matrix P.
Then R is used again in regularization.
I get it.
This takes care of the situations when you have user-item interactions,
but you don't have the rating.
So, it can handle explicit feedback, implicit feedback, and mixed (partial
/ missing feedback).
If I have implicit feedback, I just drop R altogether, right?

Now the only remaining trick is Tikhonov regularization,
which leads to a couple of questions:
1) How much of a problem is overfitting?
2) How do I pick lambda?
3) How do I pick the rank of the approximation in the first place?
How does the overfitting problem depend on the rank of the
approximation?





Re: alternating least squares

2013-01-08 Thread Sean Owen
I think the model you're referring to can use explicit or implicit
feedback. It's using the values -- however they are derived -- as
weights in the loss function rather than values to be approximated
directly. So you still use P even with implicit feedback.

Of course you can also use ALS to factor R directly if you wanted, also.

Overfitting is as much an issue as in any ML algorithm. Hard to
quantify it more than that, but you certainly don't want to use lambda
= 0.

The right value of lambda depends on the data -- depends even more on
what you mean by lambda! There are different usages in different
papers. More data means you need less lambda. The effective weight on
the overfitting / Tikhonov terms is about 1 in my experience -- these
terms should be weighted roughly like the loss function terms. But
that can mean using values for lambda much smaller than 1, since
lambda is just one multiplier of those terms in many formulations.

The rank has to be greater than the effective rank of the data (of
course). It's also something you have to fit to the data
experimentally. For normal-ish data sets of normal-ish size, the right
number of features is probably 20 - 100. I'd test in that range to
start.

More features tend to let the model overfit more, so in theory you
need more lambda with more features, all else equal.

It's *really* something you just have to fit to representative sample
data. The optimal answer is way too dependent on the nature,
distribution and size of the data to say more than the above.
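
Purely for illustration, a crude hold-out experiment along those lines in
Octave might look like the sketch below (toy data, a random split, and plain
Tikhonov-regularized ALS as in the earlier snippet in this thread; every
value here is an assumption, not a recommendation):

% Toy data: a random low-rank ratings matrix with a 90/10 train/test split.
m = 200; n = 300;
A = max(0, round(rand(m, 5) * rand(5, n) * 5));
mask    = rand(m, n) < 0.9;
A_train = A .* mask;
A_test  = A .* ~mask;

lambdas = [0.01 0.05 0.1 0.5 1];       % candidate regularization weights
ranks   = [20 50 100];                 % candidate feature counts
best_err = Inf;
for lambda = lambdas
  for k = ranks
    % Plain ALS with Tikhonov regularization, as in the earlier snippet.
    X = rand(m, k);
    for it = 1:10
      Y = (X'*X + lambda*eye(k)) \ (X'*A_train);
      X = (A_train*Y') / (Y*Y' + lambda*eye(k));
    end
    E = A_test - X*Y;                  % reconstruction error
    err = sum(E(A_test ~= 0).^2);      % measured on held-out entries only
    if err < best_err
      best_err = err; best_lambda = lambda; best_k = k;
    end
  end
end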

