Re: alternating least squares
On Wed, Jan 9, 2013 at 12:40 AM, Sean Owen sro...@gmail.com wrote:
> I think the model you're referring to can use explicit or implicit
> feedback. It's using the values -- however they are derived -- as weights
> in the loss function rather than values to be approximated directly. So
> you still use P even with implicit feedback. Of course you could also use
> ALS to factor R directly if you wanted.

Yes, I see it now. It is weighted regression, whether the data is explicit
or implicit. Thank you so much. I think I finally got the picture.
Re: alternating least squares
Drilling down just a bit more. If I just use simple Tikhonov
regularization, I set both lambdas to identity and iterate like this
(MATLAB):

  rank = 50;
  for i = 1:6
    Y = (X'*X + eye(rank)) \ (X'*A);   % update item factors
    X = (A*Y') / (Y*Y' + eye(rank));   % update user factors
  end

Now, can I use weighted regularization and preserve the matrix notation?
Because it seems to me that I have to go one row of X (one column of Y) at
a time. Is that really so, or am I missing something?
Re: alternating least squares
On Wed, Jan 9, 2013 at 1:28 PM, Koobas koo...@gmail.com wrote:
> Now, can I use weighted regularization and preserve the matrix notation?

Yes, Y' * Y becomes Y' * C * Y, where C is a diagonal matrix of weights.
See the preso I sent over for where it shows up, and see the paper I
referenced there for how you still compute this sparsely -- the way you've
written it here will be dense and won't scale.
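For concreteness, here is a minimal Octave sketch of the per-user weighted
solve, assuming the confidence weighting c_ui = 1 + alpha * r_ui from the
implicit-feedback paper [2]; the names (R, alpha, lambda) are illustrative,
not Mahout's:

  % R: m x n sparse interactions, Y: n x k item factors,
  % X: m x k user factors being updated.
  k = size(Y, 2);
  YtY = Y' * Y;                          % k x k, computed once per sweep
  for u = 1:size(R, 1)
    idx = find(R(u, :));                 % items this user interacted with
    cu  = 1 + alpha * full(R(u, idx))';  % confidences c_ui = 1 + alpha*r_ui
    Yu  = Y(idx, :);
    % Y'*C*Y = YtY + Yu'*(C - I)*Yu: only the touched items cost dense work
    M = YtY + Yu' * diag(cu - 1) * Yu + lambda * eye(k);
    b = Yu' * cu;                        % Y'*C*p(u), since p_ui = 1 on idx
    X(u, :) = (M \ b)';
  end

This is also why the weighted version leaves the single matrix formula: C
differs per user, so the one big solve splits into m small k x k solves,
one row of X at a time (and likewise per item for Y).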
Re: alternating least squares
Hi Koobas,

We have two classes that implement the solutions described in the
ALS-related papers: for explicit feedback data [1] we have
AlternatingLeastSquaresSolver, and for implicit feedback data [2] we have
ImplicitFeedbackAlternatingLeastSquaresSolver. Both can be found in the
org.apache.mahout.math.als package. Internally they use Mahout's
QRDecomposition to solve the linear systems associated with ALS.

Best,
Sebastian

[1] http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf
[2] http://research.yahoo.com/pub/2433

On 08.01.2013 21:53, Koobas wrote:
> Perhaps somebody can shed some light on the topic. What algorithm is used
> to solve the least squares problem when computing a low-rank
> approximation using Alternating Least Squares?
Re: alternating least squares
On Tue, Jan 8, 2013 at 4:03 PM, Sebastian Schelter s...@apache.org wrote:
> Internally they use Mahout's QRDecomposition to solve the linear systems
> associated with ALS.

I am trying to wrap my head around it. From the Mahout documentation it
looks like a standard (dense) QR with Householder reflectors. But the
user-item matrix is usually extremely sparse. So, is the sparsity somehow
taken into account, or are the sparse right-hand-side vectors packed into
dense storage and hit with Householder? The underlying question is the
computational complexity, i.e., the number of floating point operations
involved.
Re: alternating least squares
Can you point to which documentation you are looking at? ALS is more like
a block gradient descent SVD than like a QR. There are relationships in
each step, but I don't think they are identical. Others can comment more
authoritatively.

On Tue, Jan 8, 2013 at 2:55 PM, Koobas koo...@gmail.com wrote:
> From the Mahout documentation it looks like a standard (dense) QR with
> Householder reflectors.
Re: alternating least squares
Koobas,

ALS doesn't apply a QR decomposition to the user-item matrix. It
factorizes this matrix into two smaller matrices which contain the user
and item features. Each user-feature vector is modeled as a linear
combination of the feature vectors of the items which the user rated (and
vice versa for the item-feature vectors). This factorization is
iteratively refined: in each iteration, ALS first fixes the item-feature
vectors and solves a least-squares problem for each user, and then fixes
the user-feature vectors and solves a least-squares problem for each item.
I suggest that you read the papers I referred to in my last mail, which
explain the algorithm in detail.

Best,
Sebastian
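To make "solves a least-squares problem for each user" concrete, here is a
minimal Octave sketch of one explicit-feedback half-step, using the
lambda * n_u regularization weighting from [1]; the variable names are
illustrative:

  % R: m x n sparse ratings, Y: n x k item factors, X: m x k user factors.
  k = size(Y, 2);
  for u = 1:size(R, 1)
    idx = find(R(u, :));            % items rated by user u
    n_u = numel(idx);
    Yu  = Y(idx, :);                % n_u x k
    ru  = full(R(u, idx))';         % n_u x 1 observed ratings
    X(u, :) = ((Yu' * Yu + lambda * n_u * eye(k)) \ (Yu' * ru))';
  end
  % Then fix X and update each row of Y symmetrically.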
Re: alternating least squares
Pseudo code for ALS in Octave:

  % A: m x n user-item matrix, k: number of latent features
  Users = rand(m, k);
  for i = 1:iterations
    Items = (Users \ A)';    % least-squares fit of item factors given Users
    Users = (Items \ A')';   % least-squares fit of user factors given Items
  endfor

Start with a random guess of the matrix Users (m x k), then compute the
matrix Items as the least-squares solution given Users, then recompute
Users as the least-squares solution given Items, and iterate until
convergence.

On Tue, Jan 8, 2013 at 3:10 PM, Sebastian Schelter s...@apache.org wrote:
> In each iteration, ALS first fixes the item-feature vectors and solves a
> least-squares problem for each user, and then fixes the user-feature
> vectors and solves a least-squares problem for each item.

--
Mohit
"When you want success as badly as you want the air, then you will get it.
There is no other secret of success." -Socrates
Re: alternating least squares
That's right, though 80% of the algorithm is in how you efficiently do
A \ Users, and in the presence of an overfitting term. You can definitely
do it sparsely when the loss function is set up cleverly. I tried to
explain it here, from slide 6:
http://www.slideshare.net/srowen/big-practical-recommendations-with-alternating-least-squares

On Tue, Jan 8, 2013 at 5:16 PM, Mohit Singh mohit1...@gmail.com wrote:
> Start with a random guess of the matrix Users (m x k), then compute the
> matrix Items as the least-squares solution given Users, then recompute
> Users as the least-squares solution given Items, and iterate until
> convergence.
Re: alternating least squares
This particular part of the algorithm can be seen as similar to a least
squares problem that might normally be solved by QR. I don't think that
the updates are quite the same, however.

On Tue, Jan 8, 2013 at 3:10 PM, Sebastian Schelter s...@apache.org wrote:
> This factorization is iteratively refined: in each iteration, ALS first
> fixes the item-feature vectors and solves a least-squares problem for
> each user, and then fixes the user-feature vectors and solves a
> least-squares problem for each item.
Re: alternating least squares
There's definitely a QR decomposition in there for me, since solving
A = X Y' for X gives X = A Y (Y'Y)^-1, and you need some means to compute
the inverse of that (small) matrix.

On Tue, Jan 8, 2013 at 5:27 PM, Ted Dunning ted.dunn...@gmail.com wrote:
> This particular part of the algorithm can be seen as similar to a least
> squares problem that might normally be solved by QR. I don't think that
> the updates are quite the same, however.
Re: alternating least squares
On Tue, Jan 8, 2013 at 6:41 PM, Sean Owen sro...@gmail.com wrote:
> There's definitely a QR decomposition in there for me, since solving
> A = X Y' for X gives X = A Y (Y'Y)^-1, and you need some means to
> compute the inverse of that (small) matrix.

Sean, I think I got it:
1) A Y is a handful of sparse matrix-vector products,
2) Y' Y is a dense matrix-matrix product of a flat matrix and a tall
matrix, producing a small square matrix,
3) inverting that matrix is not a big deal, since it is small.
Great! Thanks! It just was not immediately obvious at first look.
Now, the transition from ratings to 1s and 0s -- is this simply to handle
implicit feedback, or is it for some other reason?
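In Octave, the three steps above and their rough costs look like this
(nnz(A) is the number of stored entries, k the rank; a sketch of the
unregularized update only):

  AY = A * Y;        % (1) sparse times tall-dense: O(nnz(A) * k) flops
  G  = Y' * Y;       % (2) small dense k x k Gram matrix: O(n * k^2) flops
  X  = AY / G;       % (3) factor G once (O(k^3)), then m solves: O(m * k^2)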
Re: alternating least squares
But is it actually QR of Y?

On Tue, Jan 8, 2013 at 3:41 PM, Sean Owen sro...@gmail.com wrote:
> There's definitely a QR decomposition in there for me, since solving
> A = X Y' for X gives X = A Y (Y'Y)^-1, and you need some means to
> compute the inverse of that (small) matrix.
Re: alternating least squares
On Tue, Jan 8, 2013 at 7:18 PM, Ted Dunning ted.dunn...@gmail.com wrote:
> But is it actually QR of Y?

Ted,
This is my understanding: in the process of solving the least squares
problem, you end up inverting a small square matrix, (Y'Y)^-1. How it is
done is irrelevant. Since the matrix is square, one could do an LU
factorization, a.k.a. Gaussian elimination. However, since we are talking
about solving a 100x100 problem here, one might as well do it with a QR
factorization which, unlike LU, is stable no matter what.
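As an illustration, solving that small system via QR in Octave might look
like this (G standing for Y'*Y plus any regularization term, b for the
right-hand side; a sketch, not Mahout's actual code):

  [Q, R] = qr(G);        % G = Q*R, with Q orthogonal, R upper triangular
  x = R \ (Q' * b);      % back-substitution; never forms inv(G) explicitly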
Re: alternating least squares
Great.

On Tue, Jan 8, 2013 at 4:25 PM, Koobas koo...@gmail.com wrote:
> This is my understanding: in the process of solving the least squares
> problem, you end up inverting a small square matrix, (Y'Y)^-1.
Re: alternating least squares
On Tue, Jan 8, 2013 at 7:17 PM, Koobas koo...@gmail.com wrote:
> Now, the transition from ratings to 1s and 0s -- is this simply to
> handle implicit feedback, or is it for some other reason?

Okay, I got a little bit further in my understanding. The matrix of
ratings R is replaced with the binary matrix P; then R is used again in
the regularization. I get it. This takes care of the situations where you
have user-item interactions but you don't have the rating. So, it can
handle explicit feedback, implicit feedback, and mixed (partial / missing)
feedback. If I have implicit feedback, I just drop R altogether, right?
Now the only remaining trick is the Tikhonov regularization, which leads
to a couple of questions:
1) How much of a problem is overfitting?
2) How do I pick lambda?
3) How do I pick the rank of the approximation in the first place? How
does the overfitting problem depend on the rank of the approximation?
Re: alternating least squares
On Tue, Jan 8, 2013 at 8:54 PM, Koobas koo...@gmail.com wrote:
> If I have implicit feedback, I just drop R altogether, right?
> Now the only remaining trick is the Tikhonov regularization, which leads
> to a couple of questions:
> 1) How much of a problem is overfitting?
> 2) How do I pick lambda?
> 3) How do I pick the rank of the approximation in the first place?

I think the model you're referring to can use explicit or implicit
feedback. It's using the values -- however they are derived -- as weights
in the loss function rather than values to be approximated directly. So
you still use P even with implicit feedback. Of course you could also use
ALS to factor R directly if you wanted.

Overfitting is as much an issue as in any ML algorithm. It's hard to
quantify more than that, but you certainly don't want to use lambda = 0.
The right value of lambda depends on the data -- and depends even more on
what you mean by lambda! There are different usages in different papers.
More data means you need less lambda. The effective weight on the
overfitting / Tikhonov terms is about 1 in my experience -- these terms
should be weighted roughly like the loss-function terms. But that can mean
using values for lambda much smaller than 1, since lambda is just one
multiplier of those terms in many formulations.

The rank has to be greater than the effective rank of the data (of
course). It's also something you have to fit to the data experimentally.
For normal-ish data sets of normal-ish size, the right number of features
is probably 20 - 100; I'd test in that range to start. More features tend
to let the model overfit more, so in theory you need more lambda with more
features, all else equal. It's *really* something you just have to fit to
representative sample data. The optimal answer is way too dependent on the
nature, distribution, and size of the data to say more than the above.
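In practice, "fit it to representative sample data" means a held-out
parameter sweep over lambda and rank. A minimal Octave sketch, where
train_als and heldout_rmse are hypothetical helpers standing in for
whatever trainer and evaluation metric you actually use:

  best_err = Inf;
  for k = [20 50 100]                           % candidate ranks
    for lambda = [0.001 0.01 0.1 1]             % candidate multipliers
      [X, Y] = train_als(R_train, k, lambda);   % hypothetical trainer
      err = heldout_rmse(R_heldout, X, Y);      % hypothetical evaluator
      if err < best_err
        best_err = err; best_k = k; best_lambda = lambda;
      end
    end
  end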