Data distribution guidance for recommendation engines
Hi all,

This question stems from my use of the alternating least squares method in Mahout, but leans toward the theoretical side. If this is the wrong place for such a question, I apologize up front and would gladly take it to a more appropriate forum, as per your suggestions.

I have been thinking about how the distribution of rating data can influence a model built using ALS, or any matrix factorization method for that matter. If I split my data into train and test sets, I can show good performance of the model on the train set. What might I expect given an uneven distribution of ratings? Imagine a situation where 50% of the ratings are 1s and the rest are 2-5. Will the model be biased towards rating items a 1? Do people pre-process their data to avoid skewed rating distributions?

How about the rating scale itself? For example, given [1:3] vs [1:10] ranges, with the former you've got, say, a 1/3 chance of predicting the correct rating, while with the latter it is 1/10. Or, when is sparse too sparse? Or can these questions not even be answered because they are too system/context specific?

Ultimately, I'm trying to figure out under what conditions one would look at a model and say that it is crap, pardon my language. Do any more experienced users have advice to offer on when a factor model would break down, or on any of my points above?

Thanks in advance,
-Chloe
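One way to make the "when is a model bad" question concrete is to compare the model's held-out error against a trivial baseline that always predicts the training mean. This is only a minimal sketch in plain Java, not a Mahout API: the class name BaselineCheck, the ratings, and the model predictions are all hypothetical values chosen for illustration. With a skewed distribution a constant predictor can look deceptively good, so a model that does not clearly beat it on held-out data is probably not learning much.

```java
class BaselineCheck {
    /** Arithmetic mean of the given ratings. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Root mean squared error between predictions and held-out ratings. */
    static double rmse(double[] predicted, double[] actual) {
        double sumSq = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double err = predicted[i] - actual[i];
            sumSq += err * err;
        }
        return Math.sqrt(sumSq / actual.length);
    }

    /** RMSE of a baseline that predicts the same constant for every held-out rating. */
    static double constantRmse(double constant, double[] actual) {
        double[] predicted = new double[actual.length];
        java.util.Arrays.fill(predicted, constant);
        return rmse(predicted, actual);
    }

    public static void main(String[] args) {
        // Hypothetical skewed training ratings: half of them are 1s.
        double[] train = {1, 1, 1, 1, 1, 2, 3, 4, 5, 3};
        // Hypothetical held-out ratings and the model's predictions for them.
        double[] heldOut = {1, 1, 3, 5};
        double[] modelPredictions = {1.2, 0.9, 2.6, 4.4};

        double baseline = constantRmse(mean(train), heldOut);
        double model = rmse(modelPredictions, heldOut);
        System.out.println("baseline RMSE = " + baseline);
        System.out.println("model RMSE    = " + model);
        System.out.println("model beats baseline: " + (model < baseline));
    }
}
```

Dividing both RMSE values by the width of the rating scale (2 for a 1-3 scale, 9 for a 1-10 scale) is one simple way to compare errors across scales, though that normalization is a convention assumed here rather than anything specific to ALS.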
Re: Fold-in for ALSWR
Thanks again for replying.

I didn't expect that, partly because I'm using explicit feedback, not implicit, but mostly because the part files in U/ and V/ multiplied together give me back predicted ratings on a 1-4 scale. Would converting the 0/1 connection indicator to a 1-4 scale be at all reasonable for capturing the strength of the connection, or is that entirely unjustified?

-Chloe
Re: Fold-in for ALSWR
Dear Sean,

Thanks a lot for a quick and helpful reply. Having been sidetracked with another project, I revisited the problem I posed in my post over the weekend and, unfortunately, have a follow-up question.

The problem I'm facing with my implementation of your explanation is that the predicted ratings for new users seem to be on a very different scale than the original ratings the model is based on, and I'm wondering what I've done wrong.

To recap my steps in pseudocode:

1. Use a text file of ratings on a 1-4 scale to generate my model, afterward given by files U/part-m-0 and V/part-m-0, so that Ratings = UV'.

2. Fold in the new user. For example, given 10 items, a new user's ratings look like {0,1,0,3,4,0,2,3,0,1}, with 0 meaning unrated:

    Vector newRatings = new DenseVector(new double[] {0,1,0,3,4,0,2,3,0,1});
    Matrix Au = new DenseMatrix(newRatings.size(), 1);  // items x 1
    Au.assignColumn(0, newRatings);
    QRDecomposition qr = new QRDecomposition(V);  // V holds the item features (items x features)
    Matrix Xu = qr.solve(Au);  // least-squares user features (features x 1)
    Matrix predictedUserRatingsForAllItems = (Xu.transpose()).times(V.transpose());  // 1 x items

3. Read the predictions out of the single row:

    Vector predictedUserRatingsVector = predictedUserRatingsForAllItems.viewRow(0);

The predictedUserRatingsVector from step 3, however, gives a top-10 item result with scores ranging from 0.46 to 0.62. These numbers go up with the number of new items rated. This means that even for item 5, which was given the highest possible score of 4, this approach can't give back a rating close to the actual value for an item the user has already rated.

Moreover, the new user's ratings I test, {0,1,0,3,4,0,2,3,0,1}, are actually identical to those of an existing user that was used to build the model and whose predicted ratings are very reasonable, looking like {0.5,0.98,0.89,3.23,4.1,1.01,2.32,2.99,3.5,1.1}.

I must be doing something wrong or missing something. Is there anything you or anyone from the list with fold-in experience can suggest I try or consider that would explain why this is happening? I expected that predicted ratings from fold-in would not be as good as those from regenerating the model, but not this bad.

Many thanks,
Chloe
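One possible contributor to the scale mismatch described above, offered as an assumption rather than a diagnosis, is that solving against the full ratings vector treats every 0 as an observed rating of zero, whereas an explicit-feedback ALS model is normally fit only on the ratings that exist. The toy sketch below uses a single latent feature so that the least-squares solution has a closed form, x = sum(v_i * a_i) / sum(v_i^2). The class name FoldInZeros and the item factors are hypothetical and chosen so that the rated items are reproduced exactly by a user factor of 2.0. It is plain Java and does not use Mahout.

```java
class FoldInZeros {
    /**
     * One-feature least-squares fit of a user factor.
     * When ratedOnly is true, items whose rating is 0 (unrated) are skipped;
     * otherwise the zeros are treated as observed ratings.
     */
    static double fitUserFactor(double[] itemFactor, double[] ratings, boolean ratedOnly) {
        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < ratings.length; i++) {
            if (ratedOnly && ratings[i] == 0.0) {
                continue;
            }
            num += itemFactor[i] * ratings[i];
            den += itemFactor[i] * itemFactor[i];
        }
        return num / den;
    }

    public static void main(String[] args) {
        // The ratings vector from the post; 0 marks an unrated item.
        double[] ratings = {0, 1, 0, 3, 4, 0, 2, 3, 0, 1};
        // Hypothetical one-feature item factors (rating / 2 on the rated items).
        double[] itemFactor = {1.0, 0.5, 0.5, 1.5, 2.0, 1.5, 1.0, 1.5, 2.0, 0.5};

        double xAll = fitUserFactor(itemFactor, ratings, false);
        double xRated = fitUserFactor(itemFactor, ratings, true);

        // Predicted rating for item 5 (index 4), which the user rated 4.
        System.out.println("zeros as ratings: " + xAll * itemFactor[4]);
        System.out.println("rated items only: " + xRated * itemFactor[4]);
    }
}
```

In this toy case the rated-only fit recovers the rating of 4 for item 5, while the fit that includes the zeros is pulled toward 0. With several features the same restriction would mean solving against only the rows of V for the rated items, typically with the regularization term used during training.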
Fold-in for ALSWR
Hi everyone,

I am reaching out to the list for help/advice on implementing fold-in with the Alternating Least Squares algorithm in Mahout, a problem on which I am stumped. I've read other posts on the list and over on SO, like:

http://stackoverflow.com/questions/12444231/svd-matrix-conditioning-how-to-project-from-original-space-to-conditioned-spac
and
http://stackoverflow.com/questions/12857693/mahout-how-to-make-recommendations-for-new-users

including the slide show talk posted in the last link. This slide show seems to be the most relevant to my purposes, specifically slide 14, but I can't fully understand it.

At the end of ALS I have Ratings = A = UM', where U is the user feature matrix and M is the item feature matrix. For an existing user with a new rating, I plan to update only Auser, which is (1 x number of items), by replacing/adding the new rating at the appropriate index, and then do:

    Uuser = Auser * M    (1 x items)(items x features)

Then I would substitute Uuser back into U and get new recommendations.

How do I address a NEW user? From the slide:

    X = A (Y')^-1
or
    X = A Y (Y'Y)^-1

Is the second preferable? If so, why? I would appreciate a more 'wordy' explanation, or someone pointing me to any papers I can read on this.

Also, as a slight offshoot, given that Mahout doesn't have matrix inversion, is it more advisable to use a QR decomposition to find a pseudoinverse, or an outside package to do the inversion?

Apologies for the long post, which I did try to make concise, and thank you for any help you can offer.

Chloe
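For a single user row a, the second form quoted from the slide, x = a Y (Y'Y)^-1, is the least-squares solution of Y x' = a', and it can be computed by solving the small system (Y'Y) x' = Y'a' rather than forming an inverse. The sketch below does this in plain Java with Gaussian elimination. The class name NormalEquationsFoldIn and the example numbers are hypothetical, it is not a Mahout API, it treats every entry of a as observed, and it does not guard against a singular Y'Y.

```java
class NormalEquationsFoldIn {
    /**
     * Solves (Y'Y) x = Y'a for x, where y is (items x features)
     * and a holds one user's ratings (length = items).
     */
    static double[] foldIn(double[][] y, double[] a) {
        int n = y.length;
        int k = y[0].length;

        // Build the augmented matrix [Y'Y | Y'a], which is k x (k + 1).
        double[][] g = new double[k][k + 1];
        for (int i = 0; i < n; i++) {
            for (int p = 0; p < k; p++) {
                for (int q = 0; q < k; q++) {
                    g[p][q] += y[i][p] * y[i][q];
                }
                g[p][k] += y[i][p] * a[i];
            }
        }

        // Forward elimination with partial pivoting.
        for (int col = 0; col < k; col++) {
            int pivot = col;
            for (int r = col + 1; r < k; r++) {
                if (Math.abs(g[r][col]) > Math.abs(g[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = g[col];
            g[col] = g[pivot];
            g[pivot] = tmp;
            for (int r = col + 1; r < k; r++) {
                double factor = g[r][col] / g[col][col];
                for (int c = col; c <= k; c++) {
                    g[r][c] -= factor * g[col][c];
                }
            }
        }

        // Back substitution.
        double[] x = new double[k];
        for (int r = k - 1; r >= 0; r--) {
            double s = g[r][k];
            for (int c = r + 1; c < k; c++) {
                s -= g[r][c] * x[c];
            }
            x[r] = s / g[r][r];
        }
        return x;
    }

    public static void main(String[] args) {
        // Hypothetical item features (4 items x 2 features).
        double[][] y = {{1, 0}, {0, 1}, {1, 1}, {2, 1}};
        // Ratings generated from the user features (1.5, 0.5).
        double[] a = {1.5, 0.5, 2.0, 3.5};
        double[] x = foldIn(y, a);
        System.out.println("x = [" + x[0] + ", " + x[1] + "]");
    }
}
```

The system being solved is only features x features, so it stays small however many items there are. A QR-based solve of Y x' = a' should give the same least-squares answer and is usually the better-conditioned route when Y'Y is close to singular.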