First, I'd ignore ratings. There are too many problems with trying to normalize
a rating or understand what it means, and if you follow the rest of this advice
the ratings will be ignored anyway. Ratings were used in older recommenders but
have become largely meaningless with recent thinking. Netflix made the idea
popular with the Netflix Prize, but since then even they do not use ratings to
recommend, because ranking the best recs is far more important than predicting
your rating. We can handle negative preferences in a different way, but that
will come later.

Use the Mahout driver 'spark-rowsimilarity'. It reads CSV-style text data,
creates the matrix, compares rows (users in your case), and outputs one user
per line (user-id, list of similar users). Unlike the older Hadoop MapReduce
version of this in Mahout, the Spark version maintains your IDs, so the IDs in
the output are the same ones you supplied in the input.
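
To make the "ignore ratings" step concrete, here is a minimal Python sketch
that turns (user_id, item_id, rating) rows into one line per user. Two things
in it are my assumptions, not something stated above: that only strictly
positive ratings count as a preference, and that the line layout is user id,
a tab, then space-separated item ids. The helper name to_row_lines is
hypothetical. Check the docs linked below for the delimiters the driver
actually expects.

```python
import csv
import io
from collections import defaultdict


def to_row_lines(csv_text, row_delim="\t", elem_delim=" "):
    """Group (user_id, item_id, rating) rows into one text line per user.

    Assumption: a rating > 0 is treated as a plain "prefers" signal and the
    rating value itself is then discarded.
    """
    prefs = defaultdict(list)
    for user_id, item_id, rating in csv.reader(io.StringIO(csv_text)):
        if float(rating) <= 0:
            continue  # skip non-positive ratings entirely
        if item_id not in prefs[user_id]:
            prefs[user_id].append(item_id)  # keep first-seen order, no duplicates
    return [user + row_delim + elem_delim.join(items)
            for user, items in prefs.items()]


sample = "u1,i1,1\nu1,i2,-1\nu2,i1,0.5\nu2,i3,1\n"
lines = to_row_lines(sample)
```

With the sample above, u1 keeps only i1 (the -1 rating is dropped) and u2
keeps i1 and i3.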

This will use LLR (log-likelihood ratio) to find non-coincidental similarities
in the things users prefer. LLR has been shown to be much better at detecting
similarities in preference data. Cosine may be good for text similarity, but
even there you'd want to use LLR to downsample out the noise terms first.
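
For intuition, here is a small Python sketch of the LLR score for a pair of
users, using the 2x2 contingency-table form described in the
surprise-and-coincidence post linked below. It illustrates the idea and is
not the driver's own code. The function names and the n_items parameter
(the total number of distinct items) are my own.

```python
from math import log


def x_log_x(x):
    # By convention 0 * log(0) is taken to be 0.
    return x * log(x) if x > 0 else 0.0


def entropy(*counts):
    # Unnormalized entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def user_llr(items_a, items_b, n_items):
    """LLR between two users, each given as a set of preferred item ids."""
    k11 = len(items_a & items_b)        # items both users prefer
    k12 = len(items_a - items_b)        # items only user A prefers
    k21 = len(items_b - items_a)        # items only user B prefers
    k22 = n_items - k11 - k12 - k21     # items neither user prefers
    return llr(k11, k12, k21, k22)


score = user_llr({"i1", "i2"}, {"i1", "i2"}, n_items=4)
```

A table whose cells are what independence would predict, such as
llr(10, 10, 10, 10), scores essentially zero, while overlap that is unlikely
to be coincidental scores high.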

See some docs here:
http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
(search for "spark-rowsimilarity").

LLR is discussed here: 
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html and inside 
this free ebook: https://www.mapr.com/practical-machine-learning
  
On Apr 8, 2015, at 12:03 PM, Jonathan Seale <jonat...@samegrain.com> wrote:

Hi all,

I'm new to the community and Mahout. Happy to be here. :-)

I have the following problem that I'm having difficulty with. I've set up an 
instance on Amazon with Mahout and can run some basic machine learning tasks 
(just testing). Now I'm trying to do a specific task and am unsure how to 
proceed.

Imagine I have a data file containing the following columns: user_id, item_id, 
and rating, where rating is how each user rated the item on a scale of -1 to 1 
(the necessity of negative ratings will become apparent in a minute). 
Ultimately, what I'm trying to do is create a similarity matrix that measures 
the similarity between all pairs of USERS. To do this, I would like to 
transform the users' ratings into a matrix (rows are users, columns are items) 
and then run RowSimilarity to find the dot product / cosine between all rows.

I feel like my problem is simple and has probably been done 1000 times, but I 
can't seem to find any documentation directly on the subject. The best I've 
been able to do so far is use the similaritem function (where I've swapped item 
for user). While it works and gives decent results, it's mathematically not 
quite what I want. Help!

Thanks!
Jonathan



