Hi,

The amount of work currently going into finishing 0.4 is amazing. I can hardly follow all the mails, and it's very cool to see. I had some time today to write down feature ideas I have for version 0.5 and want to share them here for feedback.

First, I can think of two possible new features for RecommenderJob:

* add an option that makes the RecommenderJob use the output of the related o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob instead of computing the similarities again each time. This would let users
    choose the interval at which to precompute the item similarities

* add an option to make the RecommenderJob attach "recommended because of" items to each recommended item (analogous to what is already available at GenericItemBasedRecommender.recommendedBecause(...)). Showing these to users
    helps them understand why an item was recommended to them


Second, I'd like Mahout to have a Map/Reduce implementation of the algorithm described in Y. Zhou et al.: "Large-scale Parallel Collaborative Filtering for the Netflix Prize" (http://bit.ly/cUPgqr).

Here R is the matrix of ratings that users have given to movies. Each user and each movie is projected onto a "feature" space (the number of features is fixed in advance) so that the product of the resulting matrices U and M is a low-rank approximation/factorization of R.

Determining U and M is modelled mathematically as an optimization problem, with regularization applied to avoid overfitting to the known entries. This problem is solved with an iterative approach called alternating least squares (ALS).
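
To make the optimization problem concrete, here is the objective as I read it from the paper (its "weighted-lambda-regularization" variant). The symbols are my assumptions about the paper's notation and do not appear above: u_i and m_j are the feature vectors of user i and movie j, I is the set of known (user, movie) ratings, n_{u_i} and n_{m_j} are the number of ratings of user i and of movie j, and lambda is the regularization weight. Please treat it as a sketch to be checked against the paper.

```latex
\min_{U, M} \; \sum_{(i,j) \in I} \left( r_{ij} - u_i^{T} m_j \right)^2
  + \lambda \left( \sum_i n_{u_i} \lVert u_i \rVert^2
                 + \sum_j n_{m_j} \lVert m_j \rVert^2 \right)
```

ALS fixes M and solves for U, then fixes U and solves for M, and repeats. With one side fixed, each remaining subproblem is an ordinary regularized least-squares problem.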

If I understand the paper correctly, this approach is easily parallelizable. To estimate a user feature vector you only need access to all of that user's ratings and to the feature vectors of all movies he/she rated. To estimate a movie feature vector you need access to all of its ratings and to the feature vectors of the users who rated it.
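
A minimal sketch of one such per-user update in plain Python, assuming the regularized least-squares step from my reading of the paper: solve (sum of m m^T over the rated movies + lambda * n * I) u = sum of r * m, where n is the number of ratings of that user. The names update_user_vector and solve_linear_system are made up for illustration and are not Mahout code. The update for a movie vector would be the same with the roles of users and movies swapped.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def update_user_vector(ratings, movie_features, lam):
    """Re-estimate one user's feature vector with the movie vectors held fixed.

    ratings:        dict movie_id -> rating, all ratings of this one user
    movie_features: dict movie_id -> list of floats (current movie vectors)
    lam:            regularization weight (lambda)
    """
    k = len(next(iter(movie_features.values())))
    n_ratings = len(ratings)
    # Start with the regularization term lambda * n * I on the diagonal.
    a = [[lam * n_ratings if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    # Only the movies this user rated contribute, which is why each user
    # can be processed independently of all the others.
    for movie_id, rating in ratings.items():
        m = movie_features[movie_id]
        for i in range(k):
            b[i] += rating * m[i]
            for j in range(k):
                a[i][j] += m[i] * m[j]
    return solve_linear_system(a, b)


# Toy example with two features and two rated movies.
movies = {1: [1.0, 0.0], 2: [0.0, 1.0]}
user_vector = update_user_vector({1: 4.0, 2: 2.0}, movies, lam=0.5)
print(user_vector)  # [2.0, 1.0]
```

Since the function only touches one user's ratings and the vectors of the movies that user rated, it could in principle run inside a mapper or reducer keyed by user ID. How to get the movie vectors to the right place efficiently is the part I am unsure about.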

An unknown preference can then be predicted by computing the dot product of the corresponding user and movie feature vectors.
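
As a tiny illustration of that prediction step (predict_rating is a hypothetical name, and the vectors are made-up numbers, not results):

```python
def predict_rating(user_vector, movie_vector):
    """Predict a preference as the dot product of the two feature vectors."""
    if len(user_vector) != len(movie_vector):
        raise ValueError("feature vectors must have the same length")
    return sum(u * m for u, m in zip(user_vector, movie_vector))


# A user with features [2.0, 1.0] and a movie with features [1.0, 0.5].
print(predict_rating([2.0, 1.0], [1.0, 0.5]))  # 2.5
```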

It would be very nice if someone who is familiar with the paper, or who has time for a brief look at it, could validate this, because I don't fully trust my mathematical analysis.

I have already created a first prototype implementation, but I definitely need help from someone to check it conceptually, optimize the math-related parts and help me test it. Maybe that could be an interesting task for the upcoming Mahout hackathon in Berlin.

--sebastian

PS: @isabel I won't make it to the dinner today, need to rehearse my talk...
