Some ideas for Mahout 0.5

Sebastian Schelter Mon, 04 Oct 2010 07:04:10 -0700

Hi,

The amount of work that is currently being put into finishing 0.4 is amazing. I can hardly follow all the mails, very cool to see that. I had some time today to write down ideas for features I have in mind for version 0.5 and want to share them here for feedback.


First, I can think of possible new features for RecommenderJob:

* add an option that makes the RecommenderJob use the output of the related o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob instead of computing the similarities again each time. This will give users the possibility to choose the interval in which to precompute the item similarities

* add an option to make the RecommenderJob attach "recommended because of" items to each recommended item (analogous to what is already available at GenericItemBasedRecommender.recommendedBecause(...)). Showing this to users helps them understand why some item was recommended to them
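
To make the two options above a bit more concrete, here is a toy sketch in plain Python. It is not Mahout code: the function name, the nested-dict layout of the precomputed similarities and the choice to rank the "because" items by similarity times preference are all assumptions made for illustration only.

```python
def recommend_with_reasons(user_prefs, similarities, how_many=1, num_reasons=2):
    """Item-based recommendation from precomputed similarities (toy sketch).

    user_prefs:   dict item -> preference value of one user
    similarities: dict item -> dict other_item -> similarity, assumed to be
                  precomputed once (hypothetical layout, not a Mahout format)
    Returns a list of (item, estimated_preference, because_items).
    """
    # Candidates are items similar to something the user rated,
    # minus the items the user already rated.
    candidates = set()
    for item in user_prefs:
        candidates.update(similarities.get(item, {}))
    candidates -= set(user_prefs)

    scored = []
    for candidate in candidates:
        contributions = []
        for item, pref in user_prefs.items():
            sim = similarities.get(item, {}).get(candidate)
            if sim is not None:
                contributions.append((sim * pref, sim, item))
        total_sim = sum(abs(sim) for _, sim, _ in contributions)
        if total_sim == 0:
            continue
        # Similarity-weighted average of the user's preferences.
        estimate = sum(c for c, _, _ in contributions) / total_sim
        # Assumption: the rated items that contributed most are the "reasons".
        ranked = sorted(contributions, reverse=True)[:num_reasons]
        reasons = [item for _, _, item in ranked]
        scored.append((estimate, candidate, reasons))

    scored.sort(reverse=True)
    return [(c, est, why) for est, c, why in scored[:how_many]]


similarities = {"A": {"C": 0.9, "D": 0.1}, "B": {"C": 0.2, "D": 0.8}}
result = recommend_with_reasons({"A": 5.0, "B": 1.0}, similarities,
                                how_many=1, num_reasons=1)
```

Here the similarities are only read, never recomputed, so they could come from a job that runs at whatever interval the user picks.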

Second, I'd like Mahout to have a Map/Reduce implementation of the algorithm described in Y. Zhou et al.: "Large-scale Parallel Collaborative Filtering for the Netflix Prize" (http://bit.ly/cUPgqr).

Here R is the matrix of ratings of users towards movies. Each user and each movie is projected onto a "feature" space (the number of features is defined beforehand) so that the product of the resulting matrices U and M is a low-rank approximation/factorization of R.

Determining U and M is mathematically modelled as an optimization problem, and additionally some regularization is applied to avoid overfitting to the known entries. This problem is solved with an iterative approach called alternating least squares (ALS).
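
For reference, the objective can be written down roughly as follows. This is my recollection of the weighted-lambda regularization used in the paper, so please check it against the original. The symbols beyond R, U and M are notation assumed here: u_i and m_j are the feature vectors of user i and movie j, I is the set of known (user, movie) ratings, n_{u_i} and n_{m_j} are the numbers of ratings of user i and of movie j, and \lambda is the regularization parameter.

```latex
\min_{U,M} \; f(U, M) =
  \sum_{(i,j) \in I} \left( r_{ij} - u_i^{T} m_j \right)^2
  + \lambda \left( \sum_i n_{u_i} \lVert u_i \rVert^2
                 + \sum_j n_{m_j} \lVert m_j \rVert^2 \right)
```

ALS fixes M and solves for U, then fixes U and solves for M, and repeats. With one of the two matrices fixed, the problem becomes an ordinary regularized least squares problem for each row of the other.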

If I understand the paper correctly, this approach is easily parallelizable. To estimate a user feature vector you only need access to all of that user's ratings and the feature vectors of all movies he/she rated. To estimate a movie feature vector you need access to all its ratings and to the feature vectors of the users who rated it.
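
A minimal sketch of one such per-user update in plain Python, assuming the regularized least squares form (sum of m m^T over the rated movies, plus lambda times the number of ratings on the diagonal). The function names and the dict-based data layout are made up for illustration, and this is not the prototype mentioned below.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting. Inputs are not modified."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def update_user_vector(user_ratings, movie_features, lam):
    """Recompute one user's feature vector with the movie vectors held fixed.

    user_ratings:   dict movie_id -> rating, for a single user
    movie_features: dict movie_id -> feature vector (list of floats)
    lam:            regularization parameter (assumed name)
    Only the vectors of the movies this user rated are touched.
    """
    k = len(next(iter(movie_features.values())))
    n = len(user_ratings)
    # a = sum of m m^T over rated movies + lam * n * identity
    a = [[lam * n if i == j else 0.0 for j in range(k)] for i in range(k)]
    # b = sum of rating * m over rated movies
    b = [0.0] * k
    for movie_id, rating in user_ratings.items():
        m = movie_features[movie_id]
        for i in range(k):
            b[i] += rating * m[i]
            for j in range(k):
                a[i][j] += m[i] * m[j]
    return solve(a, b)


u = update_user_vector({1: 4.0, 2: 2.0},
                       {1: [1.0, 0.0], 2: [0.0, 1.0]},
                       lam=0.5)
```

Since each call depends only on one user's ratings and the vectors of the movies that user rated, the calls for different users are independent of each other. The movie update is symmetric, with the roles of users and movies swapped.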

An unknown preference can then be predicted by computing the dot product of the corresponding user and movie feature vectors.
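
In code the prediction step is tiny. The sketch below assumes feature vectors stored as plain Python lists, and the function name is hypothetical:

```python
def predict(user_vector, movie_vector):
    """Estimate a preference as the dot product of the two feature vectors."""
    return sum(u * m for u, m in zip(user_vector, movie_vector))


estimate = predict([2.0, 1.0], [0.5, 3.0])
```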

It would be very nice if someone who is familiar with the paper, or has the time for a brief look into it, could validate that, because I don't fully trust my mathematical analysis.

I already created a first prototype implementation, but I definitely need help from someone to check it conceptually, optimize the math-related parts and help me test it. Maybe that could be an interesting task for the upcoming Mahout hackathon in Berlin.


--sebastian

PS: @isabel I won't make it to the dinner today, need to rehearse my talk...
