Hi,
The amount of work currently going into finishing 0.4 is amazing. I
can hardly follow all the mails, and it is very cool to see. I had some
time today to write down ideas for features for version 0.5 and
want to share them here for feedback.
First, I can think of two possible new features for RecommenderJob:
* add an option that makes the RecommenderJob use the output of the
  related o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob
  instead of computing the similarities again each time. This would
  let users choose the interval at which to precompute the item
  similarities.
* add an option to make the RecommenderJob attach "recommended because
  of" items to each recommended item (analogous to what is already
  available in GenericItemBasedRecommender.recommendedBecause(...)).
  Showing these to users helps them understand why an item was
  recommended to them.
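To illustrate the second option, here is a minimal sketch in Python
(not Mahout code). It assumes the item similarities are already
available as precomputed pairs and scores each of the user's items as
similarity times preference. That scoring, the function name and the
data layout are my own assumptions for illustration and not necessarily
what recommendedBecause(...) does internally.

```python
def recommended_because(user_prefs, similarities, recommended_item, how_many):
    """Rank a user's own items by how strongly they explain a recommendation.

    user_prefs:   dict item -> preference value of this user (hypothetical layout)
    similarities: dict (item_a, item_b) -> similarity, stored once per pair
    """
    scored = []
    for item, pref in user_prefs.items():
        # The pair may be stored in either order, so try both.
        sim = similarities.get((recommended_item, item))
        if sim is None:
            sim = similarities.get((item, recommended_item))
        if sim is None:
            continue  # no precomputed similarity for this pair
        # Assumed scoring: similarity weighted by the user's preference.
        scored.append((sim * pref, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:how_many]]
```

Called with a user's preferences and the precomputed pairs, it returns
the how_many items of that user that contribute most to the
recommendation.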
Second, I'd like Mahout to have a Map/Reduce implementation of the
algorithm described in Y. Zhou et al.: "Large-scale Parallel
Collaborative Filtering for the Netflix Prize" (http://bit.ly/cUPgqr).
Here R is the matrix of ratings of users for movies, and each user
and each movie is projected onto a "feature" space (the number of
features is fixed in advance) so that the product of the resulting
matrices U and M is a low-rank approximation/factorization of R.
Determining U and M is modelled mathematically as an optimization
problem, with some regularization applied to avoid overfitting to the
known entries. This problem is solved with an iterative approach
called alternating least squares (ALS).
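To make the optimization problem concrete, here is the objective as I
read it in the paper. The notation is my own transcription and should
be checked against the paper: I is the set of known (user, movie)
pairs, u_i and m_j are the user and movie feature vectors, \lambda is
the regularization weight, and n_{u_i} and n_{m_j} are the numbers of
ratings of user i and movie j.

```latex
f(U, M) = \sum_{(i,j) \in I} \left( r_{ij} - u_i^{T} m_j \right)^2
        + \lambda \left( \sum_i n_{u_i} \lVert u_i \rVert^2
                       + \sum_j n_{m_j} \lVert m_j \rVert^2 \right)
```

ALS fixes M and solves for U, then fixes U and solves for M, and
repeats. With one side fixed, each step is an ordinary regularized
least-squares problem.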
If I understand the paper correctly, this approach is easily
parallelizable. To estimate a user feature vector you only need
access to all of that user's ratings and the feature vectors of all
movies he/she rated. To estimate a movie feature vector you need access
to all its ratings and to the feature vectors of the users who rated it.
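A minimal sketch of one such estimation step, in plain Python rather
than Map/Reduce. It solves the regularized normal equations for a
single user (or, symmetrically, a single movie), with the
regularization weighted by the number of ratings, which is my reading
of the paper. The function names and data layout are hypothetical, and
the small Gaussian elimination is only there to keep the example
self-contained.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def update_vector(ratings, other_vectors, lam):
    """Re-estimate one feature vector from its ratings alone.

    ratings:       dict other_id -> rating (a user's movies, or a movie's users)
    other_vectors: dict other_id -> current feature vector of the other side
    lam:           regularization weight
    """
    if not ratings:
        raise ValueError("need at least one rating")
    k = len(next(iter(other_vectors.values())))
    a = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for other_id, rating in ratings.items():
        v = other_vectors[other_id]
        for p in range(k):
            b[p] += rating * v[p]            # sum of rating * vector
            for q in range(k):
                a[p][q] += v[p] * v[q]       # sum of outer products
    for p in range(k):
        a[p][p] += lam * len(ratings)        # regularization on the diagonal
    return solve(a, b)
```

Because update_vector only touches the ratings and vectors of one user
or one movie, all users can be updated independently of each other, and
likewise all movies, which is what makes the step parallelizable.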
An unknown preference can then be predicted by computing the dot product
of the corresponding user and movie feature vectors.
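As a small illustration of the prediction step (the function name is
hypothetical and the vectors are plain Python lists):

```python
def predict(user_vector, movie_vector):
    """Predict a preference as the dot product of two feature vectors."""
    if len(user_vector) != len(movie_vector):
        raise ValueError("feature vectors must have the same length")
    return sum(u * m for u, m in zip(user_vector, movie_vector))
```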
It would be very nice if someone who is familiar with the paper, or has
the time for a brief look into it, could validate that, because I don't
fully trust my mathematical analysis.
I have already created a first prototype implementation, but I
definitely need help from someone to check it conceptually, optimize
the math-related parts and help me test it. Maybe that could be an
interesting task for the upcoming Mahout hackathon in Berlin.
--sebastian
PS: @isabel I won't make it to the dinner today, need to rehearse my
talk...