For Decision Forests, my goal for 0.5 is to add a 'full' implementation: one that can build random forests using the whole dataset, even if it's split across many machines. I found the following paper very interesting: http://www.cba.ua.edu/~mhardin/rainforest.pdf although the described approach doesn't work as-is for numerical attributes.
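For context, the core idea of the RainForest paper is that tree construction only needs per-attribute AVC-sets (Attribute-Value, Classlabel counts), which for categorical attributes are bounded by the number of distinct values and add up component-wise across partitions. Here is a minimal pure-Python sketch (my own function names, not Mahout code); it also hints at the limitation above: for a numerical attribute the count table can have one entry per distinct value, i.e. as many as there are rows, so the approach doesn't carry over as-is.

```python
from collections import defaultdict

def avc_set(rows, attr, label):
    """AVC-set of one attribute at one tree node: counts of
    (attribute value, class label) pairs. For a categorical attribute
    its size is bounded by the number of distinct values, not by the
    number of rows, so it fits in memory even for huge datasets."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row[attr]][row[label]] += 1
    return counts

def merge_avc(a, b):
    """AVC-sets built on different machines/partitions add component-wise,
    which is what makes the counting phase map-reduce friendly."""
    out = defaultdict(lambda: defaultdict(int))
    for avc in (a, b):
        for value, cls_counts in avc.items():
            for cls, n in cls_counts.items():
                out[value][cls] += n
    return out

def gini_of_split(avc):
    """Weighted Gini impurity of splitting on every value of the attribute."""
    total = sum(sum(c.values()) for c in avc.values())
    g = 0.0
    for cls_counts in avc.values():
        n = sum(cls_counts.values())
        impurity = 1.0 - sum((c / n) ** 2 for c in cls_counts.values())
        g += (n / total) * impurity
    return g

# Toy rows: (outlook, play?) -- column 0 is the attribute, column 1 the label.
rows = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("rain", "yes"), ("rain", "no"),
        ("overcast", "yes"), ("sunny", "no")]
avc = avc_set(rows, 0, 1)
```

Evaluating every candidate split from such counts is exactly what lets each machine scan only its own partition and ship tiny count tables to a coordinator.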
The implementation should at least work for the following dataset: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2304&categoryID=248 . It's 50 GB, and a small subset is available in the UCI repository. It contains only categorical attributes, and it's big enough to be a good candidate.

On another note, my svn password has not been restored yet, so I am more a contributor than a committer =P

On Mon, Oct 4, 2010 at 3:11 PM, Ted Dunning <[email protected]> wrote:

> My own feeling is that we need to get some sort of recommender that supports
> side information, possibly also as a classifier.
>
> As everybody knows, I have lately been quite enamored of Menon and Elkan's
> paper on Latent Factor Log-Linear models. It seems to subsume most other
> factorization methods and supports side data very naturally. Training is
> reportedly very fast using SGD techniques.
>
> The paper is here: http://arxiv.org/abs/1006.2156
>
> On Mon, Oct 4, 2010 at 7:03 AM, Sebastian Schelter <[email protected]> wrote:
>
>> Hi,
>>
>> The amount of work currently being put into finishing 0.4 is amazing; I can
>> hardly follow all the mails, very cool to see that. I've had some time today
>> to write down ideas for features I have in mind for version 0.5 and want to
>> share them here for feedback.
>>
>> First, here are possible new features for the RecommenderJob:
>>
>> * add an option that makes the RecommenderJob use the output of the related
>> o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob instead of computing
>> the similarities again each time; this gives users the possibility to choose
>> the interval at which to precompute the item similarities
>>
>> * add an option to make the RecommenderJob include "recommended because of"
>> items with each recommended item (analogous to what is already available in
>> GenericItemBasedRecommender.recommendedBecause(...)); showing this to users
>> helps them understand why an item was recommended to them
>>
>> Second, I'd like Mahout to have a Map/Reduce implementation of the algorithm
>> described in Y. Zhou et al.: "Large-scale Parallel Collaborative Filtering
>> for the Netflix Prize" (http://bit.ly/cUPgqr).
>>
>> Here R is the matrix of ratings of users for movies, and each user and each
>> movie is projected onto a "feature" space (the number of features is fixed
>> beforehand) so that the product of the resulting matrices U and M is a
>> low-rank approximation/factorization of R.
>>
>> Determining U and M is modelled mathematically as an optimization problem,
>> and some regularization is applied to avoid overfitting to the known
>> entries. The problem is solved with an iterative approach called
>> alternating least squares (ALS).
>>
>> If I understand the paper correctly, this approach is easily parallelizable.
>> To estimate a user's feature vector you only need access to all of that
>> user's ratings and the feature vectors of all movies he/she rated. To
>> estimate a movie's feature vector you need access to all its ratings and to
>> the feature vectors of the users who rated it.
>>
>> An unknown preference can then be predicted by computing the dot product of
>> the corresponding user and movie feature vectors.
>>
>> It would be very nice if someone who is familiar with the paper or has the
>> time for a brief look into it could validate that, because I don't fully
>> trust my mathematical analysis.
>>
>> I have already created a first prototype implementation, but I definitely
>> need help from someone checking it conceptually, optimizing the math-related
>> parts, and helping me test it. Maybe that could be an interesting task for
>> the upcoming Mahout hackathon in Berlin.
>>
>> --sebastian
>>
>> PS: @isabel I won't make it to the dinner today, need to rehearse my
>> talk...
>>
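Sebastian's ALS description above can be sketched in a few lines. This is a hedged, single-machine illustration in pure Python (my own variable names, not his prototype or any Mahout API): each half-sweep solves the small regularized normal equations for one side while the other side's feature vectors are held fixed, which is exactly why the user step only needs a user's ratings plus the movie factors, and vice versa.

```python
import random

def solve(A, b):
    """Gauss-Jordan elimination; fine for the tiny k-by-k normal equations."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def als_update(side, fixed, k, lam):
    """Re-estimate one side's feature vectors while the other side is fixed.

    side:  {id: {other_id: rating}}  known ratings per user (or per movie)
    fixed: {other_id: feature list}  current factors of the other side
    Solves (sum_j m_j m_j^T + lam * n_i * I) u_i = sum_j r_ij m_j,
    the regularized least-squares step described in the Zhou et al. paper.
    """
    updated = {}
    for i, ratings in side.items():
        A = [[lam * len(ratings) if p == q else 0.0 for q in range(k)]
             for p in range(k)]
        b = [0.0] * k
        for j, r in ratings.items():
            m = fixed[j]
            for p in range(k):
                b[p] += r * m[p]
                for q in range(k):
                    A[p][q] += m[p] * m[q]
        updated[i] = solve(A, b)
    return updated

# Toy ratings (user -> {movie: rating}) and the transposed view.
by_user = {0: {0: 5.0, 1: 3.0}, 1: {0: 4.0, 2: 1.0}, 2: {1: 2.0, 2: 5.0}}
by_movie = {}
for u, rs in by_user.items():
    for i, r in rs.items():
        by_movie.setdefault(i, {})[u] = r

k, lam = 2, 0.05
random.seed(42)
U = {u: [random.random() for _ in range(k)] for u in by_user}
M = {i: [random.random() for _ in range(k)] for i in by_movie}
for _ in range(20):
    U = als_update(by_user, M, k, lam)   # user step
    M = als_update(by_movie, U, k, lam)  # movie step

def predict(u, i):
    """Unknown preferences are dot products of the feature vectors."""
    return sum(a * b for a, b in zip(U[u], M[i]))
```

In a Map/Reduce setting each `als_update` call becomes one job: a mapper only needs one row's ratings and the (broadcast) fixed-side factors, so the half-sweeps parallelize per user and per movie just as the mail argues.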
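Ted's mail earlier in the thread mentions SGD training. As a point of comparison with ALS, here is a hedged sketch of a plain stochastic-gradient update for the same regularized rating factorization. To be clear, this is only the generic SGD style of training, not the latent factor log-linear model of Menon and Elkan, which additionally handles side information; all names here are illustrative.

```python
import random

def sgd_epoch(ratings, U, M, lr=0.02, reg=0.02):
    """One pass of stochastic gradient descent on the squared-error
    objective with L2 regularization; updates U and M in place."""
    for u, i, r in ratings:
        err = r - sum(a * b for a, b in zip(U[u], M[i]))
        for f in range(len(U[u])):
            uf, mf = U[u][f], M[i][f]
            U[u][f] += lr * (err * mf - reg * uf)
            M[i][f] += lr * (err * uf - reg * mf)

# Toy (user, movie, rating) triples and random rank-2 factors.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
k = 2
random.seed(7)
U = {u: [random.uniform(0.1, 0.9) for _ in range(k)] for u in range(3)}
M = {i: [random.uniform(0.1, 0.9) for _ in range(k)] for i in range(3)}

def sse():
    """Squared error over the known entries only."""
    return sum((r - sum(a * b for a, b in zip(U[u], M[i]))) ** 2
               for u, i, r in ratings)

before = sse()
for _ in range(500):
    sgd_epoch(ratings, U, M)
after = sse()
```

Each update touches a single rating, which is where the reported training speed comes from; the trade-off versus ALS is that SGD is inherently sequential per rating, while the ALS half-sweeps decompose cleanly over rows.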
