For Decision Forests, my goal for 0.5 is to add a 'full'
implementation, meaning one that can build random
forests using the whole dataset, even if it's split among many
machines. I found the following paper very interesting:
http://www.cba.ua.edu/~mhardin/rainforest.pdf
although the approach it describes doesn't work as is for numerical attributes.

The implementation should at least work for the following dataset:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2304&categoryID=248
it's 50 GB, and a small subset is available in UCI. It contains only
categorical attributes, and it's big enough to be a good candidate.

On another note, my svn password has not been restored yet, so I am
more of a contributor than a committer =P

On Mon, Oct 4, 2010 at 3:11 PM, Ted Dunning <[email protected]> wrote:
> My own feeling is that we need to get some sort of recommender that supports
> side information, possibly also as a classifier.
>
> As everybody knows, I have been lately quite enamored of Menon and Elkan's
> paper on Latent Factor Log-Linear models.  It seems
> to subsume most other factorization methods and supports side data very
> naturally.  Training is reportedly very fast using SGD
> techniques.
>
> The paper is here: http://arxiv.org/abs/1006.2156
>
> On Mon, Oct 4, 2010 at 7:03 AM, Sebastian Schelter <[email protected]> wrote:
>
>> Hi,
>>
>> The amount of work currently being put into finishing 0.4 is amazing, I can
>> hardly follow all the mails, very cool to see that. I've had some time today
>> to write down ideas for features I have for version 0.5 and want to share them
>> here for feedback.
>>
>> First I can think of possible new features for RecommenderJob
>>
>>  * add an option that makes the RecommenderJob use the output of the related
>>    o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob instead of computing
>>    the similarities again each time, this will give users the possibility to
>>    choose the interval in which to precompute the item similarities
>>
>>  * add an option to make the RecommenderJob include "recommended because of"
>>    items to each recommended item (analogous to what is already available at
>>    GenericItemBasedRecommender.recommendedBecause(...)), showing this to users
>>    helps them understand why some item was recommended to them
>>
>>
>> Second I'd like Mahout to have a Map/Reduce implementation of the algorithm
>> described in Y. Zhou et al.: "Large-scale Parallel Collaborative Filtering
>> for the Netflix Prize" (http://bit.ly/cUPgqr).
>>
>> Here R is the matrix of ratings of users towards movies and each user and
>> each movie is projected on a "feature" space (the number of features is
>> defined before) so that the product of the resulting matrices U and M is a
>> low-rank approximation/factorization of R.
>>
>> Determining U and M is mathematically modelled as an optimization problem
>> and additionally some regularization is applied to avoid overfitting to the
>> known entries. This problem is solved with an iterative approach called
>> alternating least squares (ALS).
>>
>> If I understand the paper correctly this approach is easily parallelizable.
>> In order to estimate a user feature vector you only need access to all of his/her
>> ratings and the feature vectors of all movies he/she rated. To estimate a
>> movie feature vector you need access to all its ratings and to the feature
>> vectors of the users who rated it.
>>
>> An unknown preference can then be predicted by computing the dot product of
>> the corresponding user and movie feature vectors.
>>
>> Would be very nice if someone who is familiar with the paper or has the
>> time for a brief look into it could validate that, cause I don't fully trust
>> my mathematical analysis.
>>
>> I already created a first prototype implementation but I definitely need
>> help from someone checking it conceptually, optimizing the math related
>> parts and helping me test it. Maybe that could be an interesting task for the
>> upcoming Mahout hackathon in Berlin.
>>
>> --sebastian
>>
>> PS: @isabel I won't make it to the dinner today, need to rehearse my
>> talk...
>>
>

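For reference, the alternating scheme described in Sebastian's quoted message can be sketched on a single machine as below. This is only an illustrative sketch, not the prototype he mentions and not Mahout code: the function names (als, update, solve, predict), the plain lambda * I regularization, the random initialization and the default parameters are all assumptions made for the example, and the exact regularization weighting used in the paper may differ. The point it shows is that each user vector depends only on that user's ratings and the vectors of the movies he/she rated (and vice versa), which is what makes the per-row updates independent of each other.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def update(ratings_by_row, other, k, lam):
    """Re-estimate one side's feature vectors with the other side held fixed.

    ratings_by_row maps an id to a list of (other_id, rating) pairs.
    Each row is solved independently of all other rows.
    """
    result = {}
    for row_id, rated in ratings_by_row.items():
        # Regularized normal equations: (sum v v^T + lam * I) x = sum rating * v
        A = [[lam if i == j else 0.0 for j in range(k)] for i in range(k)]
        b = [0.0] * k
        for other_id, rating in rated:
            v = other[other_id]
            for i in range(k):
                b[i] += rating * v[i]
                for j in range(k):
                    A[i][j] += v[i] * v[j]
        result[row_id] = solve(A, b)
    return result

def als(ratings, k=2, lam=0.1, iterations=20, seed=0):
    """ratings is a list of (user, movie, rating) triples; k is the number of features."""
    rng = random.Random(seed)
    by_user, by_movie = {}, {}
    for user, movie, r in ratings:
        by_user.setdefault(user, []).append((movie, r))
        by_movie.setdefault(movie, []).append((user, r))
    # Start from random movie vectors, then alternate between the two sides.
    M = {m: [rng.random() for _ in range(k)] for m in by_movie}
    U = {}
    for _ in range(iterations):
        U = update(by_user, M, k, lam)   # fix M, solve for every user
        M = update(by_movie, U, k, lam)  # fix U, solve for every movie
    return U, M

def predict(U, M, user, movie):
    """Predicted preference: dot product of the user and movie feature vectors."""
    return sum(x * y for x, y in zip(U[user], M[movie]))
```

Calling als(ratings) on a list of (user, movie, rating) triples returns the two sets of feature vectors, and predict(U, M, user, movie) gives the estimate for a pair that has no rating yet. In a Map/Reduce setting the loop body of update() is the part that would be distributed, since every row only needs its own ratings plus the matching vectors from the other side.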