[GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop
-----------------------------------------------------------------------

                 Key: MAHOUT-371
                 URL: https://issues.apache.org/jira/browse/MAHOUT-371
             Project: Mahout
          Issue Type: New Feature
          Components: Collaborative Filtering
            Reporter: Richard Simon Just

*****Basic Proposal - just to let you know what I have in mind. Will add more detail as to actual implementation and some background information about myself later today*****

Title: Proposal to implement Distributed SVD++ Recommender using Hadoop [addresses MAHOUT-329]

Student: Richard Simon Just

Basic Proposal:

During the Netflix Prize Challenge, one of the most popular forms of recommender algorithm was matrix factorisation, in particular Singular Value Decomposition (SVD). This proposal therefore looks to implement a distributed version of one of the most successful SVD-based recommender algorithms from the Netflix competition, namely the SVD++ algorithm.

SVD++ improves upon other basic SVD algorithms by incorporating implicit feedback [1]. That is to say, it is able to take into account not just explicit user preferences but also feedback such as, in the case of a company like Netflix, whether a movie has been rented. Implicit feedback assumes that the existence of some correlation between the user and the item is more important than whether that correlation is positive or negative. For example, implicit feedback would account for an item having been rated, but not for what the rating was.

The implementation will include testing, in-depth documentation and a demo/tutorial. If there is time, I will also look to develop the algorithm into the timeSVD++ algorithm [2]. The timeSVD++ algorithm further improves the results of the SVD algorithm by taking into account temporal dynamics, that is, the way users' preferences for items and their rating behaviour can change over time. According to [2], the gains in accuracy from implementing timeSVD++ are significantly bigger than the gains from going from SVD to SVD++.

The overall project will provide three deliverables:
1. The basic framework for a distributed SVD-based recommender
2. A distributed SVD++ implementation
3. A distributed timeSVD++

Timeline:

The Warm Up/Bonding Period (<=May 23rd):
- familiarise myself further with Mahout and Hadoop's code base and documentation
- discuss with the community the proposal, design and implementation, as well as related code tests, optimisations and documentation they would like to see incorporated into the project
- build a more detailed design of the algorithm implementation and tweak the timeline based on feedback
- familiarise myself more with unit testing
- finish building a 3-4 node Hadoop cluster and play with all the examples

Week 1 (May 24th - 30th):
- start writing the backbone of the code in the form of comments and skeleton code
- implement SVDppRecommenderJob
- start to integrate DistributedLanczosSolver

Week 2 (May 31st - June 6th):
- complete DistributedLanczosSolver integration
- start implementing distributed training, prediction and regularisation

Week 3 - 5 (June 7th - 27th):
- complete implementation of distributed training, prediction and regularisation
- work on testing, cleaning up code, and tying up any loose documentation ends
- work on any documentation, tests and optimisation requested by the community
- Deliverable: basic framework for distributed SVD-based recommender

Week 6 - 7 (June 28th - July 11th):
- start implementation of SVD++ (keeping documentation and tests up-to-date)
- prepare demo

Week 8 (July 12th - 18th): Mid-Term Report by the 16th
- complete SVD++ and iron out bugs
- implement and document demo
- write wiki articles and a tutorial for what has been implemented, including the demo

Week 9 (July 19th - 25th):
- work on any documentation, tests and optimisation requested by the community during the project
- work on testing, cleaning up code, and tying up any loose documentation ends
- Deliverable: Distributed SVD++ Recommender (including Demo)

Week 10 - 11 (July 26th - Aug 8th):
- incorporate temporal dynamics
- write temporal dynamics documentation, including wiki article

Week 12 (Aug 9th - 15th): Suggested Pencils Down
- last optimisation and tidy up of code, documentation, tests and demo
- discuss with the community what comes next, consider what JIRA issues to contribute to
- Deliverable: Distributed SVD++ Recommender with temporal dynamics

Final Evaluations Hand-in: Aug 16th - 20th.

References:

[1] - Y. Koren, "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model", ACM Press, 2008, http://public.research.att.com/~volinsky/netflix/kdd08koren.pdf
[2] - Y. Koren, "Collaborative Filtering with Temporal Dynamics", ACM Press, 2009, http://research.yahoo.com/files/kdd-fp074-koren.pdf
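
To illustrate the implicit-feedback idea described in the Basic Proposal, here is a minimal single-machine sketch of the SVD++ prediction rule from [1]: the predicted rating is the global mean plus the user and item biases, plus the dot product of the item factors with the user factors, after the user factors have been shifted by the normalised sum of the factor vectors of every item the user gave implicit feedback on. The class name, method name and parameter names are hypothetical and are not part of Mahout. The sketch covers prediction only, not training, regularisation or the distributed Hadoop implementation proposed above.

```java
public final class SvdPlusPlusSketch {

    private SvdPlusPlusSketch() {}

    /**
     * SVD++ prediction: mu + bu + bi + qi . (pu + |N(u)|^(-1/2) * sum of y_j).
     *
     * @param mu              global mean rating
     * @param bu              bias of user u
     * @param bi              bias of item i
     * @param pu              explicit factor vector of user u
     * @param qi              factor vector of item i
     * @param implicitFactors one vector y_j per item j in N(u), the items
     *                        for which user u gave implicit feedback
     */
    public static double predict(double mu, double bu, double bi,
                                 double[] pu, double[] qi,
                                 double[][] implicitFactors) {
        int k = pu.length;
        double[] userVector = pu.clone();

        // Shift the user factors by the normalised sum of implicit factors.
        // Which items were touched matters here, not how they were rated.
        if (implicitFactors.length > 0) {
            double norm = 1.0 / Math.sqrt(implicitFactors.length);
            for (double[] yj : implicitFactors) {
                for (int f = 0; f < k; f++) {
                    userVector[f] += norm * yj[f];
                }
            }
        }

        double dot = 0.0;
        for (int f = 0; f < k; f++) {
            dot += qi[f] * userVector[f];
        }
        return mu + bu + bi + dot;
    }

    public static void main(String[] args) {
        // Made-up illustrative numbers, two latent factors.
        double[] pu = {0.1, 0.2};
        double[] qi = {0.5, -0.5};
        double[][] y = {{0.2, 0.0}, {0.0, 0.2}, {0.2, 0.2}, {0.0, 0.0}};
        System.out.println(predict(3.6, 0.2, -0.1, pu, qi, y));
    }
}
```

With an empty implicit-feedback set the rule reduces to a plain biased matrix-factorisation prediction, which is one way to see what SVD++ adds on top of a basic SVD recommender.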