[ https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Simon Just updated MAHOUT-371:
--------------------------------------

    Description: 
Proposal Title: [MAHOUT-371] Proposal to implement Distributed SVD++ 
Recommender using Hadoop

Student Name: Richard Simon Just

Student E-mail: i...@richardsimonjust.co.uk

Organization/Project: Apache Mahout

Assigned Mentor:


Proposal Abstract: 
During the Netflix Prize Challenge one of the most popular forms of
recommender algorithm was matrix factorisation, in particular Singular Value
Decomposition (SVD). As such, this proposal looks to implement a distributed
version of one of the most successful SVD-based recommender algorithms from
the Netflix competition: the SVD++ algorithm.
SVD++ improves upon other basic SVD algorithms by incorporating implicit
feedback[1]. That is to say, it is able to take into account not just
explicit user preferences, but also feedback such as, in the case of a
company like Netflix, whether a movie has been rented. Implicit feedback
assumes that the existence of some correlation between the user and the item
is more important than whether the correlation is positive or negative.
Implicit feedback would account for an item having been rated, but not what
the rating was.
The implementation will include testing, in-depth documentation and a
demo/tutorial. If there is time, I will also look to develop the algorithm
into the timeSVD++ algorithm[2,3]. timeSVD++ further improves the results of
the SVD algorithm by taking temporal dynamics into account. Temporal
dynamics addresses the way users' preferences in items, and their behaviour
in how they rate items, can change over time. According to [2], the gains in
accuracy from implementing timeSVD++ are significantly bigger than the gains
from going from SVD to SVD++.
The overall project will provide three deliverables:
1. The basic framework for a distributed SVD-based recommender
2. A distributed SVD++ implementation and demo
3. A distributed timeSVD++ implementation



Detailed Description:

The SVD++ algorithm uses the principle of categorising users and items into
factors, combined with regularisation and implicit feedback, to predict how
well a user is likely to match an item. Factors are abstract categories
derived from the data presented. Factor values are weights expressing how
strongly each user/item relates to that category. For example, with the
Netflix data a factor could loosely correspond to a movie genre, a director
or a story line. The more factors used, the more detailed the categories are
likely to become, and thus the more accurate the predictions are likely to
become.

Implicit feedback is based on the theory that a user having any sort of
relationship to an item is more important than whether they rated it, or
what rating they gave. The assumption is that even if a user does not like
an item, or has not rated it, the very fact that they chose to interact with
it indicates something about their preferences. In the Netflix case this
would be represented by whether a user has rated a movie or not; it could
also take into account the user's rental history.
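
For concreteness, the prediction rule this implementation would target is
the SVD++ rule from [1]:

    \hat{r}_{ui} = \mu + b_u + b_i + q_i^T \left( p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j \right)

where \mu is the global mean rating, b_u and b_i are the user and item
biases, p_u and q_i are the user and item factor vectors, N(u) is the set of
items for which user u has shown implicit feedback, and the y_j are item
factors that capture that feedback.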

As well as the actual implementation of the code, the project has two other
deliverable focuses: the readability, documentation and testing of the code;
and a full tutorial and demo of the code. Without these, the implementation
would not be complete or fully usable.

The recommender consists of five main parts: the job class that runs the
code, an input conversion section, a training section, a re-trainer section
and a prediction section. A brief overview of these sections follows.


The SVD++ Classes:

The Recommender Job class:
This class is the foundation of the recommender and allows it to run on
Hadoop by implementing the Tool interface through AbstractJob. This class
will parse any user arguments and set up the jobs that will run the
algorithm on Map/Reduce, much in the same way as Mahout's other distributed
recommenders do, such as RecommenderJob.
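
As a rough sketch of what that driver might look like (shown here with the
plain Hadoop Tool/ToolRunner API for brevity; the real class would extend
Mahout's AbstractJob, and the job wiring below is an assumption, not a final
design):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Hypothetical driver shape; in Mahout it would extend AbstractJob.
    public class SVDppRecommenderJob extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
        // Argument parsing elided: args[0] = input prefs, args[1] = output.
        Job inputJob = new Job(getConf(), "svd++ input conversion");
        inputJob.setJarByClass(SVDppRecommenderJob.class);
        FileInputFormat.addInputPath(inputJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(inputJob, new Path(args[1]));
        // Mapper/Reducer and output types would be configured here; the
        // training and prediction jobs would then be chained the same way.
        return inputJob.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(),
                                   new SVDppRecommenderJob(), args));
      }
    }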


Input Mapper/Reducer Classes:
The Mapper will take the input data and convert it to key/value pairs in the
form of Hadoop Writables. The Reducer will take the Mapper's Writables and
create sparse vectors. This is in keeping with Mahout's ToItemPrefsMapper
and ToUserVectorReducer.
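
A simplified sketch of that conversion, assuming comma-separated
"userID,itemID,rating" input; the class names and exact types below are
illustrative rather than Mahout's own:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Maps "userID,itemID,rating" lines to (userID, "itemID:rating") pairs.
    class PrefsMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] tokens = line.toString().split(",");
        ctx.write(new LongWritable(Long.parseLong(tokens[0])),
                  new Text(tokens[1] + ':' + tokens[2]));
      }
    }

    // Collects one user's preferences into a single sparse vector,
    // indexed by item ID.
    class UserVectorReducer
        extends Reducer<LongWritable, Text, LongWritable, VectorWritable> {
      @Override
      protected void reduce(LongWritable userID, Iterable<Text> prefs,
                            Context ctx) throws IOException, InterruptedException {
        Vector userVector = new RandomAccessSparseVector(Integer.MAX_VALUE);
        for (Text pref : prefs) {
          String[] tokens = pref.toString().split(":");
          userVector.set(Integer.parseInt(tokens[0]),
                         Double.parseDouble(tokens[1]));
        }
        ctx.write(userID, new VectorWritable(userVector));
      }
    }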

Training Mapper/Reducer Classes:
This phase creates the factor matrices that will be used to create the
predictions later. It does this by predicting a known rating using the SVD,
comparing the prediction against the known rating, calculating the error,
and updating the factors accordingly. The Mapper will loop through these
training updates to the factor matrices. The Reducer will collect the
Mapper's outputs to create the dense factor matrices.
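
The per-rating update inside that loop would look roughly as follows, a
sketch of the stochastic gradient descent step from [1]; all class, field
and constant names here are assumptions:

    // Sketch of the SVD++ SGD training step, following [1].
    class SVDppModel {
      double gamma = 0.002;          // learning rate (example value)
      double lambda = 0.04;          // regularisation constant (example value)
      int numFactors;                // number of latent factors
      double mu;                     // global mean rating
      double[] userBias, itemBias;   // b_u, b_i
      double[][] userFactors;        // p_u
      double[][] itemFactors;        // q_i
      double[][] y;                  // implicit-feedback item factors y_j
      int[][] userItems;             // N(u): item IDs user u interacted with

      // One SGD step for a single known rating r(u,i). During training
      // N(u) is non-empty, since the rated item itself belongs to N(u).
      void trainOnRating(int u, int i, double rating) {
        double err = rating - predict(u, i);
        userBias[u] += gamma * (err - lambda * userBias[u]);
        itemBias[i] += gamma * (err - lambda * itemBias[i]);
        double norm = 1.0 / Math.sqrt(userItems[u].length); // |N(u)|^{-1/2}
        for (int f = 0; f < numFactors; f++) {
          double puf = userFactors[u][f];
          double qif = itemFactors[i][f];
          double implicitSum = 0.0;  // sum over j in N(u) of y_j[f]
          for (int j : userItems[u]) {
            implicitSum += y[j][f];
          }
          userFactors[u][f] += gamma * (err * qif - lambda * puf);
          itemFactors[i][f] +=
              gamma * (err * (puf + norm * implicitSum) - lambda * qif);
          for (int j : userItems[u]) {
            y[j][f] += gamma * (err * norm * qif - lambda * y[j][f]);
          }
        }
      }

      // predict(u, i) is filled in under the prediction section below.
      double predict(int u, int i) { return 0.0; /* see below */ }
    }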

Re-trainer Mapper/Reducer Classes:
This is a separate job that would be used to update the factor matrices taking 
into account any new user ratings. The Mapper/Reducer follow a similar pattern 
to the training section.

Prediction Mapper/Reducer Classes:
This job would take the factor matrices from the training phase and use them
in the SVD to create the required predictions. Most likely this would be
done for all user/item preferences, to create a dense matrix of all
user/item combinations; potentially, however, it could also be done for
individual users. Just like training, the Mapper would handle the
predictions and the Reducer would collect them into a dense matrix.
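
The prediction itself is just the SVD++ rule given earlier applied to the
learned factors; this sketch fills in the predict() stub from the training
sketch above (same assumed field names):

    // Estimate r(u,i) with the SVD++ rule from [1].
    double predict(int u, int i) {
      // Guard against users with no implicit feedback at prediction time.
      double norm = userItems[u].length > 0
          ? 1.0 / Math.sqrt(userItems[u].length) : 0.0;
      double dot = 0.0;
      for (int f = 0; f < numFactors; f++) {
        double implicitSum = 0.0;
        for (int j : userItems[u]) {
          implicitSum += y[j][f];
        }
        dot += itemFactors[i][f] * (userFactors[u][f] + norm * implicitSum);
      }
      return mu + userBias[u] + itemBias[i] + dot;
    }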


Timeline:

The Warm Up/Bonding Period (<=May 23rd):
- familiarise myself further with Mahout and Hadoop's code base and 
documentation
- discuss the proposal, design and implementation with the community, as
well as any related code tests, optimisations and documentation they would
like to see incorporated into the project
- build a more detailed design of the algorithm implementation and tweak
the timeline based on feedback
- familiarise myself more with unit testing
- finish building 3-4 node Hadoop cluster and play with all the examples

Week 1 (May 24th-30th):
- start writing the backbone of the code in the form of comments and
skeleton code
- implement SVDppRecommenderJob
- start to integrate DistributedLanczosSolver

Week 2 (May 31st - June 6th):
- complete DistributedLanczosSolver integration
- start implementing distributed training, prediction and regularisation

Week 3 - 5 (June 7th - 27th):
- complete implementation of distributed training, prediction and regularisation
- work on testing, cleaning up code, and tying up any loose documentation ends
- work on any documentation, tests and optimisation requested by the community
- Deliverable: basic framework for a distributed SVD-based recommender

Week 6 - 7 (June 28th-July 11th):
- start implementation of SVD++ (keeping documentation and tests up-to-date)
- prepare demo

Week 8 (July 12th - 18th): Mid-Term Report by the 16th
- complete SVD++ and iron out bugs
- implement and document demo
- write wiki articles and tutorial for what has been implemented including the 
demo

Week 9 (July 19th - 25th):
- work on any documentation, tests and optimisation requested by the
community during the project
- work on testing, cleaning up code, and tying up any loose documentation ends
- Deliverable: Distributed SVD++ Recommender (including demo)

Week 10 - 11 (July 26th - Aug 8th):
- incorporate temporal dynamics
- write temporal dynamics documentation, including wiki article

Week 12 (Aug 9th - 15th): Suggested Pencils Down
- last optimisation and tidy up of code, documentation, tests and demo
- discuss with community what comes next, consider what JIRA issues to 
contribute to 
- Deliverable: Distributed SVD++ Recommender with temporal dynamics

Final Evaluations Hand-in: Aug 16th-20th. 



Additional Information:

I am currently a final-year undergraduate at the University of Reading, UK,
studying for a BSc in Applied Computer Science. As part of this degree I did
a 9-month industrial placement programming in Java for an international
non-profit organisation. In this final year of study I have been fortunate
enough to take modules in Distributed Computing, Evolutionary Computation
and Concurrent Systems, which have proven to be the best of my University
career. Each module required me to design and implement algorithms of either
a distributed or machine learning nature [A], two in Java, one in C. For my
final-year project I have implemented a Gaming Persistence Platform in
ActionScript, PHP and MySQL. It allows the user to play computer games
integrated with the platform from different devices and/or different
software platforms while retaining game progress across all instances. I am
looking forward to the day when mobile computing goes distributed.

I am a passionate user of open source software and now I am looking to become 
involved as a developer. Having enjoyed my Distributed Computing and 
Evolutionary Computation modules so much, and after reading all the 
introductory pages about the ASF I realised Mahout would be a great place to 
start. After graduation (and GSoC) I hope to continue contributing to Mahout 
while working in a related field. 

Outside of uni, when I'm not coding, I like to push my limits through 
meditation, mountaineering and running. I also love to cook!

After my last exam on May 21st my only commitment over the summer will be GSoC. 
As such GSoC will be treated like a full-time job.

[1] - Y. Koren, "Factorization Meets the Neighborhood: a Multifaceted
Collaborative Filtering Model", ACM Press, 2008,
http://public.research.att.com/~volinsky/netflix/kdd08koren.pdf
[2] - Y. Koren, "Collaborative Filtering with Temporal Dynamics", ACM Press,
2009, http://research.yahoo.com/files/kdd-fp074-koren.pdf
[3] - R. M. Bell, Y. Koren, and C. Volinsky, "The BellKor 2008 Solution to the
Netflix Prize", http://www.netflixprize.com/assets/ProgressPrize2008_BellKor.pdf

[A] - Example of code I've implemented as part of University degree, 
http://www.richardsimonjust.com/code/

Basic proposal replaced with proposal as submitted to GSoC.

Any feedback would be greatly appreciated so that I can continue to tweak
the proposal.

Many Thanks
Richard
