Since Sean already answered IDEA 2, I'll reply to IDEA 1. MinHash (and shingling in general) are very efficient clustering techniques that search engines have traditionally employed for near-duplicate detection of web documents. They are known to be effective at web scale. They have also been applied successfully to "near-neighbor or similar ..." types of problems in recommendations, search, and various other domains on text, audio and image data. So I think they are quite "cool".
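For readers new to the technique, here is a minimal sketch of how MinHash signatures approximate set similarity. It is an illustration only, not Mahout's implementation: the function names, the choice of 200 hash functions, and the linear hash family over a Mersenne prime are all assumptions made for the example. The fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets.

```python
import hashlib
import random

# Hypothetical parameters chosen for this sketch only.
PRIME = (1 << 61) - 1  # a Mersenne prime used as the hash modulus


def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def base_hash(item):
    """Stable 64-bit hash of a string (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")


def make_hash_params(num_hashes, seed=42):
    """Draw (a, b) pairs defining hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def minhash_signature(items, params):
    """For each hash function, keep the minimum hash value over a non-empty set."""
    hashed = [base_hash(x) % PRIME for x in items]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = shingles("the quick brown fox jumps over the lazy dog near the river")
    doc_b = shingles("the quick brown fox jumps over the lazy cat near the river")
    params = make_hash_params(200)
    sig_a = minhash_signature(doc_a, params)
    sig_b = minhash_signature(doc_b, params)
    print("exact:", jaccard(doc_a, doc_b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

The signatures have a fixed length regardless of how large the original sets are, which is what makes the approach attractive for high-dimensional data. For clustering, one common approach is to group items whose signatures share some of their values, so that similar items tend to land in the same group.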
I have personally experimented with them in recommendations and found them more effective than I anticipated, especially when dealing with high-dimensional data. I have a basic implementation that I submitted to Mahout; see https://issues.apache.org/jira/browse/MAHOUT-344. You are welcome to work on it if you'd like. Improving it and integrating it with our recommender would make for a good GSoC project, IMO.

Regards
-...@nkur

3/19/10 7:04 PM, "cristi prodan" <prodan.crist...@gmail.com> wrote:

Dear Mahout community,

My name is Cristi Prodan, I'm 23 years old and currently a 2nd year student pursuing an MSc degree in Computer Science. I started studying machine learning in the past year, and during my research I found out about the MapReduce model. Then I discovered Hadoop and Mahout. I was very impressed by the power of these frameworks and their great potential. For this reason I would like to submit a proposal for this year's Google Summer of Code competition.

I have looked at the proposals made by Robin on JIRA (https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hide&requestId=12314021). I have settled on two ideas, and I would like to ask for your help in deciding which one would be best to pick. Since I've never done GSoC before, I'm hoping someone can advise on the size of the project (too small or too big for the summer period) and, mostly, on its importance for the Mahout framework. After hearing your answers, I intend to focus fully on thorough research of a single idea.

IDEA 1 - MinHash clustering
---------------------------
The first idea came after taking a look at Google's paper on collaborative filtering for their news system [2]. In that paper, I looked at MinHash clustering. My first question is: is MinHash clustering considered cool? If yes, then I would like to take a stab at implementing it. The paper also describes the implementation in a MapReduce style.
Since this is only a suggestion, I will not elaborate very much on the solution now. I would like to ask you whether this might be considered a good choice (i.e. important for the framework to have something like this) and whether this is a big enough project.

IDEA 2 - Additions to Taste Recommender
---------------------------------------
My second idea for this competition is to add some capabilities to the Taste framework. I have reviewed a couple of papers from the Netflix contest winning teams, read chapters 1 through 6 of [1] and looked into Taste's code. My idea is to implement parallel prediction blending support using linear regression or another machine learning method, but so far I have not reached a point where I have a clear solution for this. I'm preparing my dissertation on recommender systems, and this was the first idea I had when thinking about participating in GSoC. If you have any ideas on this and want to share them, I would be very thankful.

Thank you in advance.

Best regards,
Cristi.

BIBLIOGRAPHY:
---------------
[1] Owen, Anil - Mahout in Action. Manning, 2010.
[2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News Personalization: Scalable Online Collaborative Filtering, WWW 2007.