Re: My ideas for GSoC 2010

Ankur C. Goel Mon, 22 Mar 2010 01:05:16 -0700

Since Sean already answered IDEA-2, I'll reply to IDEA 1.

MinHash (and shingling in general) are very efficient clustering techniques 
that search engines have traditionally employed for near-duplicate detection 
of web documents. They are known to be effective at web scale. They have also 
been applied successfully to "near-neighbor or similar ..." types of problems 
in recommendations, search, and various other domains on text, audio and image 
data. So I think they are quite "cool".
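
For readers new to the technique, here is a minimal sketch of shingling plus 
MinHash in Python. It is an illustration only, not the MAHOUT-344 code. The 
function names, the MD5-based seeded hash, and the defaults (3-word shingles, 
128 hash functions) are all assumptions made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed (illustrative choice)."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash value over the set."""
    if not items:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    return [min(seeded_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)
```

Two documents would be compared by shingling each one, computing a signature 
per shingle set, and calling estimate_jaccard on the two signatures. The 
signatures have a fixed length no matter how large the documents are, which is 
what makes the comparison cheap at scale.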


I have personally experimented with them in recommendations and found them to 
be more effective than I anticipated, especially when dealing with 
high-dimensional data. I have a basic implementation that I submitted to 
Mahout; see https://issues.apache.org/jira/browse/MAHOUT-344. You are welcome 
to work on it if you'd like.

Improving it and integrating it with our recommender would make for a good 
GSoC project, IMO.

Regards
-...@nkur

On 3/19/10 7:04 PM, "cristi prodan" <prodan.crist...@gmail.com> wrote:

Dear Mahout community,

My name is Cristi Prodan, I'm 23 years old and currently a 2nd year student 
pursuing an MSc degree in Computer Science.
I started studying machine learning in the past year and during my research I 
found out about the MapReduce model. Then I discovered Hadoop and Mahout. I was 
very impressed by the power of these frameworks and their great potential. For 
this reason I would like to submit a proposal for this year's Google Summer of 
Code competition.

I have looked at the proposals made by Robin on JIRA 
(https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hide&requestId=12314021).
I have settled on two ideas. I would like to ask for your help in deciding 
which idea would be best to pick. Since I've never done GSoC before, I'm hoping 
someone can advise on the size of the project (too small or too big for the 
summer period) and, mostly, its importance for the Mahout framework. After 
hearing your answers, I intend to focus fully on thorough research of a single 
idea.


IDEA 1 - MinHash clustering
---------------------------
The first idea came after taking a look at Google's paper on collaborative 
filtering for their news system [2]. In that paper, I looked at MinHash 
clustering.
My first question is: is MinHash clustering considered cool? If so, then I 
would like to take a stab at implementing it.
The paper also describes a MapReduce-style implementation. Since this is only 
a suggestion, I will not elaborate much on the solution now. I would like to 
ask you whether this might be considered a good choice (i.e. important for the 
framework to have something like this) and whether this is a big enough 
project.
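
To make the MapReduce framing concrete, here is a rough single-process sketch 
in Python of one way MinHash clustering can be split into map and reduce 
steps. It is not code from the paper or from Mahout. The function names, the 
MD5-based seeded hash, and the num_keys parameter (how many min-hash values 
are concatenated into a cluster id) are assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def seeded_hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed (illustrative choice)."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def map_user(user_id, clicked_items, num_keys=2):
    """Map step: emit (cluster_id, user_id) for one user's click history.

    The cluster id concatenates num_keys min-hash values, so two users share
    a cluster only when all of those min-hash values agree.
    """
    keys = [min(seeded_hash(seed, item) for item in clicked_items)
            for seed in range(num_keys)]
    cluster_id = "-".join(f"{key:016x}" for key in keys)
    return cluster_id, user_id


def reduce_clusters(pairs):
    """Reduce step: collect the user ids that were emitted under each cluster id."""
    clusters = defaultdict(list)
    for cluster_id, user_id in pairs:
        clusters[cluster_id].append(user_id)
    return dict(clusters)


def minhash_cluster(histories, num_keys=2):
    """Run the map and reduce steps in one process over {user_id: set_of_items}."""
    pairs = [map_user(user_id, items, num_keys)
             for user_id, items in sorted(histories.items())
             if items]  # users with no history cannot be hashed
    return reduce_clusters(pairs)
```

Raising num_keys makes clusters smaller and tighter, because more min-hash 
values have to match. On a real cluster, map_user would run once per input 
record and the grouping in reduce_clusters would be done by the framework's 
shuffle.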

IDEA 2 - Additions to Taste Recommender
---------------------------------------
My second idea for this competition is to add some capabilities to the 
Taste framework. I have reviewed a couple of papers from the Netflix contest 
winning teams, read chapters 1 through 6 of [1] and looked into Taste's code. 
My idea is to implement parallel prediction blending using linear regression 
or another machine learning method, but so far I haven't reached a clear 
solution. I'm preparing my dissertation on recommender systems, and this was 
the first idea I had when thinking about participating in GSoC. If you have 
any ideas on this and want to share them, I would be very thankful.
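
As one possible reading of "parallel prediction blending", here is a small 
Python sketch that fits least-squares blend weights for two predictors. It is 
not Taste code and not a worked-out design. The function names, the 
restriction to two predictors, and the absence of an intercept term are 
assumptions made to keep the example short. The sums needed by the normal 
equations are additive, so each data partition can compute them independently 
and the results can be merged afterwards.

```python
def partial_sums(rows):
    """Map step: accumulate normal-equation sums for one partition.

    Each row is (prediction_a, prediction_b, actual_rating).
    """
    saa = sab = sbb = say = sby = 0.0
    for a, b, y in rows:
        saa += a * a
        sab += a * b
        sbb += b * b
        say += a * y
        sby += b * y
    return (saa, sab, sbb, say, sby)


def merge_sums(left, right):
    """Reduce step: sums from different partitions simply add up."""
    return tuple(l + r for l, r in zip(left, right))


def solve_weights(sums):
    """Solve the 2x2 normal equations for the blend weights (w_a, w_b)."""
    saa, sab, sbb, say, sby = sums
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("predictors are collinear; weights are not unique")
    w_a = (say * sbb - sby * sab) / det
    w_b = (sby * saa - say * sab) / det
    return w_a, w_b


def blend(w_a, w_b, prediction_a, prediction_b):
    """Blended prediction is the weighted sum of the two base predictions."""
    return w_a * prediction_a + w_b * prediction_b
```

The weights would be fitted on held-out ratings by calling partial_sums on 
each partition, folding the results together with merge_sums, and passing the 
total to solve_weights. Extending this to more predictors would mean 
accumulating a full matrix of products and solving a larger linear system.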

Thank you in advance.

Best regards,
Cristi.

BIBLIOGRAPHY:
---------------
[1] Sean Owen, Robin Anil - Mahout in Action. Manning, 2010.

[2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News 
Personalization: Scalable Online Collaborative Filtering, WWW 2007.
