Re: New to Mahout with GSoC ambitions

2010-03-22 Thread Ankur C. Goel
I haven't really started any coding for the integration but was planning to 
this week. If a GSOC student is interested in taking over, I'll be happy to 
help.

We already have NB, LDA and SVD, so instead of coming up with yet another 
probabilistic model, a good addition would be to take the existing fully 
distributed LDA and SVD implementations in Mahout and apply them to 
recommendations, IMHO.

A solid, fully distributed implementation of Restricted Boltzmann Machines 
(RBMs) would make for a superb GSoC project and would be quite challenging.

-...@nkur

3/19/10 5:50 PM, "Sean Owen"  wrote:

+mahout-user

From a recommender perspective I can think of three worthwhile projects:

1. Combine the two co-occurrence-based distributed recommenders in the
code now. They take slightly different approaches. Ankur's working on
this but might give it over to a GSoC student. This is probably 1/2
the size of a proper GSoC project.

2. Add a fully distributed slope-one recommender. Part of the
computation is already distributed. Efficiently distributing the rest
is interesting. Also not so hard: I'd judge this is 1/2 a GSoC
project. (A sketch of the core slope-one idea follows this list.)

3. Implement a probabilistic model-based recommender of any kind,
distributed or non-distributed. This is probably a whole GSoC project.
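
To make (2) concrete, here is a minimal in-memory sketch of the slope-one
idea -- my own illustration, not Taste's classes: learn the average rating
difference between each pair of items, then predict from those diffs. The
pairwise-diff accumulation is the half that parallelizes naturally.

import java.util.HashMap;
import java.util.Map;

public class SlopeOneSketch {

  // diffs[i][j] = { sum of (rating_i - rating_j), co-rating count }
  private final Map<Long, Map<Long, double[]>> diffs =
      new HashMap<Long, Map<Long, double[]>>();

  // Phase 1 (the naturally distributable part): accumulate pairwise
  // rating differences from one user's ratings.
  public void addUser(Map<Long, Double> ratings) {
    for (Map.Entry<Long, Double> a : ratings.entrySet()) {
      for (Map.Entry<Long, Double> b : ratings.entrySet()) {
        if (a.getKey().equals(b.getKey())) {
          continue;
        }
        Map<Long, double[]> row = diffs.get(a.getKey());
        if (row == null) {
          row = new HashMap<Long, double[]>();
          diffs.put(a.getKey(), row);
        }
        double[] sumAndCount = row.get(b.getKey());
        if (sumAndCount == null) {
          sumAndCount = new double[2];
          row.put(b.getKey(), sumAndCount);
        }
        sumAndCount[0] += a.getValue() - b.getValue();
        sumAndCount[1]++;
      }
    }
  }

  // Phase 2: predict a rating for 'item' as the count-weighted average
  // of (known rating + average diff) over the user's rated items.
  public double predict(Map<Long, Double> ratings, long item) {
    Map<Long, double[]> row = diffs.get(item);
    if (row == null) {
      return Double.NaN;
    }
    double weightedSum = 0.0;
    double totalCount = 0.0;
    for (Map.Entry<Long, Double> r : ratings.entrySet()) {
      double[] sc = row.get(r.getKey());
      if (sc == null) {
        continue;
      }
      weightedSum += (r.getValue() + sc[0] / sc[1]) * sc[1];
      totalCount += sc[1];
    }
    return totalCount == 0.0 ? Double.NaN : weightedSum / totalCount;
  }
}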

On Fri, Mar 19, 2010 at 11:45 AM, RSJ  wrote:
> Hey there,
>
> My name is Richard Just, I'm a final year BSc Applied Computer Science
> student at Reading University, UK, with a strong focus on programming.
> I'm just finishing up a term that included modules in Distributed
> Computing and Evolutionary Computation, which have been the greatest
> modules of my uni career by far. Between that, my love for open source
> and having read about the ASF, I'm really interested in taking part in
> GSoC with an ASF project, namely Mahout. I'm really taken by the ethos
> behind the ASF as a whole and I'm hoping that taking part in GSoC will
> be the start of my long term involvement with ASF projects.
>
> My main programming background is Java, and I did a 9 month placement
> programming in it for a non-profit organisation last year. From that
> placement I gained a love and appreciation for well commented, well
> documented code, while from my time at university I now have a passion
> for well designed code and the time it saves.
>
> With GSoC, I've read through the suggested Mahout projects so far, and I
> think implementing an algorithm is probably my best bet. I say that
> because I don't have much Mahout experience yet, but through multiple
> University modules I do have experience designing and implementing
> algorithms. With that in mind and given that there is already a
> Classifier proposal, I was thinking either a Cluster or Recommendation
> algorithm.
>
> I'd be very interested in hearing if there are any particular Clustering
> algorithms or particular elements of the top Netflix team solutions
> people would like to see implemented?
>
> Many thanks for reading this
> RSJ
>



[jira] Created: (MAHOUT-344) Minhash based clustering

2010-03-22 Thread Ankur (JIRA)
Minhash based clustering 
-

 Key: MAHOUT-344
 URL: https://issues.apache.org/jira/browse/MAHOUT-344
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Reporter: Ankur


Minhash clustering performs probabilistic dimension reduction of 
high-dimensional data. The essence of the technique is to hash each item 
using multiple independent hash functions such that the probability of 
collision of similar items is higher. Multiple such hash tables can then be 
constructed to answer near-neighbor queries efficiently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-344) Minhash based clustering

2010-03-22 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated MAHOUT-344:
-

Affects Version/s: 0.3
 Assignee: Ankur

> Minhash based clustering 
> -
>
> Key: MAHOUT-344
> URL: https://issues.apache.org/jira/browse/MAHOUT-344
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.3
>Reporter: Ankur
>Assignee: Ankur
>
> Minhash clustering performs probabilistic dimension reduction of high 
> dimensional data. The essence of the technique is to hash each item using 
> multiple independent hash functions such that the probability of collision of 
> similar items is higher. Multiple such hash tables can then be constructed  
> to answer near neighbor type of queries efficiently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-344) Minhash based clustering

2010-03-22 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated MAHOUT-344:
-

Attachment: MAHOUT-344-v1.patch

As per "Yonik's law of patches" submitting my implementation. Please feel free 
to provide ideas for improvement or even submit an improved patch. 

> Minhash based clustering 
> -
>
> Key: MAHOUT-344
> URL: https://issues.apache.org/jira/browse/MAHOUT-344
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.3
>Reporter: Ankur
>Assignee: Ankur
> Attachments: MAHOUT-344-v1.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high 
> dimensional data. The essence of the technique is to hash each item using 
> multiple independent hash functions such that the probability of collision of 
> similar items is higher. Multiple such hash tables can then be constructed  
> to answer near neighbor type of queries efficiently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: My ideas for GSoC 2010

2010-03-22 Thread Ankur C. Goel
Since Sean already answered IDEA-2, I'll reply to IDEA 1.

MinHash (and shingling in general) is a very efficient clustering technique 
that has traditionally been employed by search engines for near-duplicate 
detection of web documents. It is known to be efficient and effective at web 
scale. It has also been applied successfully to "near-neighbor or similar 
..." types of problems in recommendations, search, and various other domains 
on text, audio and image data. So I think it is quite "cool".
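
To give a flavor of the technique, here is a minimal MinHash sketch in Java
(illustrative only -- not the code in my patch). Each item's feature set is
reduced to a short signature; the fraction of matching signature slots
estimates Jaccard similarity, so items can be bucketed by bands of their
signatures to form candidate clusters.

import java.util.Arrays;
import java.util.Random;

/**
 * Minimal MinHash sketch. Assumes non-negative integer feature ids per item.
 */
public class MinHashSketch {

  private static final int NUM_HASHES = 20;
  private static final long PRIME = 2147483647L; // 2^31 - 1, a Mersenne prime

  private final long[] a = new long[NUM_HASHES];
  private final long[] b = new long[NUM_HASHES];

  public MinHashSketch(long seed) {
    Random rnd = new Random(seed);
    for (int i = 0; i < NUM_HASHES; i++) {
      a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // avoid a == 0
      b[i] = rnd.nextInt(Integer.MAX_VALUE);
    }
  }

  /** signature[i] = min over the item's features x of (a[i]*x + b[i]) mod PRIME */
  public long[] signature(int[] features) {
    long[] sig = new long[NUM_HASHES];
    Arrays.fill(sig, Long.MAX_VALUE);
    for (int x : features) {
      for (int i = 0; i < NUM_HASHES; i++) {
        long h = (a[i] * x + b[i]) % PRIME;
        if (h < sig[i]) {
          sig[i] = h;
        }
      }
    }
    return sig;
  }

  /** The fraction of matching slots estimates the Jaccard similarity. */
  public static double estimateSimilarity(long[] s1, long[] s2) {
    int matches = 0;
    for (int i = 0; i < s1.length; i++) {
      if (s1[i] == s2[i]) {
        matches++;
      }
    }
    return (double) matches / s1.length;
  }
}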

I personally have experimented with MinHash in recommendations and found it 
to be more effective than I anticipated, especially when dealing with 
high-dimensional data. I have a basic implementation that I submitted to 
Mahout; see https://issues.apache.org/jira/browse/MAHOUT-344. You are welcome 
to work on it if you'd like.

Improving and integrating it with our recommenders would make for a good GSoC 
project IMO.

Regards
-...@nkur

 3/19/10 7:04 PM, "cristi prodan"  wrote:

Dear Mahout community,

My name is Cristi Prodan, I'm 23 years old and currently a 2nd-year student 
pursuing an MSc degree in Computer Science.
I started studying machine learning in the past year, and during my research 
I found out about the MapReduce model. Then I discovered Hadoop and Mahout. I 
was very impressed by the power of these frameworks and their great 
potential. For this reason I would like to submit a proposal for this year's 
Google Summer of Code competition.

I have looked at the proposals made by Robin on JIRA 
(https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hide&requestId=12314021).
 I have settled on two ideas, and I would like to ask for your help in 
deciding which one would be best to pick. Since I've never done GSoC before, 
I'm hoping someone can advise on the size of each project (too small or too 
big for the summer period) and, most of all, its importance for the Mahout 
framework. After hearing your answers, my intention is to focus fully on 
thorough research of a single idea.


IDEA 1 - MinHash clustering
---
The first idea came after taking a look at Google's paper on collaborative 
filtering for their news system [2]. In that paper, I looked at MinHash 
clustering.
My first question is: is MinHash clustering considered cool? If yes, then I 
would like to take a stab at implementing it.
The paper also describes the implementation in a MapReduce style. Since this 
is only a suggestion, I will not elaborate on the solution now. I would like 
to ask you whether this might be considered a good choice (i.e. important 
for the framework to have something like this) and whether this is a big 
enough project.

IDEA 2 - Additions to Taste Recommender
---
My second idea for this competition was to add some capabilities to the 
Taste framework. I have reviewed a couple of papers from the Netflix contest 
winning teams, read chapters 1 through 6 of [1] and looked into Taste's 
code. My idea was to implement parallel prediction-blending support using 
linear regression or another machine learning method - but so far I haven't 
got to a point where I have a clear solution for this. I'm preparing my 
dissertation on recommender systems, and this was the first idea I had when 
thinking about participating in GSoC. If you have any ideas on this and want 
to share them, I would be very thankful.

Thank you in advance.

Best regards,
Cristi.

BIBLIOGRAPHY:
---
[1] Owen, Anil - Mahout in Action. Manning, 2010.

[2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News 
Personalization: Scalable Online Collaborative Filtering, WWW 2007.




Re: New to Mahout with GSoC ambitions

2010-03-22 Thread Sean Owen
I'm referring to the code in org.apache.mahout.cf.taste.hadoop.item .
Same algorithm really, different implementation. The task remains to
combine them and take the best parts of both. The major difference is
that one does the final matrix / user-vector multiplication in a
distributed way and the other doesn't, and the question is what runs
faster and scales better.
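
For anyone following along, the step in question is essentially the
following -- a toy in-memory illustration, not code from either package.
Scores are the item-item co-occurrence matrix times the user's preference
vector; the open question above is whether that multiplication is better
done in the reducers or on a single node.

import java.util.HashMap;
import java.util.Map;

/** Toy version of the final scoring step: scores = C * userPrefs. */
public class CooccurrenceScore {

  /**
   * cooccurrence.get(i).get(j) = number of users who expressed a
   * preference for both items i and j (symmetric).
   */
  public static Map<Integer, Double> scores(
      Map<Integer, Map<Integer, Integer>> cooccurrence,
      Map<Integer, Double> userPrefs) {
    Map<Integer, Double> result = new HashMap<Integer, Double>();
    for (Map.Entry<Integer, Double> pref : userPrefs.entrySet()) {
      Map<Integer, Integer> row = cooccurrence.get(pref.getKey());
      if (row == null) {
        continue;
      }
      // score(j) += C[i][j] * pref(i) for every item j co-occurring with i
      for (Map.Entry<Integer, Integer> cell : row.entrySet()) {
        Double old = result.get(cell.getKey());
        double add = cell.getValue() * pref.getValue();
        result.put(cell.getKey(), old == null ? add : old + add);
      }
    }
    // A real recommender would drop items the user already has and
    // return the top N by score.
    return result;
  }
}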

On Sun, Mar 21, 2010 at 5:13 PM, Claudio Martella
 wrote:
> Would you tell more about this point? I'm looking at
> trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/cooccurence
>
> and I can find only one co-occurrence-based recommender following the
> path
> ItemBigramGenerator, ItemSimilarityEstimator, UserItemJoiner, UserItemRecommender.


Progress with Eclipse

2010-03-22 Thread Benson Margulies
I need a brave eclipse user to try out the following, and then I have
some questions for others.

1) Remove all .project and .classpath files from your tree.
2) cd to 'eclipse'.
3) Pick a new pathname for an eclipse workspace (call it WORKSPACE) in
the following.
4) mvn -Psetup-eclipse-workspace -Declipse.workspace.dir=WORKSPACE

This much will create WORKSPACE, copy some files into it, and set some
global options.

5) cd .. (to the mahout top)
6) mvn -Psetup.eclipse -Declipse.workspace=WORKSPACE

Now there will be .project and .classpath files.

7) start eclipse, select WORKSPACE
8) import projects from the mahout toplevel

If all goes well, you will be presented with a lot of PMD complaints.
I turned on PMD as part of the show, and it seems that we have a
supply of PMD non-compliance. What do people think about the PMD rules
we have checked in? Do we want to conform to them?


Re: New to Mahout with GSoC ambitions

2010-03-22 Thread Claudio Martella
Sean Owen wrote:
> +mahout-user
>
> From a recommender perspective I can think of three worthwhile projects:
>
> 1. Combine the two co-occurrence-based distributed recommenders in the
> code now. They take slightly different approaches. Ankur's working on
> this but might give it over to a GSoC student. This is probably 1/2
> the size of a proper GSoC project.
>   

Would you tell more about this point? I'm looking at
trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/cooccurence

and I can find only one co-occurrence-based recommender following the
path
ItemBigramGenerator, ItemSimilarityEstimator, UserItemJoiner, UserItemRecommender.

Thanks
> 2. Add a fully distributed slope-one recommender. Part of the
> computation is already distributed. Efficiently distributing the rest
> is interesting. Also not so hard: I'd judge this is 1/2 a GSoC
> project.
>
> 3. Implement a probabilistic model-based recommender of any kind,
> distributed or non-distributed. This is probably a whole GSoC project.
>
> On Fri, Mar 19, 2010 at 11:45 AM, RSJ  wrote:
>   
>> Hey there,
>>
>> My name is Richard Just, I'm a final year BSc Applied Computer Science
>> student at Reading University, UK, with a strong focus on programming.
>> I'm just finishing up a term that included modules in Distributed
>> Computing and Evolutionary Computation, which have been the greatest
>> modules of my uni career by far. Between that, my love for open source
>> and having read about the ASF, I'm really interested in taking part in
>> GSoC with an ASF project, namely Mahout. I'm really taken by the ethos
>> behind the ASF as a whole and I'm hoping that taking part in GSoC will
>> be the start of my long term involvement with ASF projects.
>>
>> My main programming background is Java, and I did a 9 month placement
>> programming in it for a non-profit organisation last year. From that
>> placement I gained a love and appreciation for well commented, well
>> documented code, while from my time at university I now have a passion
>> for well designed code and the time it saves.
>>
>> With GSoC, I've read through the suggested Mahout projects so far, and I
>> think implementing an algorithm is probably my best bet. I say that
>> because I don't have much Mahout experience yet, but through multiple
>> University modules I do have experience designing and implementing
>> algorithms. With that in mind and given that there is already a
>> Classifier proposal, I was thinking either a Cluster or Recommendation
>> algorithm.
>>
>> I'd be very interested in hearing if there are any particular Clustering
>> algorithms or particular elements of the top Netflix team solutions
>> people would like to see implemented?
>>
>> Many thanks for reading this
>> RSJ
>>
>> 
>
>   


-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.marte...@tis.bz.it http://www.tis.bz.it





Re: [VOTE] Mahout as TLP

2010-03-22 Thread Grant Ingersoll
This vote has passed:
+1s: 11 (all but one are binding)
+0: 1

I'll submit to the PMC.


On Mar 19, 2010, at 10:50 AM, Grant Ingersoll wrote:

> Per the earlier discussions, I'm calling a vote to submit the following 
> resolution [1] to the Lucene PMC for consideration to then promote Mahout to 
> be a TLP.
> 
> [] +1  I'm for Mahout being a TLP and the resolution below.
> [] 0 No opinion
> [] -1 Bad idea.  Please give justification.
> 
> Majority wins (i.e. no vetos).  Vote will be open for 72 hours.  
> 
> 
> [1]
> X. Establish the Apache Mahout Project
> 
> WHEREAS, the Board of Directors deems it to be in the best
> interests of the Foundation and consistent with the
> Foundation's purpose to establish a Project Management
> Committee charged with the creation and maintenance of
> open-source software related to a machine learning platform
> for distribution at no charge to the public.
> 
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> Committee (PMC), to be known as the "Apache Mahout Project",
> be and hereby is established pursuant to Bylaws of the
> Foundation; and be it further
> 
> RESOLVED, that the Apache Mahout Project be and hereby is
> responsible for the creation and maintenance of software
> related to a machine learning platform; and be it further
> 
> RESOLVED, that the office of "Vice President, Apache Mahout" be
> and hereby is created, the person holding such office to
> serve at the direction of the Board of Directors as the chair
> of the Apache Mahout Project, and to have primary responsibility
> for management of the projects within the scope of
> responsibility of the Apache Mahout Project; and be it further
> 
> RESOLVED, that the persons listed immediately below be and
> hereby are appointed to serve as the initial members of the
> Apache Mahout Project:
> 
>   • Abdelhakim Deneche 
>   • Isabel Drost (isa...@...)
>   • Ted Dunning (tdunn...@...)
>   • Jeff Eastman (jeast...@...)
>   • Drew Farris (d...@...)
>   • Grant Ingersoll (gsing...@...)
>   • Benson Margulies (bimargul...@...)
>   • Sean Owen (sro...@...)
>   • Robin Anil (robina...@...)
>   • Jake Mannix (jman...@...)
> 
> RESOLVED, that the Apache Mahout Project be and hereby
> is tasked with the migration and rationalization of the Apache
> Lucene Mahout sub-project; and be it further
> 
> RESOLVED, that all responsibilities pertaining to the Apache
> Lucene Mahout sub-project encumbered upon the
> Apache Mahout Project are hereafter discharged.
> 
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Sean Owen
> be appointed to the office of Vice President, Apache Mahout, to
> serve in accordance with and subject to the direction of the
> Board of Directors and the Bylaws of the Foundation until
> death, resignation, retirement, removal or disqualification,
> or until a successor is appointed.



Fw: Mentors for GSoC

2010-03-22 Thread Isabel Drost

Potential GSoC mentors - please tell Noirin who you are, if you want to
mentor a student for Mahout. More details below. If you have not done
so already, please also subscribe to code-awa...@apache.org for more
information on GSoC at Apache.


Begin forwarded message:

Date: Mon, 22 Mar 2010 15:48:17 +0100
From: Noirin Shirley 
To: code-awa...@apache.org
Subject: Mentors for GSoC


Thanks to all those who've already signed up to be mentors at
http://socghop.appspot.com/ !

Unfortunately, the ASF is a big Foundation, and I don't know who all
those who've signed up are. All I see is whatever's set as your LinkID
and Public Name in your profile on the webapp.

I can work out who "Grant Ingersoll(gsingers)" is, and I can even give
a reasonable guess as to who "isabel(isabel)" might be, but relying on
me to know the names of all the people who might mentor, and to be
able to tell who's a student who's clicked the wrong button, isn't
really going to scale!

So please, it would make my job much easier if you could drop a mail
to this list with your LinkID when you sign up to be a mentor :-)

Thanks a million!

Noirin


Reg. Netflix Prize Apache Mahout GSoC Application

2010-03-22 Thread Sisir Koppaka
Dear Robin & the Apache Mahout team,
I'm Sisir Koppaka, a third-year student from IIT Kharagpur, India. I've
contributed to open source projects like FFmpeg earlier (repository diff
links are here
<http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0>
and here
<http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd>),
and I am very interested in working on a project for Apache Mahout this year
(the Netflix algorithms project, to be precise - mentored by Robin). Kindly
let me explain my background so that I can make myself relevant in this
context.

I've done research work in meta-heuristics, including proposing the
equivalents of local search and mutation for quantum-inspired algorithms, in
my paper titled "*Superior Exploration-Exploitation Balance With
Quantum-Inspired Hadamard Walks*", which was accepted as a late-breaking
paper at GECCO 2010. We (myself and a friend - it was independent work) hope
to send an expanded version of the work to a journal in the near future. For
this project, our language of implementation was Mathematica, as we needed
the combination of functional paradigms and available mathematically sound
resources (like biased random-number generation, simple linear programming
functions, etc.) as well as rapid prototyping ability.

I interned at GE Research in their Computing and Decision Sciences Lab
<http://ge.geglobalresearch.com/technologies/computing-decision-sciences/>
last year, where I worked on machine learning techniques for large-scale
databases - specifically on the Netflix Prize itself. Over a 2-month
internship we rose from 1800th to 409th position on the leaderboard, and
implemented at least one variant of each of the major algorithms. The
contest ended at the same time as the conclusion of our internship, and the
winning result was a combination of multiple variants of the algorithms we
had implemented.

Interestingly, we did try to use Hadoop and the Map-Reduce model for this,
based on a talk by a person from Yahoo! who visited us during that time.
However, not having access to a cluster proved to be an impediment to fast
iterative development. We had one machine with 16 cores, so we developed a
toolkit in C++ that could multiprocess up to 16 threads (data-input
parallelization, rather than modifying the algorithms to suit the Map-Reduce
model), and implemented all our algorithms using that toolkit. Specifically,
SVD, kNN Movie-Movie, kNN User-User, and NSVD (BellKor and other variants
like the Paterek SVD, and the temporal SVD++ too) were the major algorithms
we implemented. Some algorithms had readily available open source code for
the Netflix Prize, like NSVD1, so we used that as well. We also worked on
certain regression schemes that could improve prediction accuracy, like
kernel ridge regression, and its optimization.

Towards the end, we also attempted to verify the results of the infamous
paper that showed that IMDB-Netflix correlation could destroy privacy and
identify users. We imported IMDB datasets, put them into a database,
correlated the IMDB entries to Netflix (we matched double the number of
movies that the paper mentioned), and then verified the results. We also
identified genre-wise trends and recorded them as such. Unfortunately, the
paper resulted in a privacy lawsuit, wherein Netflix surrendered its right
to hold future Prizes of this kind in return for withdrawal of the claims.
The case effectively closed the possibility of Netflix or any other company
releasing similar datasets to the public pending further advances in
privacy-enforcement techniques, leaving the Netflix database as the largest
of its kind.

Naturally, it would be very interesting to implement the same for Hadoop,
since the Netflix database is the largest of its kind and this would serve
multiple uses - as a kickstart, a tutorial, and a performance-testing tool
for grids (using an included segment of the total database). In addition,
the Netflix and IMDB framework base code could be incredibly useful as a
prototyping tool for algorithm designers who want to design better machine
learning algorithms/ensembles for large-scale databases. The lack of a
decent, scalable solution for the Netflix Prize that people can learn from
and add to is also a major disappointment that this proposal hopes to
correct.

Specifically, the challenge I am personally looking forward to is
implementing some of these algorithms using the Map-Reduce model, which is
what I missed out on last time. I am looking forward to:
1. Accounting for the 12 global effects. The global effects were 12
corrections made to the dataset to account for time, multiple votes on a
single day, etc., and these alone give a major boost to the accuracy of the
predictions.
2. Implementing at least 1 kNN-based approach. Most of them differ in their
parameters, or in their choice of distance definitions (Pearson, Cosine,
Euclidean

[jira] Created: (MAHOUT-345) [GSOC] integrate Mahout with Drupal/PHP

2010-03-22 Thread Daniel Xiaodan Zhou (JIRA)
[GSOC] integrate Mahout with Drupal/PHP
---

 Key: MAHOUT-345
 URL: https://issues.apache.org/jira/browse/MAHOUT-345
 Project: Mahout
  Issue Type: Task
  Components: Website
Reporter: Daniel Xiaodan Zhou


Drupal is a very popular open source web content management system. It's 
been widely used in e-commerce sites, media sites, etc. This is a list of 
famous sites using Drupal: 
http://socialcmsbuzz.com/45-drupal-sites-which-you-may-not-have-known-were-drupal-based-24092008/

Integrating Mahout with Drupal would greatly increase the impact of Mahout 
in web systems: any Drupal website could easily use Mahout to make content 
recommendations or cluster content.

I'm a PhD student at the University of Michigan, with a research focus on 
recommender systems. Last year I participated in GSoC 2009 with Drupal.org, 
and developed a recommender system for Drupal. But that module was not as 
sophisticated as Mahout, and I think it would be better to integrate Mahout 
into Drupal than to develop a separate Mahout-like module for Drupal.

Any comments? I can provide more information if people here are interested. 
Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-03-22 Thread Robin Anil
Hi Sisir,
  I am currently on vacation, so I won't be able to review your
proposal fully. But from the looks of it, what I would suggest is to target
a somewhat more modest and practical proposal. Trust me, converting these
algorithms to map/reduce is not as easy as it sounds, and you will spend
most of your time debugging your code. Your work history is quite
impressive, but what's more important here is getting your proposal right.
Sean has written most of the recommender code in Mahout and would be the
best person to give you feedback, as he has tried quite a number of
approaches to recommenders on map/reduce and knows the constraints of the
framework very well. Feel free to explore the current Mahout recommender
code and ask on the list if you find anything confusing. But remember, you
are trying to reproduce some of the cutting-edge work of the last 2 years in
recommendations in a span of 10 weeks :) so stop and ponder the feasibility.
If you are still good to go, then you probably need to demonstrate something
in terms of code during the proposal period (which is optional).

Don't take this the wrong way; it's not meant to demotivate you. If we can
get this into Mahout, I am sure no one here would object to it. So your best
next step would be: read, explore, think, discuss.

Regards
Robin


On Mon, Mar 22, 2010 at 4:36 PM, Sisir Koppaka wrote:

> Dear Robin & the Apache Mahout team,
> I'm Sisir Koppaka, a third-year student from IIT Kharagpur, India. I've
> contributed to open source projects like FFmpeg earlier(Repository diff
> links are here<
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0
> >and
> here<
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd
> >
> ), and I am very interested to work on a project for Apache Mahout this
> year(the Netflix algorithms project, to be precise - mentored by Robin).
> Kindly let me explain my background so that I can make myself relevant in
> this context.
>
> I've done research work in meta-heuristics, including proposing the
> equivalents of local search and mutation for quantum-inspired algorithms,
> in
> my paper titled "*Superior Exploration-Exploitation Balance With
> Quantum-Inspired Hadamard Walks*", that was accepted as a late-breaking
> paper at GECCO 2010. We(myself and a friend - it was an independent work),
> hope to send an expanded version of the communication to a journal in the
> near future. For this project, our language of implementation was in
> Mathematica, as we needed the combination of functional paradigms and
> available mathematically sound resources(like biased random number
> generation, simple linear programming functions etc.) as well as rapid
> prototyping ability.
>
> I have earlier interned in GE Research in their Computing and Decision
> Sciences Lab<
> http://ge.geglobalresearch.com/technologies/computing-decision-sciences/
> >last
> year, where I worked on machine learning techniques for large-scale
> databases - specifically on the Netflix Prize itself. Over a 2 month
> internship we rose from 1800 to 409th position on the Leaderboard, and had
> implemented at least one variant of each of the major algorithms. The
> contest ended at the same time as the conclusion of our internship, and the
> winning result was the combination of multiple variants of our implemented
> algorithms.
>
> Interestingly, we did try to use Hadoop and the Map-Reduce model for the
> purpose based on a talk from a person from Yahoo! who visited us during
> that
> time. However, not having access to a cluster proved to be an impedance for
> fast iterative development. We had one machine of 16 cores, so we developed
> a toolkit in C++ that could multiprocess upto 16 threads(data input
> parallelization, rather than modifying the algorithms to suit the
> Map-Reduce
> model), and implemented all our algorithms using the same toolkit.
> Specifically, SVD, kNN Movie-Movie, kNN User-User, NSVD(Bellkor and other
> variants like the Paterek SVD, and the temporal SVD++ too) were the major
> algorithms that we implemented. Some algorithms had readily available open
> source code for the Netflix Prize, like NSVD1, so we used them as well. We
> also worked on certain regression schemes that could improve prediction
> accuracy like kernel-ridge regression, and it's optimization.
>
> Towards the end, we also attempted to verify the results of the infamous
> paper that showed that IMDB-Netflix correlation could destroy privacy, and
> identify users. We would import IMDB datasets, and put them into a database
> and then correlate the IMDB entries to Netflix(we matched double the number
> of movies that the paper mentioned), and then verify the results. We also
> identified genre-wise trends and recorded them as such. Unfortuantely, the
> paper resulted in a libel case, wherein Netflix surrendered it's rights to
> hold future Prizes of this kind in return for withdrawal of charges. The
>

Re: [jira] Created: (MAHOUT-345) [GSOC] integrate Mahout with Drupal/PHP

2010-03-22 Thread Robin Anil
This is a very interesting proposal =) I am sure no one would have seen this
coming even a month ago. We'll have to see how feasible this is.

Robin


On Mon, Mar 22, 2010 at 7:51 PM, Daniel Xiaodan Zhou (JIRA)  wrote:

> [GSOC] integrate Mahout with Drupal/PHP
> ---
>
> Key: MAHOUT-345
> URL: https://issues.apache.org/jira/browse/MAHOUT-345
> Project: Mahout
>  Issue Type: Task
>  Components: Website
>Reporter: Daniel Xiaodan Zhou
>
>
> Drupal is a very popular open source web content management system. It's
> been widely used in e-commerce sites, media sites, etc. This is a list of
> famous site using Drupal:
> http://socialcmsbuzz.com/45-drupal-sites-which-you-may-not-have-known-were-drupal-based-24092008/
>
> Integrate Mahout with Drupal would greatly increase the impact of Mahout in
> web systems: any Drupal website can easily use Mahout to make content
> recommendations or cluster contents.
>
> I'm a PhD student at University of Michigan, with a research focus on
> recommender systems. Last year I participated GSOC 2009 with Drupal.org, and
> developed a recommender system for Drupal. But that module was not as
> sophisticated as Mahout. And I think it would be nice just to integrate
> Mahout into Drupal rather than developing a separate Mahout-like module for
> Drupal.
>
> Any comments? I can provide more information if people here are interested.
> Thanks.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


git or svn

2010-03-22 Thread D L
Hello,

Apologies for the stupid question of the day... if I want to see the latest
Mahout commits, should I check out from the Apache git repository or the svn
repository tree given on the Mahout dev page? Which one do you guys commit
to as the primary one anyway?

Thank you very much.

Best

Dmitriy


Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-03-22 Thread Sisir Koppaka
Hi,
Thanks a lot for taking time out to reply. I understand that it's important
to get the proposal right - that's why I wanted to bounce the possibilities
as far as Netflix is concerned - the methods that I've worked with before -
off this list, and see what would be of priority interest to the team. If
global effects and temporal SVD would be of interest, then I'd incorporate
them into my final proposal accordingly. On the other hand, I've read that
RBM is something the team is interested in, so I could also implement a very
well performing (approximately 0.91 RMSE) RBM for Netflix as the GSoC
project. I'd like to know which of the Netflix algorithms the Mahout team
would like to see implemented first.

Depending on the feedback, I'll prepare the final proposal. I'll definitely
work with the code now and post any queries that I get on the list.

Thanks a lot,
Best regards,
Sisir Koppaka

On Tue, Mar 23, 2010 at 1:22 AM, Robin Anil  wrote:

> Hi Sisir,
>  I am currently on vacation. So wont be able to review your
> proposal fully. But from the looks of it what I would suggest you is to
> target a somewhat lower and practical proposal. Trust me converting these
> algorithms to map/reduce is not as easy as it sounds and most of the time
> you would spend in debugging your code. Your work history is quite
> impressive but whats more important here is getting your proposal right.
> Sean has written most of the recommender code of Mahout and would be best
> to
> give you feedback as he has tried quite a number of approaches to
> recommenders on map/reduce and knows very well, some of the constraints of
> the framework. Feel free to explore the current Mahout recommenders code
> and
> ask on the list if you find anything confusing. But remember you are trying
> to reproduce some of the cutting edge work in Recommendations over 2 years
> in a span of 10 weeks :) so stop and ponder over the feasibility. If you
> still are good to go then prolly, you need to demonstrate something in
> terms
> of code during the proposal period(which is optional).
>
> Don't take this in the wrong way, its not meant to demotivate you. If we
> can
> get this into mahout, I am sure noone here would be objecting to it. So
> your
> good next step would be read, explore, think, discuss.
>
> Regards
> Robin
>
>
> On Mon, Mar 22, 2010 at 4:36 PM, Sisir Koppaka  >wrote:
>
> > Dear Robin & the Apache Mahout team,
> > I'm Sisir Koppaka, a third-year student from IIT Kharagpur, India. I've
> > contributed to open source projects like FFmpeg earlier(Repository diff
> > links are here<
> >
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0
> > >and
> > here<
> >
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd
> > >
> > ), and I am very interested to work on a project for Apache Mahout this
> > year(the Netflix algorithms project, to be precise - mentored by Robin).
> > Kindly let me explain my background so that I can make myself relevant in
> > this context.
> >
> > I've done research work in meta-heuristics, including proposing the
> > equivalents of local search and mutation for quantum-inspired algorithms,
> > in
> > my paper titled "*Superior Exploration-Exploitation Balance With
> > Quantum-Inspired Hadamard Walks*", that was accepted as a late-breaking
> > paper at GECCO 2010. We(myself and a friend - it was an independent
> work),
> > hope to send an expanded version of the communication to a journal in the
> > near future. For this project, our language of implementation was in
> > Mathematica, as we needed the combination of functional paradigms and
> > available mathematically sound resources(like biased random number
> > generation, simple linear programming functions etc.) as well as rapid
> > prototyping ability.
> >
> > I have earlier interned in GE Research in their Computing and Decision
> > Sciences Lab<
> > http://ge.geglobalresearch.com/technologies/computing-decision-sciences/
> > >last
> > year, where I worked on machine learning techniques for large-scale
> > databases - specifically on the Netflix Prize itself. Over a 2 month
> > internship we rose from 1800 to 409th position on the Leaderboard, and
> had
> > implemented at least one variant of each of the major algorithms. The
> > contest ended at the same time as the conclusion of our internship, and
> the
> > winning result was the combination of multiple variants of our
> implemented
> > algorithms.
> >
> > Interestingly, we did try to use Hadoop and the Map-Reduce model for the
> > purpose based on a talk from a person from Yahoo! who visited us during
> > that
> > time. However, not having access to a cluster proved to be an impedance
> for
> > fast iterative development. We had one machine of 16 cores, so we
> developed
> > a toolkit in C++ that could multiprocess upto 16 threads(data input
> > parallelization, rather than modifying the algorithms to suit the
> > Map-Reduce
> 

Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-03-22 Thread Jake Mannix
Hi Sisir,

  I'm the one who added the most recent SVD stuff to Mahout,
and while I'd love to see improvements in that, and incorporation
into netflix-style recommenders, I would even more like to see
a stacked-RBM implementation if you think you can do that.  We
don't currently have anything like that in Mahout, and even if
it is completely serial (no Hadoop-hooks), it would be very
excellent as an addition to the project.
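
  For concreteness, the core training update is small enough to sketch.
Below is a toy CD-1 (contrastive divergence) step for a binary RBM -- my own
illustration, not a design for the Mahout code; the Netflix-style RBM uses
softmax visible units per movie, but the update has the same shape.

import java.util.Random;

public class RbmStep {

  static final Random RND = new Random(42);

  static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

  /** One CD-1 step on a single visible vector v0; updates w in place. */
  static void cd1(double[][] w, double[] v0, double learningRate) {
    int nv = v0.length;
    int nh = w[0].length;

    // Positive phase: p(h=1|v0), then sample binary h0.
    double[] h0p = new double[nh];
    double[] h0 = new double[nh];
    for (int j = 0; j < nh; j++) {
      double sum = 0.0;
      for (int i = 0; i < nv; i++) sum += v0[i] * w[i][j];
      h0p[j] = sigmoid(sum);
      h0[j] = RND.nextDouble() < h0p[j] ? 1.0 : 0.0;
    }

    // Negative phase: reconstruct v1 from h0, then p(h=1|v1).
    double[] v1 = new double[nv];
    for (int i = 0; i < nv; i++) {
      double sum = 0.0;
      for (int j = 0; j < nh; j++) sum += h0[j] * w[i][j];
      v1[i] = sigmoid(sum);
    }
    double[] h1p = new double[nh];
    for (int j = 0; j < nh; j++) {
      double sum = 0.0;
      for (int i = 0; i < nv; i++) sum += v1[i] * w[i][j];
      h1p[j] = sigmoid(sum);
    }

    // Update: dW = lr * (<v0 h0> - <v1 h1>); bias terms omitted for brevity.
    for (int i = 0; i < nv; i++) {
      for (int j = 0; j < nh; j++) {
        w[i][j] += learningRate * (v0[i] * h0p[j] - v1[i] * h1p[j]);
      }
    }
  }
}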

  That's my opinion on the matter. :)

  -jake

On Mon, Mar 22, 2010 at 1:53 PM, Sisir Koppaka wrote:

> Hi,
> Thanks a lot for taking time out to reply. I understand that it's important
> to get the proposal right - that's why I wanted to bounce off all
> possibilites as far as Netflix is concerned - the methods that I've worked
> with before, on this list, and see what would be of priority interest to
> the
> team. If global effects, and temporal SVD would be of interest then I'd
> incorporate that into my final proposal accordingly. On the other hand,
> I've
> read that RBM is something the team is interested in, so I could also
> implement a very good performing(approximately 0.91 RMSE) RBM for Netflix,
> as the GSoC project. I'd like to know which of the Netflix algorithms the
> Mahout team would like to see implemented first.
>
> Depending on the feedback, I'll prepare the final proposal. I'll definitely
> work with the code now and post any queries that I get on the list.
>
> Thanks a lot,
> Best regards,
> Sisir Koppaka
>
> On Tue, Mar 23, 2010 at 1:22 AM, Robin Anil  wrote:
>
> > Hi Sisir,
> >  I am currently on vacation. So wont be able to review your
> > proposal fully. But from the looks of it what I would suggest you is to
> > target a somewhat lower and practical proposal. Trust me converting these
> > algorithms to map/reduce is not as easy as it sounds and most of the time
> > you would spend in debugging your code. Your work history is quite
> > impressive but whats more important here is getting your proposal right.
> > Sean has written most of the recommender code of Mahout and would be best
> > to
> > give you feedback as he has tried quite a number of approaches to
> > recommenders on map/reduce and knows very well, some of the constraints
> of
> > the framework. Feel free to explore the current Mahout recommenders code
> > and
> > ask on the list if you find anything confusing. But remember you are
> trying
> > to reproduce some of the cutting edge work in Recommendations over 2
> years
> > in a span of 10 weeks :) so stop and ponder over the feasibility. If you
> > still are good to go then prolly, you need to demonstrate something in
> > terms
> > of code during the proposal period(which is optional).
> >
> > Don't take this in the wrong way, its not meant to demotivate you. If we
> > can
> > get this into mahout, I am sure noone here would be objecting to it. So
> > your
> > good next step would be read, explore, think, discuss.
> >
> > Regards
> > Robin
> >
> >
> > On Mon, Mar 22, 2010 at 4:36 PM, Sisir Koppaka  > >wrote:
> >
> > > Dear Robin & the Apache Mahout team,
> > > I'm Sisir Koppaka, a third-year student from IIT Kharagpur, India. I've
> > > contributed to open source projects like FFmpeg earlier(Repository diff
> > > links are here<
> > >
> >
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0
> > > >and
> > > here<
> > >
> >
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd
> > > >
> > > ), and I am very interested to work on a project for Apache Mahout this
> > > year(the Netflix algorithms project, to be precise - mentored by
> Robin).
> > > Kindly let me explain my background so that I can make myself relevant
> in
> > > this context.
> > >
> > > I've done research work in meta-heuristics, including proposing the
> > > equivalents of local search and mutation for quantum-inspired
> algorithms,
> > > in
> > > my paper titled "*Superior Exploration-Exploitation Balance With
> > > Quantum-Inspired Hadamard Walks*", that was accepted as a late-breaking
> > > paper at GECCO 2010. We(myself and a friend - it was an independent
> > work),
> > > hope to send an expanded version of the communication to a journal in
> the
> > > near future. For this project, our language of implementation was in
> > > Mathematica, as we needed the combination of functional paradigms and
> > > available mathematically sound resources(like biased random number
> > > generation, simple linear programming functions etc.) as well as rapid
> > > prototyping ability.
> > >
> > > I have earlier interned in GE Research in their Computing and Decision
> > > Sciences Lab<
> > >
> http://ge.geglobalresearch.com/technologies/computing-decision-sciences/
> > > >last
> > > year, where I worked on machine learning techniques for large-scale
> > > databases - specifically on the Netflix Prize itself. Over a 2 month
> > > internship we rose from 1800 to 409th position on the Leaderboard, and
> > had
> > > 

Re: git or svn

2010-03-22 Thread Jake Mannix
Dmitriy,

  The official apache repository (where the committers write to) is
the subversion repo.  Git is just a clone/read-only mirror.  But since
you're not writing to either of them, use whichever you are more
comfortable working with. :)

  -jake

On Mon, Mar 22, 2010 at 1:03 PM, D L  wrote:

> Hello,
>
> Apologies for the stupid question of the day... if i want to see the latest
> mahout commits, should i check out from apache git repository or the svn
> repostiory tree given on the mahout dev page? Which one do you guys commit
> to as primary one anway?
>
> Thank you very much.
>
> Best
>
> Dmitriy
>


Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-03-22 Thread Ted Dunning
It is also typically the case that about 2/3 of the project consists of
moving the code from "working" to "usable".  That involves writing up
examples, wiki pages and ironing out strangeness in how the program is
invoked or how the output might be consumed.

On Mon, Mar 22, 2010 at 12:52 PM, Robin Anil  wrote:

>  But from the looks of it what I would suggest you is to
> target a somewhat lower and practical proposal. Trust me converting these
> algorithms to map/reduce is not as easy as it sounds and most of the time
> you would spend in debugging your code.
>


Re: [jira] Created: (MAHOUT-345) [GSOC] integrate Mahout with Drupal/PHP

2010-03-22 Thread Ted Dunning
What kind of integration would you be thinking of?

Drupal is largely PHP based which should make it relatively easy to build a
module that uses thrift to call java recommendation functions.

Would that be what you are talking about doing?  That might be of real
value.

The questions I have are:

a) how stable could the module be given the famously "no backward
compatibility guarantee" in Drupal?

b) what does the Drupal community want/need for recommendations?

c) would the GPL of Drupal interact poorly with the Apache licensing of
Mahout?

d) where would the resulting code live?  In Drupal as a contributed module?
(that makes more sense to me)

e) how would a Drupal admin configure the recommendation capability?  Could
that configuration be made web accessible?

This proposal is exciting, not because of the technical difficulty, but
because it would make Mahout much more usable to a large number of sites.


On Mon, Mar 22, 2010 at 12:51 PM, Daniel Xiaodan Zhou (JIRA) <
j...@apache.org> wrote:

>
> I'm a PhD student at University of Michigan, with a research focus on
> recommender systems. Last year I participated GSOC 2009 with Drupal.org, and
> developed a recommender system for Drupal. But that module was not as
> sophisticated as Mahout. And I think it would be nice just to integrate
> Mahout into Drupal rather than developing a separate Mahout-like module for
> Drupal.
>
> Any comments? I can provide more information if people here are interested.
> Thanks.


Re: [jira] Created: (MAHOUT-345) [GSOC] integrate Mahout with Drupal/PHP

2010-03-22 Thread Grant Ingersoll
Taste already exposes a web service layer, so I'm not sure how much more 
there is on the Mahout end for recommenders. Still, it would be great to 
see, and I'm sure it would help iron out API issues, etc. These are 
definitely the kinds of things I'd love to see.


On Mar 22, 2010, at 5:56 PM, Ted Dunning wrote:

> What kind of integration would you be thinking of?
> 
> Drupal is largely PHP based which should make it relatively easy to build a
> module that uses thrift to call java recommendation functions.
> 
> Would that be what you are talking about doing?  That might be of real
> value.
> 
> The questions I have are:
> 
> a) how stable could the module be given the famously "no backward
> compatibility guarantee" in Drupal?
> 
> b) what does the Drupal community want/need for recommendations?
> 
> c) would the GPL of Drupal interact poorly with the Apache licensing of
> Mahout?
> 
> d) where would the resulting code live?  In Drupal as a contributed module?
> (that makes more sense to me)
> 
> e) how would a Drupal admin configure the recommendation capability?  Could
> that configuration be made web accessible?
> 
> This proposal is exciting, not because of the technical difficulty, but
> because it would make Mahout much more usable to a large number of sites.
> 
> 
> On Mon, Mar 22, 2010 at 12:51 PM, Daniel Xiaodan Zhou (JIRA) <
> j...@apache.org> wrote:
> 
>> 
>> I'm a PhD student at University of Michigan, with a research focus on
>> recommender systems. Last year I participated GSOC 2009 with Drupal.org, and
>> developed a recommender system for Drupal. But that module was not as
>> sophisticated as Mahout. And I think it would be nice just to integrate
>> Mahout into Drupal rather than developing a separate Mahout-like module for
>> Drupal.
>> 
>> Any comments? I can provide more information if people here are interested.
>> Thanks.




Re: [jira] Created: (MAHOUT-345) [GSOC] integrate Mahout with Drupal/PHP

2010-03-22 Thread Ted Dunning
I wonder if this project isn't actually a Drupal project rather than a
Mahout project?

Still valuable, but it seems to me that the best mentors will be Drupal
developers rather than Mahout developers.

On Mon, Mar 22, 2010 at 3:06 PM, Grant Ingersoll wrote:

> Taste already exposes a web service layer, so I'm not sure how much more
> there is on the Mahout end for recommenders.


Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-03-22 Thread Sisir Koppaka
Hi Jake and Ted,
Thanks a lot for your feedback. I'll refocus my proposal on implementing an
RBM for the Netflix dataset (there is some parallelization opportunity; I'm
working on figuring out the details). This would add a new RBM algorithm, as
well as provide the hooks required for rapid iteration of an ensemble
approach for Netflix using already-implemented algorithms. As Ted mentioned,
I'll also account for documentation and other associated work in the
timeline of the final proposal.

I'll do some more ground work at the algorithm and code level, and update on
the list with any queries that I have.

Please do give any other feedback that you think is relevant.

Best regards,
Sisir Koppaka

On Tue, Mar 23, 2010 at 2:39 AM, Jake Mannix  wrote:

> Hi Sisir,
>
>  I'm the one who added the most recent SVD stuff to Mahout,
> and while I'd love to see improvements in that, and incorporation
> into netflix-style recommenders, I would even more like to see
> a stacked-RBM implementation if you think you can do that.  We
> don't currently have anything like that in Mahout, and even if
> it is completely serial (no Hadoop-hooks), it would be very
> excellent as an addition to the project.
>
>  That's my opinion on the matter. :)
>
>  -jake
>
-- 
SK


Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-03-22 Thread Sisir Koppaka
As far as I'm aware, RBMs for Netflix were pioneered by Ruslan Salakhutdinov
and Geoffrey Hinton of UToronto. I'm planning to base my code on their
algorithms. If there's another source you'd prefer I keep as a reference in
addition to the above, please do let me know.


Fwd: riffle ... small scale workflow manager

2010-03-22 Thread Ted Dunning
I was talking a few weeks ago with Chris Wensel of Cascading fame. The topic
was how projects like Mahout need workflow for gluing together analysis
steps (Jake's command-line props stuff notwithstanding), but existing
workflow systems have trouble helping us out. Cascading is GPL, Oozie is
perpetually not quite ready, and Pig is too ego-centric to make it easy to
integrate.

The discussion veered into Chris's thoughts about how annotations could be
useful for making different kinds of programs amenable to integration into
workflow systems. Out of this came the idea that there could be a non-GPL
annotation that could be consumed by a GPL workflow system without any
license question. It became clear that this annotation idea would also make
it easy to integrate tasks into different workflow systems.

At this point Chris did as Chris tends to do, and he took action. He
created not only the annotation system but also a small topological-sort
workflow manager as a proof of concept and reference implementation. This
workflow manager is feature-deficient relative to Cascading in that it
doesn't restructure map-reduce programs, nor does it provide operations like
group or join. What it does have is an Apache license, and what it would
allow is for Mahout programs to easily gain workflow capability without a
license nightmare.

The new workflow system is called Riffle (being the little cousin of
Cascading after all) and Chris has produced a Git repository for the code.

The basic way this works: it is assumed that there are multiple process
steps, and each process step can be interrogated for a list of input and
output dependencies. The methods that return these dependencies are marked
with @DependencyIncoming or @DependencyOutgoing. The application is
responsible for adding and managing these dependencies. Examples of input
and output dependencies might be the names of HDFS files or directories.
Once these dependencies have been defined, the workflow is invoked and it
starts workflow tasks in an order that guarantees that all of the inputs for
a task are available before it is run.
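
To make that concrete, an annotated process step might look roughly like
the following. This is my sketch: the annotation definitions below are
hypothetical stand-ins, so check Chris's repository for the real package
names and signatures.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;
import java.util.Collection;

// Hypothetical stand-ins for Riffle's annotations.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
@interface DependencyIncoming {}
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
@interface DependencyOutgoing {}

/**
 * A Mahout-style process step: the workflow manager reflects on the
 * annotations, topologically sorts steps by their declared paths, and
 * runs each step once its inputs exist.
 */
class VectorizeStep {

  @DependencyIncoming
  public Collection<String> getIncoming() {
    return Arrays.asList("hdfs:///data/raw-text");
  }

  @DependencyOutgoing
  public Collection<String> getOutgoing() {
    return Arrays.asList("hdfs:///data/tf-vectors");
  }

  public void run() {
    // ... launch the actual map/reduce job here ...
  }
}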

What do people think about this?  Is it as useful as I think it is?  Did I
not give enough information to even tell?

-- Forwarded message --
From: Chris K Wensel 
Date: Mon, Mar 22, 2010 at 3:24 PM
Subject: riffle
To: Ted Dunning 


OK, lets stick it out there and see what happens.

http://github.com/cwensel/riffle

chris

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com


Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-03-22 Thread Sean Owen
I also second the recommendation to pick one of these ideas and focus
on it first, as it will be a lot more work to get it working,
documented, and tested!

I personally like the idea of getting a distributed SVD-based
recommender into the project. The SVD works quite well and all we have
now is a non-parallel implementation.
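
(For context, the prediction step of such a latent-factor recommender is
just a dot product over the learned factors -- a toy sketch, not Mahout
code:)

/** Toy sketch: predicting a preference from learned latent factors. */
public class FactorPredict {
  public static double predict(double[] userFactors, double[] itemFactors) {
    double dot = 0.0;
    for (int f = 0; f < userFactors.length; f++) {
      dot += userFactors[f] * itemFactors[f];  // r-hat = u . v
    }
    return dot; // a real system would add global/user/item bias terms
  }
}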

However I support Jake's suggestion more that RBM would be most useful
at the moment.

Pick your area, dig in, and we can help you refine your ideas and begin.

Sean


[jira] Commented: (MAHOUT-345) [GSOC] integrate Mahout with Drupal/PHP

2010-03-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848433#action_12848433
 ] 

Sean Owen commented on MAHOUT-345:
--

Sounds like a great idea. I know nothing of Drupal, but I ought to know the 
most about Mahout and recommenders together. If you want to lead this charge 
to develop and improve your integration, I'd gladly provide input -- though 
is this more of a Drupal task than a Mahout one?

Maybe you can start by talking about what in your current integration could 
be improved. Quality? Performance? Talk about the nature of the data -- how 
do the numbers of users and items tend to relate? I can suggest ways to 
approach architecting the recommender.

> [GSOC] integrate Mahout with Drupal/PHP
> ---
>
> Key: MAHOUT-345
> URL: https://issues.apache.org/jira/browse/MAHOUT-345
> Project: Mahout
>  Issue Type: Task
>  Components: Website
>Reporter: Daniel Xiaodan Zhou
>
> Drupal is a very popular open source web content management system. It's been 
> widely used in e-commerce sites, media sites, etc. This is a list of famous 
> site using Drupal: 
> http://socialcmsbuzz.com/45-drupal-sites-which-you-may-not-have-known-were-drupal-based-24092008/
> Integrate Mahout with Drupal would greatly increase the impact of Mahout in 
> web systems: any Drupal website can easily use Mahout to make content 
> recommendations or cluster contents.
> I'm a PhD student at University of Michigan, with a research focus on 
> recommender systems. Last year I participated GSOC 2009 with Drupal.org, and 
> developed a recommender system for Drupal. But that module was not as 
> sophisticated as Mahout. And I think it would be nice just to integrate 
> Mahout into Drupal rather than developing a separate Mahout-like module for 
> Drupal.
> Any comments? I can provide more information if people here are interested. 
> Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-03-22 Thread Jake Mannix
And to provide some options, I can say that I would certainly help, Sisir,
with getting the distributed SVD into the recommender framework.  While
there is not much fundamental "computer science" left to that, there is
a fair amount of high-performance *engineering* left to do in that
direction.

  -jake

On Mon, Mar 22, 2010 at 4:09 PM, Sean Owen  wrote:

> I also second the recommendation to pick one of these ideas and focus
> on it first, as it will be a lot more work to get it working,
> documented, and tested!
>
> I personally like the idea of getting a distributed SVD-based
> recommender into the project. The SVD works quite well and all we have
> now is a non-parallel implementation.
>
> However I support Jake's suggestion more that RBM would be most useful
> at the moment.
>
> Pick your area, dig in, and we can help you refine your ideas and begin.
>
> Sean
>


Stochastic SVD

2010-03-22 Thread Dmitriy Lyubimov
Hi all,

I had a chance to touch base quickly with Ted Dunning last weekend at the
Bay Area machine learning camp. It's my understanding that the main
advantage of this method is that a partial SVD can be achieved in a constant
number of MR jobs (Ted's analysis seemed to imply that number would be 4).
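
For anyone unfamiliar: as I understand the paper (Halko/Martinsson/Tropp,
which I believe is the one on MAHOUT-309), the heart of the method is a
randomized range finder. A single-node toy sketch follows, just to show why
so few passes over A are needed -- each dense product below is one pass,
i.e. one MR job when A lives on HDFS. This is my own illustration, not the
proposed Mahout code.

import java.util.Random;

public class SsvdSketch {

  /**
   * Returns Q, an m x k orthonormal basis approximating the range of A.
   * Assumes k <= min(m, n).
   */
  public static double[][] randomRangeFinder(double[][] a, int k) {
    int m = a.length;
    int n = a[0].length;
    Random rnd = new Random(1);

    // Y = A * Omega, where Omega is an n x k Gaussian random matrix.
    double[][] omega = new double[n][k];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < k; j++) {
        omega[i][j] = rnd.nextGaussian();
      }
    }
    double[][] y = new double[m][k];
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < k; j++) {
        for (int l = 0; l < n; l++) {
          y[i][j] += a[i][l] * omega[l][j];
        }
      }
    }

    // Orthonormalize Y's columns (modified Gram-Schmidt) to get Q.
    for (int j = 0; j < k; j++) {
      double norm = 0.0;
      for (int i = 0; i < m; i++) norm += y[i][j] * y[i][j];
      norm = Math.sqrt(norm);
      for (int i = 0; i < m; i++) y[i][j] /= norm;
      for (int j2 = j + 1; j2 < k; j2++) {
        double dot = 0.0;
        for (int i = 0; i < m; i++) dot += y[i][j] * y[i][j2];
        for (int i = 0; i < m; i++) y[i][j2] -= dot * y[i][j];
      }
    }
    // Next: B = Q^T * A (k x n, one more pass), then an in-memory SVD of
    // the small B gives U_B, Sigma, V^T; the left vectors of A are Q * U_B.
    return y;
  }
}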

I've been following Mahout for perhaps a couple of months and have read the
book (the first 6 chapters of it, anyway) in MEAP, and that's about it. But
I have a great interest in all the work happening in this project.


While it may be that our particular business problem can, for now, be
addressed by running a single-node iterative SVD (such as Lanczos iteration,
or one of LAPACK's methods), it is highly likely that won't remain true for
long. We also use Hadoop and its ecosystem for our platform, so Mahout comes
naturally into the picture (whereas MPI does not).

Anyway, starting next week I will have to spend time on that business need,
and my boss seems happy for me to contribute part of my time and results to
Mahout (I guess he also expects results as well... eventually :-) ). The
paper seems to be the one in issue MAHOUT-309; I skimmed it a bit, and I
have some questions about Ted's clarifications as given at the camp this
weekend versus this paper (if it is even the right one).

I will need some guidance if I am to do this, and I am wondering whether my
effort is welcome (provided I get some help with the details of Mahout and
the algorithms there). My selfish goal is to accelerate this method's
availability in Mahout.

Thank you very much.
-Dmitriy


Re: Stochastic SVD

2010-03-22 Thread Jake Mannix
Hi Dmitriy,

  Stochastic SVD is high on my list of pieces to get into Mahout as
well, but is partly dependent on getting some of Ted's murmurhash stuff
from the SGD work he's got sitting idle in a patch on MAHOUT-228.

  If you could help get MAHOUT-228 finished and put in trunk, we could
quickly move forward on MAHOUT-309.  I think this can be done in
possibly only 2 MR passes, but we can chat about that a bit more
as we dig into it. :)

  -jake

On Mon, Mar 22, 2010 at 4:33 PM, Dmitriy Lyubimov  wrote:

> Hi all,
>
> i had a chance to touch base quickly with Ted Dunning last weekend at the
> Bay Area machine learning camp. It's my understanding the main advantage of
> this method is that partial SVD can be achieved in a constant # of MR jobs
> (Ted's analysis seemed to imply that number would be 4) .
>
> 've been following Mahout for perhaps couple of  months and read the book
> (first 6 chapters of it anyway) in MEA, and that's about it. But i have a
> great interest in all the work happening in this project.
>
>
> While it my be the case that our particular business problem at the time
> may
> be addressed by running single-node iterative svd (such as lanczos
> iterative, one of lapack's methods), it is highly likely it will not be the
> case for too long. We also use Hadoop and ecosystem for our platform, so
> mahout comes naturally into picture (whereas MPI does not).
>
> Anyway, starting the next week, i will have to spend time on that business
> need, and my boss seems to be happy if i have a chance to contribute part
> of
> my time and results to Mahout (i guess he also expects results as well...
> eventually :-) ) . The paper seems to be the one in the issue MAHOUT-309, i
> skimmed it a little bit and i guess i have some questions in regards to
> Ted's clarifications as given at the camp this weekend and this paper (if
> it
> is even the right one).
>
> I guess i do need some guidance if i am to do this and i am wondering if my
> effort is welcome (provided i need some guidance on some details of Mahout
> and the algorithms there). I guess my selfish desire is to escalate method
> availability in Mahout.
>
> Thank you very much.
> -Dmitriy
>


Re: Stochastic SVD

2010-03-22 Thread Dmitriy Lyubimov
Thank you, Jake.

I haven't had a chance to read through MAHOUT-228 yet. I am guessing it is
something quite different from the murmur hash support in Hadoop, right?

I will read through it, though, and I may come back with more questions.

Thanks.
-Dmitriy

On Mon, Mar 22, 2010 at 4:38 PM, Jake Mannix  wrote:

> Hi Dmitriy,
>
>  Stochastic SVD is high on my list of pieces to get into Mahout as
> well, but is partly dependent on getting some of Ted's murmurhash stuff
> from the SGD work he's got sitting idle in a patch on MAHOUT-228.
>
>  If you could help get MAHOUT-228 finished and put in trunk, we could
> quickly move forward on MAHOUT-309.  I think this can be done in
> possibly only 2 MR passes, but we can chat about that a bit more
> as we dig into it. :)
>
>  -jake


Re: Stochastic SVD

2010-03-22 Thread Ted Dunning
It may also be possible to get by with only one MR pass if you only need to
compute reduced-dimensional representations for new data.

On Mon, Mar 22, 2010 at 4:38 PM, Jake Mannix  wrote:

>  If you could help get MAHOUT-228 finished and put in trunk, we could
> quickly move forward on MAHOUT-309.  I think this can be done in
> possibly only 2 MR passes, but we can chat about that a bit more
> as we dig into it. :)
>


Re: Stochastic SVD

2010-03-22 Thread Jake Mannix
Well, it takes one pass all the way through to the reducers to get the small
matrix from which the SVD will be computed, so I'm not sure how you'd project
the original data onto that SVD in the same pass...

I guess if you mean just do a random projection on the original data, you
can certainly do that in one pass, but that's random projection, not a
stochastic decomposition.

  -jake

On Mon, Mar 22, 2010 at 5:54 PM, Ted Dunning  wrote:

> It may also be possible to get by with only one MR pass if you only need to
> compute reduced dimensional representations for new data.
>
> On Mon, Mar 22, 2010 at 4:38 PM, Jake Mannix 
> wrote:
>
> >  If you could help get MAHOUT-228 finished and put in trunk, we could
> > quickly move forward on MAHOUT-309.  I think this can be done in
> > possibly only 2 MR passes, but we can chat about that a bit more
> > as we dig into it. :)
> >
>


Re: Stochastic SVD

2010-03-22 Thread Ted Dunning
You are probably right.  I had a wild hare tromp through my thoughts the
other day saying that one pass should be possible, but I can't reconstruct
the details just now.

On Mon, Mar 22, 2010 at 6:00 PM, Jake Mannix  wrote:

> I guess if you mean just do a random projection on the original data, you
> can certainly do that in one pass, but that's random projection, not a
> stochastic decomposition.
>


Re: Progress with Eclipse

2010-03-22 Thread Drew Farris
I checked out a clean trunk and got through step 5 without problems. Step 6
failed trying to pull the collections-codegen plugin 0.4-SNAPSHOT from a
repo; I'm not sure why the reactor is not picking it up. Nevertheless, I was
able to get past this by running a separate mvn clean install to push the
plugin to my local repo.

When all is said and done, I get a bunch of errors from mahout-collections
complaining about the @Override annotations; switching compiler compliance
to 1.6 of course eliminates these.
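
(Minimal illustration of the compliance issue: javac at the 1.5 source level
rejects @Override on a method that implements an interface method, while 1.6
accepts it:)

class Task implements Runnable {
  @Override  // error under 1.5 compliance, fine under 1.6
  public void run() { }
}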

The .checkstyle, .pmd and .ruleset files under each project cause svn
to show changes. We should set svn:ignore on these -- or should they
be checked in?

Without getting too specific, it strikes me that some of the stuff that
PMD is complaining about should be downgraded from 'Error high' or 'Error'
to some level of warning. For example, it appears that all of the modules
have errors of some level under the existing (default?) settings.

FWIW, here's my environment:
mvn --version
Apache Maven 2.2.1 (r801777; 2009-08-06 15:16:01-0400)
Java version: 1.6.0_16
Java home: /u01/opt/jdk1.6.0_16/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux" version: "2.6.28-18-server" arch: "i386" Family: "unix"

On Mon, Mar 22, 2010 at 9:35 AM, Benson Margulies  wrote:
> I need a brave eclipse user to try out the following, and then I have
> some questions for others.
>
> 1) Remove all .project and .classpath files from your tree.
> 2) cd to 'eclipse'.
> 3) Pick a new pathname for an eclipse workspace (call it WORKSPACE) in
> the following.
> 4) mvn -Psetup-eclipse-workspace -Declipse.workspace.dir=WORKSPACE
>
> This much will create WORKSPACE, copy some files into it, and set some
> global options.
>
> 5) cd .. (to the mahout top)
> 6) mvn -Psetup.eclipse -Declipse.workspace=WORKSPACE
>
> Now there will be .project and .classpath files.
>
> 7) start eclipse, select WORKSPACE
> 8) import projects from the mahout toplevel
>
> If all goes well, you will be presented with a lot of PMD complaints.
> I turned on PMD as part of the show, and it seems that we have a
> supply of PMD non-compliance. What do people think about the PMD rules
> we have checked in? Do we want to conform to them?
>


Re: Progress with Eclipse

2010-03-22 Thread Benson Margulies
I think an initial build is needed.

The files should not be checked in -- Eclipse requires one per project,
and the magic dust is pushing copies from a central location. Ignores are
fine. I have blanket ignores for those files in my personal settings, so I
didn't spot the issue.

I don't know the parentage of the PMD file that was checked in. It
might even have come from me a long time ago. I do hate warnings, so
my view, personally, is that we either keep+fix or remove.

Is anyone else interested in wrestling with PMD?


On Mon, Mar 22, 2010 at 9:41 PM, Drew Farris  wrote:
> Step 6 failed trying to pull the collections-codegen plugin 0.4-SNAPSHOT
> from a repo; I'm not sure why the reactor is not picking it up.
>
> The .checkstyle, .pmd and .ruleset files under each project cause svn
> to show changes. We should set svn:ignore on these -- or should they
> be checked in?
>
> Without getting too specific, it strikes me that some of the stuff that
> PMD is complaining about should be downgraded from 'Error high' or
> 'Error' to some level of warning.


Re: Progress with Eclipse

2010-03-22 Thread Drew Farris
On Mon, Mar 22, 2010 at 9:46 PM, Benson Margulies  wrote:
>
> The files should not be checked in -- Eclipse requires one per project,
> and the magic dust is pushing copies from a central location. Ignores
> are fine. I have blanket ignores for those files in my personal settings,
> so I didn't spot the issue.

svn:ignore's committed.
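
(For the record, roughly this in each module directory -- from memory, so
double-check the property value syntax:)

svn propset svn:ignore '.checkstyle
.pmd
.ruleset' .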

> I don't know the parentage of the PMD file that was checked in. It
> might even have come from me a long time ago. I do hate warnings, so
> my view, personally, is that we either keep+fix or remove.

Yes, I don't like warnings either, but until these are fixed or removed,
warnings are better than errors. I don't think all of PMD's gripes are
worth ignoring (e.g., method names starting with capital letters). I've
never used PMD before, so I'm likely to be of little help revising the
existing settings.


Re: Stochastic SVD

2010-03-22 Thread Jake Mannix
Actually, maybe what you were thinking (at least, what *I* am thinking) is
that you can indeed do it in one pass through the *original* data (i.e. you
can get away with never keeping a handle on the original data itself),
because on the "one pass" through that data you spit out MultipleOutputs:
one SequenceFile of the randomly projected data, which doesn't hit a reducer
at all, and a second output which is the outer product of those vectors
with themselves, which hits a summing reducer.

In this sense, while you need to pass over the original data's *size*
(in terms of number of rows) a second time, if you want to consider
it data to be played with (instead of just "training" data for use on a
smaller subset or even totally different set), you don't need to pass
over the original entire data *set* ever again.
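
In plain Java, the per-row bookkeeping that pass does amounts to something
like the sketch below (the sizes, Omega, and the class name are all
illustrative; none of the Hadoop plumbing is shown):

import java.util.Random;

/** One-pass sketch: project each row, and sum the outer products y*y^T. */
public class OnePassSketch {
  public static void main(String[] args) {
    int n = 1000, d = 500, k = 20;   // rows, original dim, projected dim
    Random rnd = new Random(42);

    // Random projection matrix Omega (d x k), fixed up front.
    double[][] omega = new double[d][k];
    for (int i = 0; i < d; i++) {
      for (int j = 0; j < k; j++) {
        omega[i][j] = rnd.nextGaussian();
      }
    }

    double[][] b = new double[k][k];  // accumulates sum over rows of y*y^T

    for (int row = 0; row < n; row++) {
      double[] a = randomRow(rnd, d); // stand-in for one row of the input

      // y = Omega^T * a : the projected row (the "first output")
      double[] y = new double[k];
      for (int j = 0; j < k; j++) {
        for (int i = 0; i < d; i++) {
          y[j] += omega[i][j] * a[i];
        }
      }
      // ...in the MR job, y would be written to the SequenceFile here...

      // B += y * y^T : the outer product (the "second output",
      // summed in the reducer)
      for (int p = 0; p < k; p++) {
        for (int q = 0; q < k; q++) {
          b[p][q] += y[p] * y[q];
        }
      }
    }
    // B = Y^T Y for the projected data Y; decompose this small k x k
    // matrix offline to recover the projected basis.
    System.out.println("B[0][0] = " + b[0][0]);
  }

  private static double[] randomRow(Random rnd, int d) {
    double[] a = new double[d];
    for (int i = 0; i < d; i++) {
      a[i] = rnd.nextDouble();
    }
    return a;
  }
}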

  -jake

On Mon, Mar 22, 2010 at 6:35 PM, Ted Dunning  wrote:

> You are probably right.  I had a wild hare tromp through my thoughts the
> other day saying that one pass should be possible, but I can't reconstruct
> the details just now.
>
> On Mon, Mar 22, 2010 at 6:00 PM, Jake Mannix 
> wrote:
>
> > I guess if you mean just do a random projection on the original data, you
> > can certainly do that in one pass, but that's random projection, not a
> > stochastic decomposition.
> >
>


[jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

2010-03-22 Thread Jake Mannix (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jake Mannix updated MAHOUT-228:
---

Attachment: MAHOUT-228.patch

*bump*

I think this is now the third time I've brought this patch up to date. It 
compiles, but the internal tests don't pass. I'm not sure why, as I haven't 
dug into them too deeply.

Ted, or anyone else with a desire to get Vowpal-Wabbit-style awesomeness into 
Mahout: want to take this patch for a spin and see what is up with it?

Or if you, Ted, don't have time to finish it yourself, could you at least 
check this patch out and document a little of what the rest of us need to do 
to get it up and running (and verified as working)?

> Need sequential logistic regression implementation using SGD techniques
> ---
>
> Key: MAHOUT-228
> URL: https://issues.apache.org/jira/browse/MAHOUT-228
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Reporter: Ted Dunning
> Fix For: 0.4
>
> Attachments: logP.csv, MAHOUT-228-3.patch, MAHOUT-228.patch, r.csv, 
> sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable 
> learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a 
> reasonable place to start.
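
(For reference, the per-example update SGD performs for logistic regression
-- the textbook form with learning rate $\eta$ and optional L2 penalty
$\lambda$, not anything specific to this patch:)

\[
w \leftarrow w + \eta \bigl( (y - \sigma(w^{\top} x))\,x - \lambda w \bigr),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \quad y \in \{0, 1\}
\]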

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

2010-03-22 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848581#action_12848581
 ] 

Ted Dunning commented on MAHOUT-228:


{quote}
Or if you, Ted, don't have time to finish it yourself, could you at least 
check this patch out and document a little of what the rest of us need to do 
to get it up and running (and verified as working)?
{quote}
That only sounds fair given what you have done so far.

Let me dig in tomorrow.



> Need sequential logistic regression implementation using SGD techniques
> ---
>
> Key: MAHOUT-228
> URL: https://issues.apache.org/jira/browse/MAHOUT-228
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Reporter: Ted Dunning
> Fix For: 0.4
>
> Attachments: logP.csv, MAHOUT-228-3.patch, MAHOUT-228.patch, r.csv, 
> sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable 
> learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a 
> reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.