Re: Google Summer of Code: Bring out your projects

2010-03-12 Thread Grant Ingersoll

On Mar 12, 2010, at 1:22 AM, Robin Anil wrote:

> Shall I go and put some of the ideas up? I will do it as a whole for the
> project. Later we can re-assign things, maybe? How does that sound? Unlike
> other projects, we can't really go and put up a proposal like "Implement
> back-propagation" and expect a student to take it up and reduce things to
> map/reduce.
> 
> Some of the ideas (I am going to be really ambitious/vague here, but will write
> clear expectations or guidelines on what an ideal proposal is):
> 
> 1) Implement a cool classifier over map/reduce
> 2) Implement a cool clustering algorithm on map/reduce
> 3) Implement a meta-learner to plug in to various classifiers in Mahout and
> have bagging, boosting support.
> 4) Continuous performance benchmarking/dashboard, maybe wrappers over EC2
> 5) Create matrix implementations with MySQL and NoSQL (HBase, Cassandra)
> access for all the algorithms to use.
> 6) Implement some of the ideas from the Netflix top 5 to boost the
> recommendations package
> 7) Visualization tool for clustering, classification or recommendation.
> Ability to explain (optional)
> 8) Improve the mahout-math package

9. Implement M/R Tika integration to take "rich" documents on HDFS and output 
Vectors.   Likely not a full Summer of Work there, but could be part of some 
larger "Utils" capabilities focused on making it easier to consume Mahout.  
Also included: Finish ARFF compatibility.  
10. Benchmark.  Break the record?

I think we should still solicit ideas on list here that we can put up on JIRA.

> 
> 
> Who is free to mentor this year? I.e., giving 5-6 hours weekly to a student,
> hearing them crib (sorry Ian and Isabel :P) and giving words of encouragement.
> And yes, code reviews.

I'm in.

Re: Google Summer of Code: Bring out your projects

2010-03-11 Thread Robin Anil
Shall I go and put some of the ideas up? I will do it as a whole for the
project. Later we can re-assign things, maybe? How does that sound? Unlike
other projects, we can't really go and put up a proposal like "Implement
back-propagation" and expect a student to take it up and reduce things to
map/reduce.

Some of the ideas (I am going to be really ambitious/vague here, but will write
clear expectations or guidelines on what an ideal proposal is):

1) Implement a cool classifier over map/reduce
2) Implement a cool clustering algorithm on map/reduce
3) Implement a meta-learner to plug in to various classifiers in Mahout and
have bagging, boosting support.
4) Continuous performance benchmarking/dashboard, maybe wrappers over EC2
5) Create matrix implementations with MySQL and NoSQL (HBase, Cassandra)
access for all the algorithms to use.
6) Implement some of the ideas from the Netflix top 5 to boost the
recommendations package
7) Visualization tool for clustering, classification or recommendation.
Ability to explain (optional)
8) Improve the mahout-math package


Who is free to mentor this year? I.e., giving 5-6 hours weekly to a student,
hearing them crib (sorry Ian and Isabel :P) and giving words of encouragement.
And yes, code reviews.

Robin


Re: Google Summer of Code Proposal Submission

2009-03-27 Thread Philip Ramsey
Grant,

Thank you very much for the feedback! I'll make those changes and
elaborations to my proposal very soon.
Our thinking with the bi-grams is that, if we can maintain a relatively low
error-rate in computing sets of similar words at a grassroots level, then we
can have a powerful base case for inferring grammars on n-gram strings. But
we haven't started thinking in great detail about how we'll graduate to a
top-down parse. A lot of our current work is building off of research done
by Lillian Lee over at Cornell. Their research, in a restricted test space
of transitive verb->object noun pairs, looked at the benefits and
limitations of a number of different similarity/distance measures. Based on
their results, we've generalized the test space to an entire raw text
corpus, and have been tweaking their measures to get better scores and be
optimal over a cluster.

Again, thanks for the feedback,
Philip

On Thu, Mar 26, 2009 at 1:11 AM, Grant Ingersoll wrote:

> Hi Philip,
>
> Thanks for the proposal.  Sounds interesting.  For the proposal that you
> submit, you should make sure to add references, details on how you plan to
> implement, etc.  Of course, no need to do that in great depth on the wiki.
>
> Also, have you looked at going beyond just bi-grams?  Not sure if it makes
> sense or not, but was just curious.   Also, you should have a look at the
> Watchmaker stuff that is in Mahout already and maybe be able to address how
> what you are proposing relates.
>
> -Grant
>
>
> On Mar 25, 2009, at 7:37 PM, Philip Ramsey wrote:
>
>  Hello Folks,
>>
>> I'm a student at The Evergreen State College and yesterday I submitted a
>> proposal for the GSoC project to the wiki. I'm sending a link to my
>> submission, with hopes that some of you might have feedback or questions
>> or
>> advice:
>>
>>
>> http://wiki.apache.org/general/SoC2009/PhilipRamsey-Mahout-AlgorithmsProposal
>>
>> Thanks a lot,
>> Philip Ramsey
>> goal.oriented.des...@gmail.com
>>
>
>


Re: Google Summer of Code Proposal Submission

2009-03-26 Thread Grant Ingersoll

Hi Philip,

Thanks for the proposal.  Sounds interesting.  For the proposal that  
you submit, you should make sure to add references, details on how you  
plan to implement, etc.  Of course, no need to do that in great depth  
on the wiki.


Also, have you looked at going beyond just bi-grams?  Not sure if it  
makes sense or not, but was just curious.   Also, you should have a  
look at the Watchmaker stuff that is in Mahout already and maybe be  
able to address how what you are proposing relates.


-Grant

On Mar 25, 2009, at 7:37 PM, Philip Ramsey wrote:


Hello Folks,

I'm a student at The Evergreen State College and yesterday I submitted a
proposal for the GSoC project to the wiki. I'm sending a link to my
submission, with hopes that some of you might have feedback or questions or
advice:

http://wiki.apache.org/general/SoC2009/PhilipRamsey-Mahout-AlgorithmsProposal

Thanks a lot,
Philip Ramsey
goal.oriented.des...@gmail.com




Re: Google Summer of Code

2008-04-22 Thread Grant Ingersoll
Also, have a look at: http://www.apache.org/dev/ for more info.  It  
would be helpful if all people (esp. GSOCers) who plan on contributing  
code file a CLA (http://www.apache.org/licenses/#clas) although it is  
not explicitly required, just makes things a bit nicer for us on the  
legal side.


-Grant

On Apr 22, 2008, at 7:58 AM, Grant Ingersoll wrote:


Welcome aboard!

We had a lot of very nice proposals, including a couple that were,
unfortunately, just below the cutoff.  We (the ASF) had originally
hoped to get more slots from Google, but they had an even bigger
response from other projects as well.  As it was, Mahout alone had
something like 15 applicants, most of which were high quality and
well thought out.  For those who didn't get selected, please do feel
welcome here with the rest of us "volunteers"  :-).


To those accepted, do try to keep in mind that we should keep  
project discussions on the list.  I think it is fine to ask mentors  
questions in private related to administrative stuff, but if you  
have questions about how to code something, etc. those are best  
handled on this list, as it creates a history and allows others to  
understand design decisions, etc.


Cheers,
Grant


On Apr 21, 2008, at 10:28 PM, Robin Anil wrote:


Hi Everyone,
This is one of those days where I wake up and see that I
have got accepted to GSoC with Mahout (:32-all-out:). I am really excited
to kick-start the work. I know I have a lot to understand in terms of coding
practices and the whole workflow/process. I would like to congratulate and
say hi to my fellow GSoCers Farid, Yun and Abdel, hi to my mentor Ian
Holsman, and to the rest of the community.

I am usually online on Google Talk; if you use it, do add me:
[EMAIL PROTECTED]

Cheers and Good Day
Robin








Re: Google Summer of Code

2008-04-22 Thread Grant Ingersoll

Welcome aboard!

We had a lot of very nice proposals, including a couple that were,
unfortunately, just below the cutoff.  We (the ASF) had originally
hoped to get more slots from Google, but they had an even bigger
response from other projects as well.  As it was, Mahout alone had
something like 15 applicants, most of which were high quality and
well thought out.  For those who didn't get selected, please do feel
welcome here with the rest of us "volunteers"  :-).


To those accepted, do try to keep in mind that we should keep project  
discussions on the list.  I think it is fine to ask mentors questions  
in private related to administrative stuff, but if you have questions  
about how to code something, etc. those are best handled on this list,  
as it creates a history and allows others to understand design  
decisions, etc.


Cheers,
Grant


On Apr 21, 2008, at 10:28 PM, Robin Anil wrote:


Hi Everyone,
This is one of those days where I wake up and see that I
have got accepted to GSoC with Mahout (:32-all-out:). I am really excited
to kick-start the work. I know I have a lot to understand in terms of coding
practices and the whole workflow/process. I would like to congratulate and
say hi to my fellow GSoCers Farid, Yun and Abdel, hi to my mentor Ian
Holsman, and to the rest of the community.

I am usually online on Google Talk; if you use it, do add me:
[EMAIL PROTECTED]

Cheers and Good Day
Robin





Re: Google Summer of Code

2008-04-21 Thread Isabel Drost
On Tuesday 22 April 2008, deneche abdelhakim wrote:
> So we are four students, that's cool. I wish us good work and great fun in
> this summer.

I am really happy we received a few more slots than expected.  Welcome to the
Mahout project to both of you, and congratulations on the successful GSoC
application. I wish you a lot of fun working on your proposed topics and
hope that all students can finish their work successfully. I think not only
your individual mentors will help you, but as usual in Apache land the whole
community will be happy to work with you on the mailing list.

There were quite a few applications that unfortunately were not accepted. I 
would like to invite those who did not get selected to stick around and 
contribute. As Mahout is still young, it is especially easy to make a 
difference. So if you are interested in machine learning, we would be happy 
to welcome you here, even if Google does not sponsor your summer.

Isabel

-- 
Imbalance of power corrupts and monopoly of power corrupts absolutely.  
-- 
Genji
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  




RE : Google Summer of Code

2008-04-21 Thread deneche abdelhakim
Hi Robin, 

I am very happy that I've been accepted, thanks to the Mahout Community that 
kindly commented on my draft.

So we are four students, that's cool. I wish us good work and great fun in this 
summer.

Hakim


Robin Anil <[EMAIL PROTECTED]> wrote:
Hi Everyone,
This is one of those days where I wake up and see that I
have got accepted to GSoC with Mahout (:32-all-out:). I am really excited
to kick-start the work. I know I have a lot to understand in terms of coding
practices and the whole workflow/process. I would like to congratulate and
say hi to my fellow GSoCers Farid, Yun and Abdel, hi to my mentor Ian
Holsman, and to the rest of the community.

I am usually online on Google Talk; if you use it, do add me:
[EMAIL PROTECTED]

Cheers and Good Day
Robin



Re: Google Summer of Code

2008-03-25 Thread Isabel Drost
On Tuesday 25 March 2008, Marko Novakovic wrote:
> I understood kMeans algorithm from paper related to
> this project.
> Where could I find this code?

Just have a look into the svn. The URL to access it is on the website.


> Does it exist any algotithm, which is not realized?

I think you can always have a look into our JIRA to find out about which parts 
are currently being worked on.

Isabel




Re: Google Summer of Code

2008-03-25 Thread Marko Novakovic
I understood the k-Means algorithm from the paper related to
this project.
Where can I find the code?
I could implement any algorithm; it needn't be
k-Means. Which algorithm would be most interesting for you
to have implemented? Is there any algorithm that is not yet
implemented?

My mentor at my faculty told me that clustering
is the most interesting to him. I will consult with
him tomorrow to understand how clustering would be deployed in
his search engine.

Does Apache have any conventions for writing student
abstracts and descriptions in the application? And what are
the criteria for student selection?

Best Regards

--- Isabel Drost <[EMAIL PROTECTED]>
wrote:

> On Tuesday 25 March 2008, Marko Novakovic wrote:
> > I attached beta version of presentation.
> > I must consult with mentor form my college to
> examine
> > exact which the role of clusterin is in this
> system.
> 
> Hmm, one of the slides talks about using the
> clustering algorithm to identify 
> new topics. I guess I still do not get the full
> picture.
> 
> Did you happen to have a chance to look at the
> k-Means code in the repository 
> yet?
> 
> Isabel
> 



  



Re: Google Summer of Code

2008-03-25 Thread Isabel Drost
On Tuesday 25 March 2008, Marko Novakovic wrote:
> I attached beta version of presentation.
> I must consult with mentor form my college to examine
> exact which the role of clusterin is in this system.

Hmm, one of the slides talks about using the clustering algorithm to identify 
new topics. I guess I still do not get the full picture.

Did you happen to have a chance to look at the k-Means code in the repository 
yet?

Isabel



Re: Google Summer of Code

2008-03-24 Thread Isabel Drost
On Tuesday 25 March 2008, Marko Novakovic wrote:
> Other components will be clasifier, crawler and
> indexer.

So it will be the typical setup: Crawl web pages, classify them as positive or 
negative and in the end index them correctly? I would be especially 
interested in how the classifier will be built - as far as you can share any 
such knowledge on a public mailing list before September '08.


> I have idea about architecture in which all 
> components will be run at each machine.

I think the system architecture was pretty clear from the slides you sent. It 
would be nice if you could briefly sketch it on list, as the slides have not 
survived being sent to a mailing list :)


> My idea for clustering would be making relevance by
> properties, like repetition keywods on page, relevant
> tags, keyword in subject etc. For each property will
> be allocated one axis and from n-dimensional space
> clustering machine will group pages by proper
> algrithm, in my case k-Means.

If I understood the task correctly the goal is to build a system that is 
capable of separating posts that express some opinion from objective ones and 
afterwards to group positive vs. negative postings, right?

I do not yet see, how the clustering algorithm k-means helps you achieve this 
task.


> If you want I will be able to describe detailed
> relevance for clustering with proper examples
> tomorrow.

Sounds good.

Isabel




Re: Google Summer of Code

2008-03-24 Thread Isabel Drost
On Tuesday 25 March 2008, Josh Harguess wrote:
> I have completed an application for Google Summer of Code for the
> implementation of the PCA algorithm in Mahout.  My research is directly
> related to the use of PCA, so I am very familiar with that algorithm.

Great!


> However, since I work in the area of pattern recognition and machine
> learning, I am also familiar with the other algorithms listed on your
> site. Since there was not a ranking of desired algorithms, I chose PCA, 
> but if there is a more immediate need for a different algorithm, I can
> most likely help with that instead / as well.

I think it is fine to choose the algorithm you are most familiar with. 
Currently we are happy to have someone who takes care of any of the 
algorithms. As you have experience with any of the algorithms, you could also 
contribute by taking part in the discussions on the mailing lists.

Isabel




Re: Google Summer of Code

2008-03-24 Thread Marko Novakovic
The clustering module will be one component of the search engine.
Other components will be a classifier, a crawler and an
indexer. I have an idea for an architecture in which all
components run on each machine.
Web pages will be sent to CPUs by a hash function,
which will change when working CPUs are added,
removed or fail.
Between the crawler and the rest of the system there will
be a queue, from which pages will be scheduled by hash.
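
One possible way to get a hash function that adapts when CPUs are added or lost is rendezvous (highest-random-weight) hashing. This is only a sketch of that swapped-in technique, not something the message itself names; the node names, the example URLs and the helper functions `score` and `assign` are made up for illustration.

```python
import hashlib


def score(node: str, page_url: str) -> int:
    """Deterministic pseudo-random weight for a (node, page) pair."""
    digest = hashlib.sha256(f"{node}|{page_url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign(page_url: str, nodes: list[str]) -> str:
    """Send the page to the working node with the highest weight."""
    if not nodes:
        raise ValueError("no working nodes")
    return max(nodes, key=lambda node: score(node, page_url))


# Hypothetical cluster and crawl queue.
nodes = ["cpu-0", "cpu-1", "cpu-2", "cpu-3"]
pages = [f"http://example.org/page/{i}" for i in range(1000)]
before = {p: assign(p, nodes) for p in pages}

# Simulate cpu-2 failing: only the pages it owned need a new home.
survivors = [n for n in nodes if n != "cpu-2"]
after = {p: assign(p, survivors) for p in pages}
moved = [p for p in pages if before[p] != after[p]]
```

The point of this choice is that a page moves only if the node it was on disappears, so the queue between the crawler and the rest of the system does not have to reshuffle everything after a failure.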

My idea for clustering is to define relevance by
properties, such as keyword repetitions on the page, relevant
tags, keywords in the subject etc. Each property will
be allocated one axis, and in the resulting n-dimensional space
the clustering machine will group pages with a suitable
algorithm, in my case k-Means.
If you want, I can describe the relevance properties
for clustering in detail, with proper examples,
tomorrow.
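
To make the "one axis per property" idea concrete, here is a minimal single-machine sketch of k-Means over page feature vectors. The three features, the sample values and the choice of initial centers are invented for illustration, and this is not the k-Means code in the Mahout repository.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def k_means(points, initial_centers, max_iterations=20):
    """Plain Lloyd iterations: assign to nearest center, then recompute."""
    centers = list(initial_centers)
    groups = []
    for _ in range(max_iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        new_centers = [centroid(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


# Hypothetical pages as (keyword repetitions, relevant tags, keyword in subject).
pages = [(12, 3, 1), (10, 4, 1), (11, 2, 1), (1, 0, 0), (0, 1, 0), (2, 0, 0)]
centers, groups = k_means(pages, initial_centers=[pages[0], pages[3]])
```

With these made-up values the keyword-heavy pages end up in one group and the sparse pages in the other. In practice the axes would probably need scaling so that one property does not dominate the distance.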

Greetings

--- Isabel Drost <[EMAIL PROTECTED]>
wrote:

> On Monday 24 March 2008, Marko Novakovic wrote:
> > and I am interesting to implement this clustering
> > algorithm at Handop platform.
> 
> So you would like to get a distributed clustering
> algorithm for grouping 
> search results? It would be nice to hear more about
> your approach to this 
> problem. 
> 
> There are a few guys here who have been working on
> clustering search results 
> already. I guess they might be able to provide some
> help as well.
> 
> We already have a k-Means implementation, but so far
> it has not been 
> integrated into a search result clustering context.
> 
> Isabel
> 




Re: Google Summer of Code

2008-03-24 Thread Isabel Drost
On Monday 24 March 2008, Marko Novakovic wrote:
> and I am interesting to implement this clustering
> algorithm at Handop platform.

So you would like to get a distributed clustering algorithm for grouping 
search results? It would be nice to hear more about your approach to this 
problem. 

There are a few guys here who have been working on clustering search results 
already. I guess they might be able to provide some help as well.

We already have a k-Means implementation, but so far it has not been 
integrated into a search result clustering context.

Isabel



Re: Google summer of code & mahout-machine-learning

2008-03-24 Thread Isabel Drost
On Wednesday 19 March 2008, Frédéric wrote:
> Hello,
>
> I am a french student, currently studying distributed systems in Finland.

Sounds interesting. What are you working on?


> To be honest I don't know all the algorithms listed in the paper.

I think it is sufficient to either know at least one of them enough to work on 
a scalable, at best parallel version of it. Another option that I consider 
interesting is to look for some real world problem one would like to solve 
with machine learning and to work on the solution.


> Unfortunately, I have some exams to take this week and I'm sorry for
> not having enough time to give you more details. But I will give you
> more informations about my ideas and my skills related to this project
> as soon as possible.

Looking forward to reading more about your ideas.

Isabel



Re: Google summer of code & mahout-machine-learning

2008-03-19 Thread Grant Ingersoll

Sounds good.

On Mar 19, 2008, at 7:51 AM, Frédéric wrote:


Hello,

I am a French student, currently studying distributed systems in Finland.

I'm also a Lucene user and I find the subject about the Map/Reduce
enabled machine learning algorithm pretty interesting.

To be honest I don't know all the algorithms listed in the paper, but
I have implemented genetic algorithms and artificial neural
networks for both my academic and personal projects.
I have a good knowledge of Java and I completed a successful
Google Summer of Code last year.

Unfortunately, I have some exams to take this week and I'm sorry for
not having enough time to give you more details. But I will give you
more information about my ideas and my skills related to this project
as soon as possible.

Cheers

Frédéric Rechtenstein





RE: Google Summer of Code[esp. More Clustering]

2008-03-11 Thread Jeff Eastman
Hi Matthew,

I've implemented a minimal, non-MR version of the algorithm below to see how
it would behave. The operative code is in TestMeanShift.testMeanShift() and
MeanShiftCanopy.mergeCanopy(). The rest of the MR classes are stuff I copied
from Canopy so you can ignore them.

The TestMeanShift.setUp() method builds a 100x3 point matrix that represents
a 10x10 image with the diagonal intensified (i.e. a '\' character mask).
Then testMeanShift() creates an initial set of 100 canopies from it and
iterates over the canopies, merging their centroids into a new set of
canopies until the canopy list size does not change any more. Finally it
prints out the canopies that were found for each cell in the original image.

Every time two canopies come within T2 distance of each other they merge,
reducing the number of canopies. The original points that were bound to each
canopy are also merged so that, at the end of the iteration, the original
points are available in their respective canopies.

Depending upon the values chosen for T1 and T2, the process either converges
quickly or slowly. The loop terminates before actual convergence is
achieved, but it does seem to cluster the input coherently.

I hesitate to call this MeanShift but it is something similar that follows
the same general algorithm, as I understand it at least. I hope you find it
interesting.
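
For readers without the code at hand, the loop described above might look roughly like the following. This is a sketch in Python, not the actual TestMeanShift or MeanShiftCanopy code; the T1 and T2 values, the intensity values and the helper names are assumptions, since the message does not give them.

```python
import math

T1, T2 = 3.0, 1.0  # assumed thresholds; the message does not state them


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


# 100 x 3 matrix: (row, column, intensity) for a 10x10 image whose
# main diagonal is brighter than the background (intensities are made up).
points = [(float(r), float(c), 9.0 if r == c else 1.0)
          for r in range(10) for c in range(10)]

# Each canopy is (center, bound_points); start with one canopy per point.
canopies = [(p, [p]) for p in points]


def iterate(canopies):
    # Shift: move every center to the centroid of the centers within T1.
    centers = [c for c, _ in canopies]
    shifted = [(mean([o for o in centers if math.dist(c, o) <= T1]), bound)
               for c, bound in canopies]
    # Merge: fold each shifted canopy into the first kept canopy within T2,
    # carrying its bound original points along.
    merged = []
    for center, bound in shifted:
        for i, (m_center, m_bound) in enumerate(merged):
            if math.dist(center, m_center) <= T2:
                merged[i] = (m_center, m_bound + bound)
                break
        else:
            merged.append((center, list(bound)))
    return merged


# Stop as soon as an iteration no longer reduces the number of canopies.
while True:
    next_canopies = iterate(canopies)
    done = len(next_canopies) == len(canopies)
    canopies = next_canopies
    if done:
        break
```

As in the description, the stopping rule only watches the canopy count, so the loop can end before the centers have fully converged, and every original point remains bound to exactly one canopy.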

Jeff


> -----Original Message-----
> From: Jeff Eastman [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 10, 2008 9:09 PM
> To: mahout-dev@lucene.apache.org
> Subject: RE: Google Summer of Code[esp. More Clustering]
> 
> Hi Matthew,
> 
> I'd like to pursue that canopy thought a little further and mix it in with
> your sub sampling idea. Optimizing can come later, once we figure out how
> to
> do mean-shift in M/R at all. How about this?
> 
> 1. Each mean-shift iteration consists of a canopy clustering of all the
> points, with T1 set to the desired sampling resolution (h?) and T2 set to
> 1.
> This will create one canopy centered on each point in the input set which
> contains all of its neighbors that are close enough to influence its next
> position in its trajectory (the window?).
> 
> 2. We then calculate the centroid of each canopy (that's actually done
> already by the canopy cluster reducer). Is this centroid not also the
> weighted average you desire for the next location of the point at its
> center?
> 
> 3. As the computation proceeds, the canopies will collapse together as
> their
> various centroids move inside the T2=1 radius. At the point when all
> points
> have converged, the remaining canopies will be the mean-shift clusters
> (modes?) of the dataset and their contents will be the migrated points in
> each cluster.
> 
> 4. If each original point is duplicated as its own payload, then the
> iterations will produce clusters of migrated points whose payloads are the
> final contents of each cluster.
> 
> Can you wrap your mind around this enough to validate my assumptions?
> 
> Jeff
> 
> > -----Original Message-----
> > From: Matthew Riley [mailto:[EMAIL PROTECTED]
> > Sent: Monday, March 10, 2008 5:58 PM
> > To: mahout-dev@lucene.apache.org
> > Subject: Re: Google Summer of Code[esp. More Clustering]
> >
> > Hi Jeff-
> >
> > I think your "basin of attraction" understanding is right on. I also
> like
> > your ideas for distributing the mean-shift iterations by following a
> > canopy-style method. My intuition was a little different, and I would
> like
> > to hear your ideas on it:
> >
> > Just to make sure we're on the same page
> > Say we have 1 million points in our original dataset, and we want to
> > cluster
> > by mean-shift. At each iteration of mean-shift we subsample (say) 10,000
> > points from the original dataset and follow the gradient of those points
> > to
> > the region of highest density (and as we saw from the paper, rather than
> > calculate the gradient itself we can equivalently compute the weighted
> > average of our subsampled points and move the centroid to that point).
> > This
> > part seems fairly straightforward to distribute - we just send a
> different
> > subsampled set to each processor and each processor returns the final
> > centroid for that set.
> >
> > The problem I see is that 10,000 points (or whatever value we choose),
> may
> > be too much for a single processor if we have to compute the distance to
> > every single point when we compute the weighted mean. My thought here
> was
> > to
> > exploit the fact that we're using a kernel function (gaussian, uniform,
> > etc.) in the weighted mean calculation and that kernel will have a set
> > radius.

RE: Google Summer of Code[esp. More Clustering]

2008-03-10 Thread Jeff Eastman
Hi Matthew,

I'd like to pursue that canopy thought a little further and mix it in with
your sub sampling idea. Optimizing can come later, once we figure out how to
do mean-shift in M/R at all. How about this?

1. Each mean-shift iteration consists of a canopy clustering of all the
points, with T1 set to the desired sampling resolution (h?) and T2 set to 1.
This will create one canopy centered on each point in the input set which
contains all of its neighbors that are close enough to influence its next
position in its trajectory (the window?).

2. We then calculate the centroid of each canopy (that's actually done
already by the canopy cluster reducer). Is this centroid not also the
weighted average you desire for the next location of the point at its
center?

3. As the computation proceeds, the canopies will collapse together as their
various centroids move inside the T2=1 radius. At the point when all points
have converged, the remaining canopies will be the mean-shift clusters
(modes?) of the dataset and their contents will be the migrated points in
each cluster.

4. If each original point is duplicated as its own payload, then the
iterations will produce clusters of migrated points whose payloads are the
final contents of each cluster.

Can you wrap your mind around this enough to validate my assumptions? 
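
On the question in step 2: for a uniform (flat) kernel of radius h, the mean-shift update is exactly the centroid of the points inside the window, so the canopy centroid would serve as the next position. For a Gaussian kernel the update is a distance-weighted mean and generally differs. The small sketch below compares the two on made-up 2-d points; the function names and the bandwidth are illustrative assumptions.

```python
import math


def flat_kernel_shift(x, points, h):
    """Centroid of the points within radius h of x (uniform kernel).

    Assumes x is itself one of the points, so the window is never empty.
    """
    window = [p for p in points if math.dist(x, p) <= h]
    return tuple(sum(p[i] for p in window) / len(window)
                 for i in range(len(x)))


def gaussian_kernel_shift(x, points, h):
    """Weighted mean of all points with weights exp(-d^2 / (2 h^2))."""
    weights = [math.exp(-math.dist(x, p) ** 2 / (2 * h * h)) for p in points]
    total = sum(weights)
    return tuple(sum(w * p[i] for w, p in zip(weights, points)) / total
                 for i in range(len(x)))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
x = (0.0, 0.0)
flat = flat_kernel_shift(x, points, h=2.0)       # centroid of the 3 near points
gauss = gaussian_kernel_shift(x, points, h=2.0)  # all 4 points, down-weighted
```

So treating the centroid as the next location amounts to choosing the uniform kernel; any other kernel would need the weights carried through the reducer.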

Jeff

> -----Original Message-----
> From: Matthew Riley [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 10, 2008 5:58 PM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Google Summer of Code[esp. More Clustering]
> 
> Hi Jeff-
> 
> I think your "basin of attraction" understanding is right on. I also like
> your ideas for distributing the mean-shift iterations by following a
> canopy-style method. My intuition was a little different, and I would like
> to hear your ideas on it:
> 
> Just to make sure we're on the same page
> Say we have 1 million points in our original dataset, and we want to
> cluster
> by mean-shift. At each iteration of mean-shift we subsample (say) 10,000
> points from the original dataset and follow the gradient of those points
> to
> the region of highest density (and as we saw from the paper, rather than
> calculate the gradient itself we can equivalently compute the weighted
> average of our subsampled points and move the centroid to that point).
> This
> part seems fairly straightforward to distribute - we just send a different
> subsampled set to each processor and each processor returns the final
> centroid for that set.
> 
> The problem I see is that 10,000 points (or whatever value we choose), may
> be too much for a single processor if we have to compute the distance to
> every single point when we compute the weighted mean. My thought here was
> to
> exploit the fact that we're using a kernel function (gaussian, uniform,
> etc.) in the weighted mean calculation and that kernel will have a set
> radius. Because the radius is static, it may be easy to (quickly) identify
> the points that we must consider in the calculation (i.e. those within the
> radius) by using a locality sensitive hashing scheme, tuned to that
> particular radius. Of course, the degree of advantage we get from this
> method will depend on the data itself, but intuitively I think we will
> usually see a dramatic improvement.
> 
> Honestly, I should do more background work developing this idea, and
> possibly try a matlab implementation to test the feasibility. This sounds
> more like a research paper than something we should dive into immediately,
> but I wanted to share the idea and get some feedback if anyone has
> thoughts...
> 
> Matt
> 
> 
> On Mon, Mar 10, 2008 at 11:29 AM, Jeff Eastman <[EMAIL PROTECTED]>
> wrote:
> 
> > Hi Matthew,
> >
> > I've been looking over the mean-shift papers for the last several days.
> > While the details of the math are still sinking in, it looks like the
> > basic algorithm might be summarized thusly:
> >
> > Points in an n-d feature space are migrated iteratively in the direction
> > of maxima in their local density functions. Points within a "basin of
> > attraction" all converge to the same maxima and thus belong to the same
> > cluster.
> >
> > A physical analogy might be(?):
> >
> > Gas particles in 3-space, operating with gravitational attraction but
> > without momentum, would tend to cluster similarly.
> >
> > The algorithm seems to require that each point be compared with every
> > other point. This might be taken to require each mapper to see all of
> > the points, thus frustrating scalability. OTOH, Canopy clustering avoids
> > this by clustering the clusters produced by the subsets of points seen
> > by each mapper. 

Re: Google Summer of Code[esp. More Clustering]

2008-03-10 Thread Ted Dunning
>> iterative structure over clusters (a much smaller, constant number) that
>> might be employed.
>> 
>> There is a lot of locality in the local density function window, and
>> this could perhaps be exploited. If points could be pre-clustered (as
>> canopy is often used to prime the k-means iterations), parallelization
>> might be feasible.
>> 
>> Are these observations within a "basin of attraction" to your
>> understanding of mean-shift ?
>> 
>> Jeff
>> 
>> 
>> 
>> -----Original Message-----
>> From: Matthew Riley [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, March 06, 2008 11:46 AM
>> To: mahout-dev@lucene.apache.org
>> Subject: Re: Google Summer of Code[esp. More Clustering]
>> 
>> Hey Jeff-
>> 
>> I'm certainly willing to put some energy into developing implementations
>> of
>> these algorithms, and it's good to hear that you may be interested in
>> guiding us in the right direction.
>> 
>> Here are the references I learned the algorithms from- some are more
>> detailed than others:
>> 
>> Mean-Shift clustering was introduced here and this paper is a thorough
>> reference:
>> Mean-Shift: A Robust Approach to Feature Space Analysis
>> http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf
>> 
>> And here's a PDF with just guts of the algorithm outlined:
>> homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
>> 
>> It looks like there isn't a definitive reference for the k-means
>> approximation with randomized k-d trees, but there are promising results
>> introduced here:
>> 
>> Object retrieval with large vocabularies and fast spatial matching:
>> http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf
>>
>> And a deeper explanation of the technique here:
>> 
>> Randomized KD-Trees for Real-Time Keypoint Detection:
>> ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521
>> 
>> Let me know what you think.
>> 
>> Matt
>> 
>> On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <[EMAIL PROTECTED]>
>> wrote:
>> 
>>> Hi Matthew,
>>> 
>>> As with most open source projects, "interest" is mainly a function of
>>> the willingness of somebody to contribute their energy. Clustering is
>>> certainly within the scope of the project. I'd be interested in
>>> exploring additional clustering algorithms with you and your
>> colleague.
>>> I'm a complete noob in this area and it is always enlightening to work
>>> with students who have more current theoretical exposures.
>>> 
>>> Do you have some links on these approaches that you find particularly
>>> helpful?
>>> 
>>> Jeff
>>> 
>>> -----Original Message-----
>>> From: Matthew Riley [mailto:[EMAIL PROTECTED]
>>> Sent: Wednesday, March 05, 2008 11:11 PM
>>> To: mahout-dev@lucene.apache.org; [EMAIL PROTECTED]
>>> Subject: Re: Google Summer of Code
>>> 
>>> Hey everyone-
>>> 
>>> I've been watching the mailing list for a little while now, hoping to
>>> contribute once I became more familiar, but I wanted to jump in here
>> now
>>> and
>>> express my interest in the Summer of Code project. I'm currently a
>>> graduate
>>> student in electrical engineering at UT-Austin working in computer
>>> vision,
>>> which is closely tied to many of the problems Mahout is addressing
>>> (especially in my area of content-based retrieval).
>>> 
>>> What can I do to help out?
>>> 
>>> I've discussed some potential Mahout projects with another student
>>> recently-
>>> mostly focused around approximate k-means algorithms (since that's a
>>> problem
>>> I've been working on lately). It sounds like you guys are already
>>> implementing canopy clustering for k-means- Is there any interest in
>>> developing another approximation algorithm based on randomized
>> kd-trees
>>> for
>>> high dimensional data? What about mean-shift clustering?
>>> 
>>> Again, I would be glad to help in any way I can.
>>> 
>>> Matt
>>> 
>>> On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]>
>>> wrote:
>>> 
>>>> On Saturday 01 March 2008, Grant Ingersoll wrote:
>>>>> Also, any thoughts on what we might want someone to do?  I think
>> it
>>>>> would be great to have someone implement one of the algorithms on
>>> our
>>>>> wiki.
>>>> 
>>>> Just as a general note, the deadline for applications:
>>>> 
>>>> March 12: Mentoring organization application deadline (12 noon
>>> PDT/19:00
>>>> UTC).
>>>> 
>>>> I suppose we should identify interesting tasks until that deadline.
>> As
>>> a
>>>> general guideline for mentors and for project proposals:
>>>> 
>>>> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
>>>> 
>>>> Isabel
>>>> 
>>>> --
>>>> Better late than never. -- Titus Livius (Livy)
>>>>   |\  _,,,---,,_   Web:   <http://www.isabel-drost.de>
>>>>  /,`.-'`'-.  ;-;;,_
>>>>  |,4-  ) )-,_..;\ (  `'-'
>>>> '---''(_/--'  `-'\_) (fL)  IM:  
>>>> 
>>> 
>> 



Re: Google Summer of Code[esp. More Clustering]

2008-03-10 Thread Matthew Riley
Hi Jeff-

I think your "basin of attraction" understanding is right on. I also like
your ideas for distributing the mean-shift iterations by following a
canopy-style method. My intuition was a little different, and I would like
to hear your ideas on it:

Just to make sure we're on the same page:
Say we have 1 million points in our original dataset, and we want to cluster
by mean-shift. At each iteration of mean-shift we subsample (say) 10,000
points from the original dataset and follow the gradient of those points to
the region of highest density (and as we saw from the paper, rather than
calculate the gradient itself we can equivalently compute the weighted
average of our subsampled points and move the centroid to that point). This
part seems fairly straightforward to distribute - we just send a different
subsampled set to each processor and each processor returns the final
centroid for that set.

The problem I see is that 10,000 points (or whatever value we choose) may
be too much for a single processor if we have to compute the distance to
every single point when we compute the weighted mean. My thought here was to
exploit the fact that we're using a kernel function (gaussian, uniform,
etc.) in the weighted mean calculation and that kernel will have a set
radius. Because the radius is static, it may be easy to (quickly) identify
the points that we must consider in the calculation (i.e. those within the
radius) by using a locality sensitive hashing scheme, tuned to that
particular radius. Of course, the degree of advantage we get from this
method will depend on the data itself, but intuitively I think we will
usually see a dramatic improvement.
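
The radius-lookup idea above can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: it swaps in a plain uniform grid hash (cell width equal to the kernel radius) as a stand-in for a real locality sensitive hashing scheme, so it is exact rather than approximate. The names `cell_of`, `build_grid` and `neighbors_within` are hypothetical and not part of Mahout.

```python
import math
from collections import defaultdict
from itertools import product


def cell_of(point, radius):
    """Map a point to the integer grid cell that contains it."""
    return tuple(math.floor(x / radius) for x in point)


def build_grid(points, radius):
    """Bucket point indices by grid cell; cell width equals the kernel radius."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[cell_of(p, radius)].append(i)
    return grid


def neighbors_within(query, points, grid, radius):
    """Return sorted indices of points within `radius` of `query`.

    Only the query's cell and its adjacent cells are scanned, so the
    distance is computed for nearby candidates instead of every point.
    """
    base = cell_of(query, radius)
    found = []
    for offset in product((-1, 0, 1), repeat=len(base)):
        cell = tuple(b + o for b, o in zip(base, offset))
        for i in grid.get(cell, ()):
            if math.dist(query, points[i]) <= radius:
                found.append(i)
    return sorted(found)


# Toy data, assumed for illustration only.
points = [(0.0, 0.0), (0.5, 0.5), (0.9, 0.0), (3.0, 3.0), (-0.4, 0.2)]
grid = build_grid(points, radius=1.0)
print(neighbors_within((0.1, 0.1), points, grid, radius=1.0))
```

Because the radius is fixed, the grid is built once and reused for every weighted-mean computation. The number of adjacent cells grows as 3^d with the dimension d, which is one reason a hashing scheme would likely be needed for high-dimensional features.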

Honestly, I should do more background work developing this idea, and
possibly try a matlab implementation to test the feasibility. This sounds
more like a research paper than something we should dive into immediately,
but I wanted to share the idea and get some feedback if anyone has
thoughts...

Matt


On Mon, Mar 10, 2008 at 11:29 AM, Jeff Eastman <[EMAIL PROTECTED]> wrote:

> Hi Matthew,
>
> I've been looking over the mean-shift papers for the last several days.
> While the details of the math are still sinking in, it looks like the
> basic algorithm might be summarized thusly:
>
> Points in an n-d feature space are migrated iteratively in the direction
> of maxima in their local density functions. Points within a "basin of
> attraction" all converge to the same maxima and thus belong to the same
> cluster.
>
> A physical analogy might be(?):
>
> Gas particles in 3-space, operating with gravitational attraction but
> without momentum, would tend to cluster similarly.
>
> The algorithm seems to require that each point be compared with every
> other point. This might be taken to require each mapper to see all of
> the points, thus frustrating scalability. OTOH, Canopy clustering avoids
> this by clustering the clusters produced by the subsets of points seen
> by each mapper. K-means has the requirement that each point needs to be
> compared with all of the cluster centers, not points. It has a similar
> iterative structure over clusters (a much smaller, constant number) that
> might be employed.
>
> There is a lot of locality in the local density function window, and
> this could perhaps be exploited. If points could be pre-clustered (as
> canopy is often used to prime the k-means iterations), parallelization
> might be feasible.
>
> Are these observations within a "basin of attraction" to your
> understanding of mean-shift ?
>
> Jeff
>
>
>
> -----Original Message-----
> From: Matthew Riley [mailto:[EMAIL PROTECTED]
> Sent: Thursday, March 06, 2008 11:46 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Google Summer of Code[esp. More Clustering]
>
> Hey Jeff-
>
> I'm certainly willing to put some energy into developing implementations
> of
> these algorithms, and it's good to hear that you may be interested in
> guiding us in the right direction.
>
> Here are the references I learned the algorithms from- some are more
> detailed than others:
>
> Mean-Shift clustering was introduced here and this paper is a thorough
> reference:
> Mean-Shift: A Robust Approach to Feature Space Analysis
> http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf
>
> And here's a PDF with just guts of the algorithm outlined:
> homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
>
> It looks like there isn't a definitive reference for the k-means
> approximation with randomized k-d trees, but there are promising results
> introduced here:
>
> Object retrieval with large vocabularies and fast spatial matching:
> http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf

RE: Google Summer of Code[esp. More Clustering]

2008-03-10 Thread Jeff Eastman
Hi Matthew,

I've been looking over the mean-shift papers for the last several days.
While the details of the math are still sinking in, it looks like the
basic algorithm might be summarized thusly:

Points in an n-d feature space are migrated iteratively in the direction
of maxima in their local density functions. Points within a "basin of
attraction" all converge to the same maxima and thus belong to the same
cluster.
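
The summary above might be sketched like this in Python. It is a minimal, single-machine illustration and not a Mahout implementation: the Gaussian kernel, the `bandwidth` parameter, the `merge_dist` tolerance and all function names are assumptions chosen for the example.

```python
import math


def shift_once(x, points, bandwidth):
    """Move x to the Gaussian-weighted mean of all points (one mean-shift step)."""
    weights = [math.exp(-math.dist(x, p) ** 2 / (2 * bandwidth ** 2)) for p in points]
    total = sum(weights)
    return tuple(
        sum(w * p[d] for w, p in zip(weights, points)) / total
        for d in range(len(x))
    )


def find_mode(x, points, bandwidth, tol=1e-6, max_iter=500):
    """Repeat the shift until the point stops moving; the limit is a density mode."""
    for _ in range(max_iter):
        nxt = shift_once(x, points, bandwidth)
        if math.dist(x, nxt) < tol:
            return nxt
        x = nxt
    return x


def mean_shift(points, bandwidth, merge_dist=0.1):
    """Label each point by the mode (basin of attraction) it converges to."""
    modes, labels = [], []
    for p in points:
        m = find_mode(p, points, bandwidth)
        for k, known in enumerate(modes):
            if math.dist(m, known) < merge_dist:
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels


# Two well-separated toy groups, assumed for illustration only.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, -0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
modes, labels = mean_shift(points, bandwidth=0.5)
print(labels)
```

Note that `shift_once` touches every point for every shifted point, which is the all-pairs cost discussed in the next paragraph.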

A physical analogy might be(?):

Gas particles in 3-space, operating with gravitational attraction but
without momentum, would tend to cluster similarly.

The algorithm seems to require that each point be compared with every
other point. This might be taken to require each mapper to see all of
the points, thus frustrating scalability. OTOH, Canopy clustering avoids
this by clustering the clusters produced by the subsets of points seen
by each mapper. K-means has the requirement that each point needs to be
compared with all of the cluster centers, not points. It has a similar
iterative structure over clusters (a much smaller, constant number) that
might be employed.

There is a lot of locality in the local density function window, and
this could perhaps be exploited. If points could be pre-clustered (as
canopy is often used to prime the k-means iterations), parallelization
might be feasible.
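
A rough sketch of the "clustering the clusters" idea, in Python. It follows the usual two-threshold canopy description (a loose threshold `t1` and a tight threshold `t2`), and the two-stage wrapper is only an assumption about how mapper output could be combined, not Mahout's actual code. All names here are hypothetical.

```python
import math


def canopies(points, t1, t2):
    """Return (center, members) pairs using thresholds with 0 < t2 < t1."""
    if not 0 < t2 < t1:
        raise ValueError("thresholds must satisfy 0 < t2 < t1")
    candidates = list(points)
    result = []
    while candidates:
        center = candidates[0]
        # Loose threshold: everything within t1 belongs to this canopy.
        members = [p for p in points if math.dist(center, p) < t1]
        # Tight threshold: points within t2 can no longer start a canopy.
        candidates = [p for p in candidates if math.dist(center, p) >= t2]
        result.append((center, members))
    return result


def two_stage_centers(splits, t1, t2):
    """Run canopy on each split (the 'map' side), then on the resulting centers."""
    local = [c for split in splits for c, _ in canopies(split, t1, t2)]
    return [c for c, _ in canopies(local, t1, t2)]


# Each inner list stands for the points seen by one mapper (toy data).
splits = [
    [(0.0, 0.0), (0.3, 0.1), (8.0, 8.0)],
    [(0.1, 0.2), (8.2, 7.9), (7.9, 8.1)],
]
print(two_stage_centers(splits, t1=3.0, t2=1.5))
```

No mapper needs to see all of the points; only the much smaller list of local centers is brought together in the second stage.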

Are these observations within a "basin of attraction" to your
understanding of mean-shift?

Jeff

 

-----Original Message-----
From: Matthew Riley [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 06, 2008 11:46 AM
To: mahout-dev@lucene.apache.org
Subject: Re: Google Summer of Code[esp. More Clustering]

Hey Jeff-

I'm certainly willing to put some energy into developing implementations
of
these algorithms, and it's good to hear that you may be interested in
guiding us in the right direction.

Here are the references I learned the algorithms from- some are more
detailed than others:

Mean-Shift clustering was introduced here and this paper is a thorough
reference:
Mean-Shift: A Robust Approach to Feature Space Analysis
http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf

And here's a PDF with just guts of the algorithm outlined:
homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf

It looks like there isn't a definitive reference for the k-means
approximation with randomized k-d trees, but there are promising results
introduced here:

Object retrieval with large vocabularies and fast spatial matching:
http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf

And a deeper explanation of the technique here:

Randomized KD-Trees for Real-Time Keypoint Detection:
ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521

Let me know what you think.

Matt

On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <[EMAIL PROTECTED]>
wrote:

> Hi Matthew,
>
> As with most open source projects, "interest" is mainly a function of
> the willingness of somebody to contribute their energy. Clustering is
> certainly within the scope of the project. I'd be interested in
> exploring additional clustering algorithms with you and your
colleague.
> I'm a complete noob in this area and it is always enlightening to work
> with students who have more current theoretical exposures.
>
> Do you have some links on these approaches that you find particularly
> helpful?
>
> Jeff
>
> -----Original Message-----
> From: Matthew Riley [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 05, 2008 11:11 PM
> To: mahout-dev@lucene.apache.org; [EMAIL PROTECTED]
> Subject: Re: Google Summer of Code
>
> Hey everyone-
>
> I've been watching the mailing list for a little while now, hoping to
> contribute once I became more familiar, but I wanted to jump in here
now
> and
> express my interest in the Summer of Code project. I'm currently a
> graduate
> student in electrical engineering at UT-Austin working in computer
> vision,
> which is closely tied to many of the problems Mahout is addressing
> (especially in my area of content-based retrieval).
>
> What can I do to help out?
>
> I've discussed some potential Mahout projects with another student
> recently-
> mostly focused around approximate k-means algorithms (since that's a
> problem
> I've been working on lately). It sounds like you guys are already
> implementing canopy clustering for k-means- Is there any interest in
> developing another approximation algorithm based on randomized
kd-trees
> for
> high dimensional data? What about mean-shift clustering?
>
> Again, I would be glad to help in any way I can.
>
> Matt
>
> On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]>
> wrote:
>
> > On Saturday 01 March 2008, Grant Ingersoll wrote:
> > > Also, any thoughts on what we 

Re: Google Summer of Code

2008-03-10 Thread Anush Shetty
On Mon, Mar 10, 2008 at 4:50 PM, Grant Ingersoll <[EMAIL PROTECTED]>
wrote:

> Wow, maybe w/ all of our mentors we could get 2 students...
>

neat ++ :)



-- 
((Anush Shetty)) ((mail AT anushshetty DOT com))


Re: Google Summer of Code

2008-03-10 Thread Grant Ingersoll

Wow, maybe w/ all of our mentors we could get 2 students...


On Mar 10, 2008, at 3:42 AM, Isabel Drost wrote:

> On Saturday 08 March 2008, Grant Ingersoll wrote:
>> Please feel free to add your name to the list of
>> mentors if you can.
>
> I have added my name and created an account at the Google SoC web application.
> Anything else - apart from reading the GSoC documentation we should not
> forget?
>
> Isabel



Re: Google Summer of Code

2008-03-10 Thread Isabel Drost
On Saturday 08 March 2008, Grant Ingersoll wrote:
> Please feel free to add your name to the list of
> mentors if you can.

I have added my name and created an account at the Google SoC web application.
Is there anything else, apart from reading the GSoC documentation, that we
should not forget?

Isabel

-- 
A woman forgives the audacity of which her beauty has prompted us to be 
guilty. -- LeSage
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  




Re: Google Summer of Code

2008-03-09 Thread Ian Holsman

Hi Grant.
I'll be happy to mentor someone for this project.

regards
Ian



| A person or group responsible for review and ranking of student
| applications,

I'd be happy to help out here. Anyone else?


Cool





Re: Google Summer of Code

2008-03-09 Thread Grant Ingersoll

You have to create an id and login...

On Mar 9, 2008, at 7:58 PM, Jeff Eastman wrote:

> I'd be willing to contribute. The page shows as immutable to me, so
> perhaps you could add my name next time you are there.
>
> Jeff
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
> Sent: Saturday, March 08, 2008 1:42 PM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Google Summer of Code
>
> Note, the deadline for project proposals is March 12.
>
> I put an item up for us at:
> http://wiki.apache.org/general/SummerOfCode2008
> I think it is probably general enough to cover all of the bases
> discussed here.  Please feel free to add your name to the list of
> mentors if you can.  Perhaps we can share duties.
>
> -Grant
>
> On Mar 7, 2008, at 1:43 PM, Isabel Drost wrote:
>
>> On Friday 07 March 2008, Grant Ingersoll wrote:
>>> Sounds good.  I should also note that all mentoring should (barring
>>> personal conversation) should take place on the dev list.  That is,
>>> decisions, discussions on what to do should be done on the list so
>>> that we all benefit from the understanding.  Not that you were
>>> suggesting otherwise!
>>
>> Sure, after all, GSoC is about integrating students into free software
>> projects - and making decisions offline certainly is not the way, Apache
>> projects work. Thanks for pointing that out.
>>
>> Isabel




--
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







RE: Google Summer of Code

2008-03-09 Thread Jeff Eastman
I'd be willing to contribute. The page shows as immutable to me, so
perhaps you could add my name next time you are there.

Jeff

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: Saturday, March 08, 2008 1:42 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Google Summer of Code

Note, the deadline for project proposals is March 12.

I put an item up for us at:
http://wiki.apache.org/general/SummerOfCode2008 
I think it is probably general enough to cover all of the bases  
discussed here.  Please feel free to add your name to the list of  
mentors if you can.  Perhaps we can share duties.

-Grant



On Mar 7, 2008, at 1:43 PM, Isabel Drost wrote:

> On Friday 07 March 2008, Grant Ingersoll wrote:
>> Sounds good.  I should also note that all mentoring should (barring
>> personal conversation) should take place on the dev list.  That is,
>> decisions, discussions on what to do should be done on the list so
>> that we all benefit from the understanding.  Not that you were
>> suggesting otherwise!
>
> Sure, after all, GSoC is about integrating students into free software
> projects - and making decisions offline certainly is not the way,  
> Apache
> projects work. Thanks for pointing that out.
>
> Isabel



Re: Google Summer of Code

2008-03-08 Thread Grant Ingersoll

Note, the deadline for project proposals is March 12.

I put an item up for us at:
http://wiki.apache.org/general/SummerOfCode2008
I think it is probably general enough to cover all of the bases
discussed here.  Please feel free to add your name to the list of
mentors if you can.  Perhaps we can share duties.


-Grant



On Mar 7, 2008, at 1:43 PM, Isabel Drost wrote:

> On Friday 07 March 2008, Grant Ingersoll wrote:
>> Sounds good.  I should also note that all mentoring should (barring
>> personal conversation) should take place on the dev list.  That is,
>> decisions, discussions on what to do should be done on the list so
>> that we all benefit from the understanding.  Not that you were
>> suggesting otherwise!
>
> Sure, after all, GSoC is about integrating students into free software
> projects - and making decisions offline certainly is not the way, Apache
> projects work. Thanks for pointing that out.
>
> Isabel




Re: Google Summer of Code

2008-03-07 Thread Isabel Drost
On Friday 07 March 2008, Grant Ingersoll wrote:
> Sounds good.  I should also note that all mentoring should (barring
> personal conversation) should take place on the dev list.  That is,
> decisions, discussions on what to do should be done on the list so
> that we all benefit from the understanding.  Not that you were
> suggesting otherwise!

Sure, after all, GSoC is about integrating students into free software
projects, and making decisions offline certainly is not the way Apache
projects work. Thanks for pointing that out.

Isabel


-- 
Never pay a compliment as if expecting a receipt.
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  




Re: Google Summer of Code

2008-03-07 Thread Grant Ingersoll


On Mar 7, 2008, at 3:08 AM, Isabel Drost wrote:

> On Thursday 06 March 2008, Grant Ingersoll wrote:
>> I think we can split the duties a bit, too.
>
> I think the Apache FAQ also said that - according with the usual Apache way of
> doing things - it would be ok if the GSoC students would receive help from
> all community members. So the actual time spent for one mentor could very
> well drop to about 3h per week.
>
> Still I would not rely on that when accepting the duty to become a mentor -
> after all, at least officially it is the mentor who is responsible for
> encouraging the student.


Sounds good.  I should also note that all mentoring (barring
personal conversation) should take place on the dev list.  That is,
decisions and discussions on what to do should happen on the list so
that we all benefit from the understanding.  Not that you were
suggesting otherwise!


-Grant



Re: Google Summer of Code

2008-03-07 Thread Dawid Weiss


> What about encouraging your students to submit their work at Mahout? Just a
> naive thought of mine.


Those students I'm in charge of have their area of interest defined already -- 
too late to change it. Good idea for the future, I have been thinking about it, 
actually.


D.


Re: Google Summer of Code

2008-03-07 Thread Isabel Drost
On Thursday 06 March 2008, Matthew Riley wrote:
> I would basically be interested in doing anything that fits in well with
> the overall goals of the Mahout project. Whether that is implementing well
> known algorithms within the Hadoop framework or working on some novel idea
> is up to the mentors, I presume. 

I would be happy with both options: Working on well known algorithms within
the Hadoop framework certainly is one of our main goals. But I, personally,
am also interested in providing space for novel ideas. I consider
it really important for researchers to publish not only the data they
experimented on but also the implementation used. If working on the latter
within Mahout helps to focus a little more than usual on scalability
and maintainability - great.

So if you have an idea that fits well with your day to day work as well as 
with the overall goals of Mahout that would be fine. I would guess, this 
makes it easier to find some spare time to work on the project ;)

Isabel 

-- 
Each new user of a new system uncovers a new class of bugs. -- 
Kernighan
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  




Re: Google Summer of Code

2008-03-07 Thread Isabel Drost
On Thursday 06 March 2008, Grant Ingersoll wrote:
> I think we can split the duties a bit, too. 

I think the Apache FAQ also said that, in keeping with the usual Apache way of
doing things, it would be OK for GSoC students to receive help from
all community members. So the actual time spent by one mentor could very
well drop to about 3h per week.

Still I would not rely on that when accepting the duty to become a mentor - 
after all, at least officially it is the mentor who is responsible for 
encouraging the student.

Isabel



-- 
The bug stops here.
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  




Re: Google Summer of Code

2008-03-06 Thread Grant Ingersoll


On Mar 6, 2008, at 4:36 PM, Matthew Riley wrote:

> I would basically be interested in doing anything that fits in well with the
> overall goals of the Mahout project. Whether that is implementing well known
> algorithms within the Hadoop framework or working on some novel idea is up
> to the mentors, I presume. Personally, if I'm going to be working on
> something novel, I would like to relate it to my current research work...
> and I'm happy to discuss that with anyone on the list who is interested.




Please do share your research work.  As for novel, versus w/in the  
goals, we like both.  I think at this stage, however, we do want to  
focus on those approaches that have stood the test of time (as short  
as that is) to some extent.  Personally, I would love to see someone  
take on SVM on Hadoop, but I am open to pretty much anything, so...


-Grant


Re: Google Summer of Code[esp. More Clustering]

2008-03-06 Thread Matthew Riley
Hey Grant-

I believe scaling Mean-Shift clustering using M/R will be pretty
straightforward. I'm not as sure about K-Means using KD-Trees, since I
haven't personally implemented that algorithm, but since it follows K-Means
fairly closely I imagine it is possible.

I'll get to work on a proposal with some of my ideas, and hopefully get some
feedback from you guys during the process.

Thanks for all the responses so far.

Matt

On Thu, Mar 6, 2008 at 3:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> I haven't read the papers, but the big question is do you think they
> can scale using M/R or some other distributed techniques?
>
> If so, feel free to write up a bit of a proposal using the info at:
> http://wiki.apache.org/general/SummerOfCode2008
>   If you are unsure, that is fine too.  We could start with a simpler
> implementation, and then look to distribute it.
>
>
> On Mar 6, 2008, at 2:45 PM, Matthew Riley wrote:
>
> > Hey Jeff-
> >
> > I'm certainly willing to put some energy into developing
> > implementations of
> > these algorithms, and it's good to hear that you may be interested in
> > guiding us in the right direction.
> >
> > Here are the references I learned the algorithms from- some are more
> > detailed than others:
> >
> > Mean-Shift clustering was introduced here and this paper is a thorough
> > reference:
> > Mean-Shift: A Robust Approach to Feature Space Analysis
> > http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf
> >
> > And here's a PDF with just guts of the algorithm outlined:
> > homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
> >
> > It looks like there isn't a definitive reference for the k-means
> > approximation with randomized k-d trees, but there are promising
> > results
> > introduced here:
> >
> > Object retrieval with large vocabularies and fast spatial matching:
> > http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf
> >
> > And a deeper explanation of the technique here:
> >
> > Randomized KD-Trees for Real-Time Keypoint Detection:
> > ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521
> >
> > Let me know what you think.
> >
> > Matt
> >
> > On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <[EMAIL PROTECTED]>
> > wrote:
> >
> >> Hi Matthew,
> >>
> >> As with most open source projects, "interest" is mainly a function of
> >> the willingness of somebody to contribute their energy. Clustering is
> >> certainly within the scope of the project. I'd be interested in
> >> exploring additional clustering algorithms with you and your
> >> colleague.
> >> I'm a complete noob in this area and it is always enlightening to
> >> work
> >> with students who have more current theoretical exposures.
> >>
> >> Do you have some links on these approaches that you find particularly
> >> helpful?
> >>
> >> Jeff
> >>
> >> -----Original Message-----
> >> From: Matthew Riley [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, March 05, 2008 11:11 PM
> >> To: mahout-dev@lucene.apache.org; [EMAIL PROTECTED]
> >> Subject: Re: Google Summer of Code
> >>
> >> Hey everyone-
> >>
> >> I've been watching the mailing list for a little while now, hoping to
> >> contribute once I became more familiar, but I wanted to jump in
> >> here now
> >> and
> >> express my interest in the Summer of Code project. I'm currently a
> >> graduate
> >> student in electrical engineering at UT-Austin working in computer
> >> vision,
> >> which is closely tied to many of the problems Mahout is addressing
> >> (especially in my area of content-based retrieval).
> >>
> >> What can I do to help out?
> >>
> >> I've discussed some potential Mahout projects with another student
> >> recently-
> >> mostly focused around approximate k-means algorithms (since that's a
> >> problem
> >> I've been working on lately). It sounds like you guys are already
> >> implementing canopy clustering for k-means- Is there any interest in
> >> developing another approximation algorithm based on randomized kd-
> >> trees
> >> for
> >> high dimensional data? What about mean-shift clustering?
> >>
> >> Again, I would be 

Re: Google Summer of Code

2008-03-06 Thread Matthew Riley
Hey Dawid,

> Is it information retrieval from visual data you're working on? We have
> recently had a presentation about a guy who implemented motion detection
> on GPUs with very impressive speedups (orders of magnitude compared to
> normal CPUs). I'm wondering if your expertise here could be used to
> implement map-reduce distributed jobs for running multiple GPUs in
> parallel. I know this sounds a bit crazy, but I've heard of
> bio-engineering companies doing just that -- running a cluster of GPUs to
> speed up their computations. Just a wild thought. Back to your proposal
> though.


Yes, it is basically information retrieval that I'm performing on sets of
images- in fact, a lot of the best algorithms employed today for object
detection, object retrieval, etc. are adaptations of basic text-retrieval
approaches (e.g. tfidf-weighted vector space models). I've personally never
worked with GPUs for image processing, but I imagine the vector processing
abilities would be useful at almost every stage of the indexing and
retrieval processes. I would be interested in looking into those
possibilities in more detail.


> > mostly focused around approximate k-means algorithms (since that's a
> problem
> > I've been working on lately). It sounds like you guys are already
> > implementing canopy clustering for k-means- Is there any interest in
> > developing another approximation algorithm based on randomized kd-trees
> for
> > high dimensional data? What about mean-shift clustering?
>
> From my experience the largest challenge in data clustering is not
> figuring out a new clustering methodology, but finding the right existing
> one to tackle a particular problem. Isabel mentioned the web spam
> detection challenge -- this is a good example of a multi-feature
> classification problem, and I know people have tried clustering the host
> graph to come up with more coarse-grained features for hosts. From my own
> interest, a very interesting challenge is doing something like Google News
> does (event aggregation). This is less trivial than you might think at
> first -- most news items are very similar to each other (copy/paste and
> editing changes), so it's trivial to find small clusters of near-clones.
> Then the problem becomes more difficult because all news items speak about
> pretty much the same people/events (take the presidential election in the
> U.S.). I think the problems you could state here are:
>
> 1) approximating optimal clustering granularity (call it the number of
> clusters if you wish, although I think clustering should be driven by
> other factors rather than just the number of clusters),
>
> 2) coming up with clusters of news items based on something _other_ than
> keyword-based similarity. One example here is grouping news by region
> (geolocation), sentiment (positive/negative news), people-related news,
> etc.
>
> 3) multilingual news matching and clustering.
>
> All the above issues are on the border of different domains -- NLP,
> clustering, classification. The tricky part is being able to put them
> together. What would be of interest to you?


These are all interesting problems, actually. I've done some research into
sentiment analysis, as you mentioned in (2), and I think it's still a wide
open problem. Oren Etzioni at UWash does some interesting related work:
www.cs.washington.edu/homes/etzioni/.

I would basically be interested in doing anything that fits in well with the
overall goals of the Mahout project. Whether that is implementing well known
algorithms within the Hadoop framework or working on some novel idea is up
to the mentors, I presume. Personally, if I'm going to be working on
something novel, I would like to relate it to my current research work...
and I'm happy to discuss that with anyone on the list who is interested.

Matt


>
>
> D.
>
> >
> > Again, I would be glad to help in any way I can.
> >
> > Matt
> >
> > On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]>
> > wrote:
> >
> >> On Saturday 01 March 2008, Grant Ingersoll wrote:
> >>> Also, any thoughts on what we might want someone to do?  I think it
> >>> would be great to have someone implement one of the algorithms on our
> >>> wiki.
> >> Just as a general note, the deadline for applications:
> >>
> >> March 12: Mentoring organization application deadline (12 noon
> PDT/19:00
> >> UTC).
> >>
> >> I suppose we should identify interesting tasks until that deadline. As a
> >> general guideline for mentors and for project proposals:
> >>
> >> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
> >>
> >> Isabel
> >>
> >> --
> >> Better late than never. -- Titus Livius (Livy)
> >>   |\  _,,,---,,_   Web:   
> >>  /,`.-'`'-.  ;-;;,_
> >>  |,4-  ) )-,_..;\ (  `'-'
> >> '---''(_/--'  `-'\_) (fL)  IM:  
> >>
> >
>


Re: Google Summer of Code[esp. More Clustering]

2008-03-06 Thread Grant Ingersoll
I haven't read the papers, but the big question is do you think they
can scale using M/R or some other distributed techniques?

If so, feel free to write up a bit of a proposal using the info at:
http://wiki.apache.org/general/SummerOfCode2008
If you are unsure, that is fine too.  We could start with a simpler
implementation, and then look to distribute it.



On Mar 6, 2008, at 2:45 PM, Matthew Riley wrote:


Hey Jeff-

I'm certainly willing to put some energy into developing implementations of
these algorithms, and it's good to hear that you may be interested in
guiding us in the right direction.

Here are the references I learned the algorithms from- some are more
detailed than others:

Mean-Shift clustering was introduced here and this paper is a thorough
reference:
Mean-Shift: A Robust Approach to Feature Space Analysis
http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf

And here's a PDF with just guts of the algorithm outlined:
homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf

It looks like there isn't a definitive reference for the k-means
approximation with randomized k-d trees, but there are promising results
introduced here:

Object retrieval with large vocabularies and fast spatial matching:
http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf

And a deeper explanation of the technique here:

Randomized KD-Trees for Real-Time Keypoint Detection:
ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521

Let me know what you think.

Matt

On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <[EMAIL PROTECTED]>  
wrote:



Hi Matthew,

As with most open source projects, "interest" is mainly a function of
the willingness of somebody to contribute their energy. Clustering is
certainly within the scope of the project. I'd be interested in
exploring additional clustering algorithms with you and your colleague.
I'm a complete noob in this area and it is always enlightening to work
with students who have more current theoretical exposures.

Do you have some links on these approaches that you find particularly
helpful?

Jeff

-Original Message-
From: Matthew Riley [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 05, 2008 11:11 PM
To: mahout-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Google Summer of Code

Hey everyone-

I've been watching the mailing list for a little while now, hoping to
contribute once I became more familiar, but I wanted to jump in here now and
express my interest in the Summer of Code project. I'm currently a graduate
student in electrical engineering at UT-Austin working in computer vision,
which is closely tied to many of the problems Mahout is addressing
(especially in my area of content-based retrieval).

What can I do to help out?

I've discussed some potential Mahout projects with another student recently-
mostly focused around approximate k-means algorithms (since that's a problem
I've been working on lately). It sounds like you guys are already
implementing canopy clustering for k-means- Is there any interest in
developing another approximation algorithm based on randomized kd-trees for
high dimensional data? What about mean-shift clustering?

Again, I would be glad to help in any way I can.

Matt

On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]> wrote:


On Saturday 01 March 2008, Grant Ingersoll wrote:

Also, any thoughts on what we might want someone to do?  I think it
would be great to have someone implement one of the algorithms on our
wiki.


Just as a general note, the deadline for applications:

March 12: Mentoring organization application deadline (12 noon PDT/19:00
UTC).

I suppose we should identify interesting tasks until that deadline. As a
general guideline for mentors and for project proposals:

http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors

Isabel

--
Better late than never. -- Titus Livius (Livy)
 |\  _,,,---,,_   Web:   <http://www.isabel-drost.de>
/,`.-'`'-.  ;-;;,_
|,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  





--
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: Google Summer of Code

2008-03-06 Thread Grant Ingersoll
I think the Mentoring Org is already setup.  After March 3, mentors  
can register.  See http://wiki.apache.org/general/SummerOfCode2008.   
I'm willing to mentor, but would like to share the load a bit too.


-Grant


On Mar 6, 2008, at 1:56 AM, Isabel Drost wrote:


On Saturday 01 March 2008, Grant Ingersoll wrote:

Also, any thoughts on what we might want someone to do?  I think it
would be great to have someone implement one of the algorithms on our
wiki.


Just as a general note, the deadline for applications:

March 12: Mentoring organization application deadline (12 noon PDT/19:00 UTC).

I suppose we should identify interesting tasks until that deadline. As a
general guideline for mentors and for project proposals:

http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors

Isabel

--
Better late than never. -- Titus Livius (Livy)
 |\  _,,,---,,_   Web:   
 /,`.-'`'-.  ;-;;,_
|,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  





Re: Google Summer of Code

2008-03-06 Thread Grant Ingersoll
I think we can split the duties a bit, too.  Simon, can you share your  
experience since you did a GSOC for Lucene a few years back?  I seem  
to recall there being a couple of Lucene mentors.



On Mar 6, 2008, at 1:47 AM, Isabel Drost wrote:


Finally found a general answer to my question:

| While the answer to this question will vary widely depending on the number
| of students a mentor works with, the difficulty of the proposals and the
| skill level of the students, most mentors have let us know that they
| underestimated the amount of time they would need to invest in GSoC. Five
| hours per student per week is a reasonable estimate.

Sounds like doing it part time after work is going to be a bit tough - but
five hours per student per week in summer should be doable, at least for me.

As these are summer projects and certainly some of us are going to be on
vacation during that time we should plan for having one backup for each
mentor that goes on vacation at a different time.


Isabel


--
You were s'posed to laugh!
 |\  _,,,---,,_   Web:   
 /,`.-'`'-.  ;-;;,_
|,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  





Re: Google Summer of Code

2008-03-06 Thread Grant Ingersoll


On Mar 6, 2008, at 1:32 AM, Isabel Drost wrote:


On Wednesday 05 March 2008, Simon Willnauer wrote:

You could have a look at the FAQ or the GSoC pages
http://code.google.com/soc/2008/ and
http://code.google.com/soc/2008/faqs.html respectively.


Hmm, there is little about what is expected of mentors apart from the following
rather general question, is there?

| 2. What is the role of a mentoring organization?

If we want to take part in GSoC, from that question, I guess we need a little
more than only mentors:

| A pool of project ideas for students to choose from.

Grant already asked for ideas.

| An organization administrator to act as the project's main point of contact
| for Google;

Any volunteers?


I think this is covered by the ASF.




| A person or group responsible for review and ranking of student
| applications,

I'd be happy to help out here. Anyone else?


Cool


Re: Google Summer of Code[esp. More Clustering]

2008-03-06 Thread Matthew Riley
Hey Jeff-

I'm certainly willing to put some energy into developing implementations of
these algorithms, and it's good to hear that you may be interested in
guiding us in the right direction.

Here are the references I learned the algorithms from- some are more
detailed than others:

Mean-Shift clustering was introduced here and this paper is a thorough
reference:
Mean-Shift: A Robust Approach to Feature Space Analysis
http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf

And here's a PDF with just guts of the algorithm outlined:
homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
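
For anyone who wants the gist before opening the papers, a rough sketch of the flat-kernel variant is below. It assumes Euclidean distance and a single fixed bandwidth, runs on one machine, and is not taken from either reference; mean_shift is a hypothetical helper name.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift: move each point to the mean of its neighbours
    until it stops moving, then merge nearby modes into cluster centers."""
    modes = []
    for p in points:
        x = list(p)
        for _ in range(max_iter):
            neighbours = [q for q in points if math.dist(x, q) <= bandwidth]
            if not neighbours:  # defensive guard against rounding effects
                break
            new_x = [sum(c) / len(neighbours) for c in zip(*neighbours)]
            moved = math.dist(x, new_x)
            x = new_x
            if moved < tol:
                break
        modes.append(x)

    # Modes that ended up closer than one bandwidth share a cluster.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) < bandwidth:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels
```

Each point scans every other point on every iteration, so the cost grows quadratically with the data size. That inner neighbour search is the part that would need to be distributed or approximated for large inputs.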

It looks like there isn't a definitive reference for the k-means
approximation with randomized k-d trees, but there are promising results
introduced here:

Object retrieval with large vocabularies and fast spatial matching:
http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf

And a deeper explanation of the technique here:

Randomized KD-Trees for Real-Time Keypoint Detection:
ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521

Let me know what you think.

Matt

On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <[EMAIL PROTECTED]> wrote:

> Hi Matthew,
>
> As with most open source projects, "interest" is mainly a function of
> the willingness of somebody to contribute their energy. Clustering is
> certainly within the scope of the project. I'd be interested in
> exploring additional clustering algorithms with you and your colleague.
> I'm a complete noob in this area and it is always enlightening to work
> with students who have more current theoretical exposures.
>
> Do you have some links on these approaches that you find particularly
> helpful?
>
> Jeff
>
> -Original Message-
> From: Matthew Riley [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 05, 2008 11:11 PM
> To: mahout-dev@lucene.apache.org; [EMAIL PROTECTED]
> Subject: Re: Google Summer of Code
>
> Hey everyone-
>
> I've been watching the mailing list for a little while now, hoping to
> contribute once I became more familiar, but I wanted to jump in here now
> and
> express my interest in the Summer of Code project. I'm currently a
> graduate
> student in electrical engineering at UT-Austin working in computer
> vision,
> which is closely tied to many of the problems Mahout is addressing
> (especially in my area of content-based retrieval).
>
> What can I do to help out?
>
> I've discussed some potential Mahout projects with another student
> recently-
> mostly focused around approximate k-means algorithms (since that's a
> problem
> I've been working on lately). It sounds like you guys are already
> implementing canopy clustering for k-means- Is there any interest in
> developing another approximation algorithm based on randomized kd-trees
> for
> high dimensional data? What about mean-shift clustering?
>
> Again, I would be glad to help in any way I can.
>
> Matt
>
> On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]>
> wrote:
>
> > On Saturday 01 March 2008, Grant Ingersoll wrote:
> > > Also, any thoughts on what we might want someone to do?  I think it
> > > would be great to have someone implement one of the algorithms on
> our
> > > wiki.
> >
> > Just as a general note, the deadline for applications:
> >
> > March 12: Mentoring organization application deadline (12 noon
> PDT/19:00
> > UTC).
> >
> > I suppose we should identify interesting tasks until that deadline. As
> a
> > general guideline for mentors and for project proposals:
> >
> > http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
> >
> > Isabel
> >
> > --
> > Better late than never. -- Titus Livius (Livy)
> >   |\  _,,,---,,_   Web:   <http://www.isabel-drost.de>
> >  /,`.-'`'-.  ;-;;,_
> >  |,4-  ) )-,_..;\ (  `'-'
> > '---''(_/--'  `-'\_) (fL)  IM:  
> >
>


Re: Google Summer of Code

2008-03-06 Thread Isabel Drost
On Thursday 06 March 2008, Dawid Weiss wrote:
> > five hours per student per week in summer should be doable, at least for
> > me.
>
> I'm out, unfortunately -- my academic job already comes with a bunch of
> students to take care of. It's fun, but it's time-consuming.

What about encouraging your students to submit their work at Mahout? Just a 
naive thought of mine.

Isabel

-- 
Immature poets imitate, mature poets steal. -- T. S. Eliot, "Philip 
Massinger"
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  




RE: Google Summer of Code[esp. More Clustering]

2008-03-06 Thread Jeff Eastman
Hi Matthew,

As with most open source projects, "interest" is mainly a function of
the willingness of somebody to contribute their energy. Clustering is
certainly within the scope of the project. I'd be interested in
exploring additional clustering algorithms with you and your colleague.
I'm a complete noob in this area and it is always enlightening to work
with students who have more current theoretical exposures.

Do you have some links on these approaches that you find particularly
helpful?

Jeff

-Original Message-
From: Matthew Riley [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 05, 2008 11:11 PM
To: mahout-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Google Summer of Code

Hey everyone-

I've been watching the mailing list for a little while now, hoping to
contribute once I became more familiar, but I wanted to jump in here now
and
express my interest in the Summer of Code project. I'm currently a
graduate
student in electrical engineering at UT-Austin working in computer
vision,
which is closely tied to many of the problems Mahout is addressing
(especially in my area of content-based retrieval).

What can I do to help out?

I've discussed some potential Mahout projects with another student
recently-
mostly focused around approximate k-means algorithms (since that's a
problem
I've been working on lately). It sounds like you guys are already
implementing canopy clustering for k-means- Is there any interest in
developing another approximation algorithm based on randomized kd-trees
for
high dimensional data? What about mean-shift clustering?

Again, I would be glad to help in any way I can.

Matt

On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]>
wrote:

> On Saturday 01 March 2008, Grant Ingersoll wrote:
> > Also, any thoughts on what we might want someone to do?  I think it
> > would be great to have someone implement one of the algorithms on
our
> > wiki.
>
> Just as a general note, the deadline for applications:
>
> March 12: Mentoring organization application deadline (12 noon
PDT/19:00
> UTC).
>
> I suppose we should identify interesting tasks until that deadline. As
a
> general guideline for mentors and for project proposals:
>
> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
>
> Isabel
>
> --
> Better late than never. -- Titus Livius (Livy)
>   |\  _,,,---,,_   Web:   <http://www.isabel-drost.de>
>  /,`.-'`'-.  ;-;;,_
>  |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_) (fL)  IM:  
>


Re: Google Summer of Code

2008-03-06 Thread Dawid Weiss


Hi Matthew,


student in electrical engineering at UT-Austin working in computer vision,
which is closely tied to many of the problems Mahout is addressing
(especially in my area of content-based retrieval).


Is it information retrieval from visual data you're working on? We have recently 
had a presentation about a guy who implemented motion detection on GPUs with 
very impressive speedups (orders of magnitude compared to normal CPUs). I'm 
wondering if your expertise here could be used to implement map-reduce 
distributed jobs for running multiple GPUs in parallel. I know this sounds a bit 
crazy, but I've heard of bio-engineering companies doing just that -- running a 
cluster of GPUs to speed up their computations. Just a wild thought. Back to 
your proposal though.



mostly focused around approximate k-means algorithms (since that's a problem
I've been working on lately). It sounds like you guys are already
implementing canopy clustering for k-means- Is there any interest in
developing another approximation algorithm based on randomized kd-trees for
high dimensional data? What about mean-shift clustering?


From my experience the largest challenge in data clustering is not figuring out 
a new clustering methodology, but finding the right existing one to tackle a 
particular problem. Isabel mentioned web spam detection challenge --  this is a 
good example of a multi-feature classification problem and I know people have 
tried clustering the host graph to come up with more coarse-grained features for 
hosts. From my own interest, a very interesting challenge is doing something 
like Google News does (event aggregation). This is less trivial than you might 
think at first -- most news items are very similar to each other (copy/paste and
editing changes), so it's trivial to find small clusters of near-clones. Then
the problem becomes more difficult because all news items cover pretty much the
same people/events (take the presidential election in the U.S.). I think the
problems you could state here are:


1) approximating optimal clustering granularity (call it the number of clusters 
if you wish, although I think clustering should be driven by other factors 
rather than just the number of clusters),


2) coming up with clusters of news items _other_ than keyword-based similarity. 
One example here is grouping news by region (geolocation), sentiment (positive/ 
negative news), people-related news, etc.


3) multilingual news matching and clustering.

All the above issues are on the border of different domains -- NLP, clustering, 
classification. The tricky part is being able to put them together. What would 
be of interest to you?


D.



Again, I would be glad to help in any way I can.

Matt

On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]>
wrote:


On Saturday 01 March 2008, Grant Ingersoll wrote:

Also, any thoughts on what we might want someone to do?  I think it
would be great to have someone implement one of the algorithms on our
wiki.

Just as a general note, the deadline for applications:

March 12: Mentoring organization application deadline (12 noon PDT/19:00
UTC).

I suppose we should identify interesting tasks until that deadline. As a
general guideline for mentors and for project proposals:

http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors

Isabel

--
Better late than never. -- Titus Livius (Livy)
  |\  _,,,---,,_   Web:   
 /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  





Re: Google Summer of Code

2008-03-06 Thread Dawid Weiss



five hours per student per week in summer should be doable, at least for me.


I'm out, unfortunately -- my academic job already comes with a bunch of students 
to take care of. It's fun, but it's time-consuming.


D.


Re: Google Summer of Code

2008-03-05 Thread Matthew Riley
Hey everyone-

I've been watching the mailing list for a little while now, hoping to
contribute once I became more familiar, but I wanted to jump in here now and
express my interest in the Summer of Code project. I'm currently a graduate
student in electrical engineering at UT-Austin working in computer vision,
which is closely tied to many of the problems Mahout is addressing
(especially in my area of content-based retrieval).

What can I do to help out?

I've discussed some potential Mahout projects with another student recently-
mostly focused around approximate k-means algorithms (since that's a problem
I've been working on lately). It sounds like you guys are already
implementing canopy clustering for k-means- Is there any interest in
developing another approximation algorithm based on randomized kd-trees for
high dimensional data? What about mean-shift clustering?
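
For context, canopy clustering is usually described as a cheap pre-pass that uses a loose threshold T1 and a tight threshold T2 (T1 > T2) to carve the data into overlapping canopies, whose centers can then seed k-means. The sketch below is a single-machine illustration under those assumptions; the function name canopy and the example thresholds are hypothetical, and it is not Mahout's implementation.

```python
import math


def canopy(points, t1, t2):
    """Return a list of (center, members) canopies for thresholds t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 (loose) must be larger than t2 (tight)")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # any remaining point may serve as a center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)  # loosely bound: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly bound: stays a candidate
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

For example, canopy([(0, 0), (1, 0), (3, 0), (10, 0), (11, 0)], t1=4, t2=2) yields three canopies centered at (0, 0), (3, 0) and (10, 0). The point (3, 0) belongs to the first canopy and also becomes a center of its own because it lies outside the tight threshold.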

Again, I would be glad to help in any way I can.

Matt

On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]>
wrote:

> On Saturday 01 March 2008, Grant Ingersoll wrote:
> > Also, any thoughts on what we might want someone to do?  I think it
> > would be great to have someone implement one of the algorithms on our
> > wiki.
>
> Just as a general note, the deadline for applications:
>
> March 12: Mentoring organization application deadline (12 noon PDT/19:00
> UTC).
>
> I suppose we should identify interesting tasks until that deadline. As a
> general guideline for mentors and for project proposals:
>
> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
>
> Isabel
>
> --
> Better late than never. -- Titus Livius (Livy)
>   |\  _,,,---,,_   Web:   
>  /,`.-'`'-.  ;-;;,_
>  |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_) (fL)  IM:  
>


Re: Google Summer of Code

2008-03-05 Thread Isabel Drost
On Saturday 01 March 2008, Grant Ingersoll wrote:
> Also, any thoughts on what we might want someone to do?  I think it
> would be great to have someone implement one of the algorithms on our
> wiki.

Just as a general note, the deadline for applications:

March 12: Mentoring organization application deadline (12 noon PDT/19:00 UTC).

I suppose we should identify interesting tasks until that deadline. As a
general guideline for mentors and for project proposals:

http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors

Isabel

-- 
Better late than never. -- Titus Livius (Livy)
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  




Re: Google Summer of Code

2008-03-05 Thread Isabel Drost
Finally found a general answer to my question:

| While the answer to this question will vary widely depending on the number
| of students a mentor works with, the difficulty of the proposals and the
| skill level of the students, most mentors have let us know that they
| underestimated the amount of time they would need to invest in GSoC. Five
| hours per student per week is a reasonable estimate.

Sounds like doing it part time after work is going to be a bit tough - but 
five hours per student per week in summer should be doable, at least for me.

As these are summer projects and certainly some of us are going to be on 
vacation during that time we should plan for having one backup for each 
mentor that goes on vacation at a different time.


Isabel


-- 
You were s'posed to laugh!
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  




Re: Google Summer of Code

2008-03-05 Thread Isabel Drost
On Wednesday 05 March 2008, Simon Willnauer wrote:
> You could have a look at the FAQ or the GSoC pages
> http://code.google.com/soc/2008/ and
> http://code.google.com/soc/2008/faqs.html respectively.

Hmm, there is little about what is expected of mentors apart from the following
rather general question, is there?

| 2. What is the role of a mentoring organization?

If we want to take part in GSoC, from that question, I guess we need a little 
more than only mentors:

| A pool of project ideas for students to choose from.

Grant already asked for ideas.

| An organization administrator to act as the project's main point of contact
| for Google;  

Any volunteers?

| A person or group responsible for review and ranking of student
| applications,

I'd be happy to help out here. Anyone else?


| A person or group of people responsible for monitoring the progress of each
| accepted student and to mentor her/him as the project progresses;  + backup

That would be the mentors Grant already mentioned.


| A written evaluation of each student participant, including how s/he worked
| with the group, whether s/he should be invited back should we do another
| Google Summer of Code, etc.   

I guess this could be done by each member but should be reviewed by more than 
one person, as it looks like the evaluations are going to be highly 
subjective.


> Or join the #gsoc IRC channel on freenode.

Sorry, but working on Mahout only after work I usually do not have the time to 
follow irc channels :(

Is there anyone here who has already taken part in GSoC and could give us a
little summary of her experiences? Is it possible to do the mentoring job in
one's free time after work, or should one plan more time than that?

Isabel


-- 
"Remember kids, if there's a loaded gun in the room, be sure that you're the 
one holding it" -- Captain Combat
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  




Re: Google Summer of Code

2008-03-05 Thread Simon Willnauer
On Wed, Mar 5, 2008 at 8:52 PM, Isabel Drost
<[EMAIL PROTECTED]> wrote:
> On Saturday 01 March 2008, Grant Ingersoll wrote:
>  > > Any of the other committers willing to mentor?
>
>  Could you please clarify - or point to a page that does so - about what it
>  means to become a Mentor? Anyone have any experience being a mentor? I would
>  be happy to help - but I would rather learn a bit more about the mentor side
>  of GSoC

You could have a look at the FAQ or the GSoC pages
http://code.google.com/soc/2008/ and
http://code.google.com/soc/2008/faqs.html respectively.
Or join the #gsoc IRC channel on freenode.

you could also contact some of the google folks they are very helpful
if you have questions beyond the FAQ. (watch out for "lh" in the IRC
channel)

best regards,

simon
>
>
>  > Also, any thoughts on what we might want someone to do?  I think it
>  > would be great to have someone implement one of the algorithms on our
>  > wiki.
>
>  I think just implementing one of the algorithms might help Mahout but it might
>  be a bit hard to attract students to do that without some real task at hand.
>
>  What about putting up tasks that solve problems e.g. from this year's KDD Cup
>  or the web spam challenge? Then the benefit for participants would be
>  twofold - first they would help Mahout and second they could compete with others
>  in the field.
>
>  Isabel
>
>  --
>  A man's best friend is his dogma.
>   |\  _,,,---,,_   Web:   
>   /,`.-'`'-.  ;-;;,_
>   |,4-  ) )-,_..;\ (  `'-'
>  '---''(_/--'  `-'\_) (fL)  IM:  
>


Re: Google Summer of Code

2008-03-05 Thread Isabel Drost
On Saturday 01 March 2008, Grant Ingersoll wrote:
> > Any of the other committers willing to mentor? 

Could you please clarify - or point to a page that does so - about what it 
means to become a Mentor? Anyone have any experience being a mentor? I would 
be happy to help - but I would rather learn a bit more about the mentor side 
of GSoC 

> Also, any thoughts on what we might want someone to do?  I think it
> would be great to have someone implement one of the algorithms on our
> wiki.

I think just implementing one of the algorithms might help Mahout but it might 
be a bit hard to attract students to do that without some real task at hand.

What about putting up tasks that solve problems e.g. from this year's KDD Cup
or the web spam challenge? Then the benefit for participants would be
twofold - first they would help Mahout and second they could compete with others
in the field.

Isabel

-- 
A man's best friend is his dogma.
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  




Re: Google Summer of Code

2008-03-01 Thread Grant Ingersoll
Well, here's your chance.  Make a proposal of something you would like  
to work on that fits with what we are doing and we'll discuss it and  
possibly put it up as a project.


I think it would be great if anyone took on something like M/R SVM  
implementation, or one of the other ones that is not already under way.


-Grant

On Mar 1, 2008, at 2:08 AM, [EMAIL PROTECTED] wrote:


Hi Gang,

I think we should put in for this:
http://wiki.apache.org/general/SummerOfCode2008

I would bet there are some students interested in doing ML on Hadoop.

Yes. I would be happy to work :) Didn't know that Mahout is also
participating in SoC.

Any of the other committers willing to mentor?  I am, but would also
like some others to help out if you have the time.  See
http://wiki.apache.org/general/SummerOfCodeMentor
.


Thanks,
Grant













Re: Google Summer of Code

2008-02-29 Thread jaideep
> Hi Gang,
>
> I think we should put in for this:
> http://wiki.apache.org/general/SummerOfCode2008
>
> I would bet there are some students interested in doing ML on Hadoop.
Yes. I would be happy to work :) Didn't know that Mahout is also
participating in SoC.
> Any of the other committers willing to mentor?  I am, but would also
> like some others to help out if you have the time.  See
> http://wiki.apache.org/general/SummerOfCodeMentor
> .
>
>
> Thanks,
> Grant
>
>
>
>





Re: Google Summer of Code

2008-02-29 Thread Grant Ingersoll
Also, any thoughts on what we might want someone to do?  I think it  
would be great to have someone implement one of the algorithms on our  
wiki.


-Grant

On Feb 29, 2008, at 9:33 PM, Grant Ingersoll wrote:


Hi Gang,

I think we should put in for this: 
http://wiki.apache.org/general/SummerOfCode2008

I would bet there are some students interested in doing ML on
Hadoop.  Any of the other committers willing to mentor?  I am, but
would also like some others to help out if you have the time.  See
http://wiki.apache.org/general/SummerOfCodeMentor.



Thanks,
Grant