Hi Matthew,

I've been looking over the mean-shift papers for the last several days.
While the details of the math are still sinking in, it looks like the
basic algorithm might be summarized as follows:

Points in an n-dimensional feature space are migrated iteratively in the
direction of the maxima of their local density functions. Points within
the same "basin of attraction" all converge to the same maximum and thus
belong to the same cluster.
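In code, the iteration I have in mind looks something like this toy
sketch (flat kernel, Euclidean distance, my own names throughout; just a
sanity check on my reading of the papers, not an implementation
proposal):

```python
import math

def mean_shift(points, bandwidth=1.0, tol=1e-6, max_iter=100):
    """Shift each point to the mean of its neighbors until it stops moving."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    modes = []
    for p in points:
        x = list(p)
        for _ in range(max_iter):
            # Flat kernel: the window is all points within the bandwidth.
            window = [q for q in points if dist(x, q) <= bandwidth]
            mean = [sum(c) / len(window) for c in zip(*window)]
            if dist(x, mean) < tol:
                break  # converged to a density maximum (a mode)
            x = mean
        modes.append(tuple(round(c, 3) for c in x))
    return modes

# Two well-separated groups of 1-d points land on two distinct modes.
print(mean_shift([(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)], bandwidth=1.0))
# → [(0.1,), (0.1,), (0.1,), (5.05,), (5.05,)]
```

Points whose trajectories end at the same mode would then be labeled as
one cluster.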

A physical analogy might be(?):

Gas particles in 3-space, operating with gravitational attraction but
without momentum, would tend to cluster similarly.

The algorithm seems to require that each point be compared with every
other point. This might force each mapper to see all of the points, thus
frustrating scalability. OTOH, canopy clustering avoids this by
clustering the clusters produced by the subsets of points seen by each
mapper. K-means only requires that each point be compared with the
cluster centers, not with the other points, and a similar iterative
structure over clusters (a much smaller, constant number) might be
employed here.
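To make the contrast concrete, here is a toy sketch of the k-means
assignment step (not Mahout's implementation, names are mine): each
point touches only the small center list, which is why a mapper can
carry just the centers and stream over its shard of points.

```python
def assign(points, centers):
    # Each point is compared only with the k centers, never with the
    # other points, so a mapper needs just the (small, constant) center list.
    clusters = {i: [] for i in range(len(centers))}
    for p in points:
        best = min(range(len(centers)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
        clusters[best].append(p)
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers = [(0, 0), (10, 10)]
print(assign(points, centers))
# → {0: [(0, 0), (0, 1)], 1: [(10, 10), (10, 11)]}
```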

There is a lot of locality in the local density function window, and
this could perhaps be exploited. If points could be pre-clustered (as
canopy is often used to prime the k-means iterations), parallelization
might be feasible.
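The canopy pass I have in mind is roughly this (thresholds T1 > T2 with
a cheap distance measure; a toy sketch with hypothetical names, not the
Mahout code):

```python
def canopy(points, t1, t2, dist):
    # T1 > T2: points within T1 of a center join its canopy; points
    # within T2 are also removed from consideration as future centers.
    canopies = []
    remaining = list(points)
    while remaining:
        center = remaining.pop(0)
        members = [center]
        kept = []
        for p in remaining:
            d = dist(center, p)
            if d < t1:
                members.append(p)
            if d >= t2:
                kept.append(p)
        remaining = kept
        canopies.append((center, members))
    return canopies

one_d = [0.0, 0.5, 5.0, 5.5]
print(canopy(one_d, t1=2.0, t2=1.0, dist=lambda a, b: abs(a - b)))
# → [(0.0, [0.0, 0.5]), (5.0, [5.0, 5.5])]
```

The resulting canopy centers could then seed the more expensive
iterations, as is done when canopy primes k-means.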

Are these observations within a "basin of attraction" to your
understanding of mean-shift <grin>? 

Jeff

 

-----Original Message-----
From: Matthew Riley [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 06, 2008 11:46 AM
To: mahout-dev@lucene.apache.org
Subject: Re: Google Summer of Code[esp. More Clustering]

Hey Jeff-

I'm certainly willing to put some energy into developing implementations
of these algorithms, and it's good to hear that you may be interested in
guiding us in the right direction.

Here are the references I learned the algorithms from- some are more
detailed than others:

Mean-Shift clustering was introduced here and this paper is a thorough
reference:
Mean-Shift: A Robust Approach to Feature Space Analysis
http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf

And here's a PDF with just the guts of the algorithm outlined:
homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf

It looks like there isn't a definitive reference for the k-means
approximation with randomized k-d trees, but there are promising results
introduced here:

Object retrieval with large vocabularies and fast spatial matching:
http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf
And a deeper explanation of the technique here:

Randomized KD-Trees for Real-Time Keypoint Detection:
ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521

Let me know what you think.

Matt

On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <[EMAIL PROTECTED]>
wrote:

> Hi Matthew,
>
> As with most open source projects, "interest" is mainly a function of
> the willingness of somebody to contribute their energy. Clustering is
> certainly within the scope of the project. I'd be interested in
> exploring additional clustering algorithms with you and your colleague.
> I'm a complete noob in this area and it is always enlightening to work
> with students who have more current theoretical exposure.
>
> Do you have some links on these approaches that you find particularly
> helpful?
>
> Jeff
>
> -----Original Message-----
> From: Matthew Riley [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 05, 2008 11:11 PM
> To: mahout-dev@lucene.apache.org; [EMAIL PROTECTED]
> Subject: Re: Google Summer of Code
>
> Hey everyone-
>
> I've been watching the mailing list for a little while now, hoping to
> contribute once I became more familiar, but I wanted to jump in here now
> and express my interest in the Summer of Code project. I'm currently a
> graduate student in electrical engineering at UT-Austin working in
> computer vision, which is closely tied to many of the problems Mahout is
> addressing (especially in my area of content-based retrieval).
>
> What can I do to help out?
>
> I've discussed some potential Mahout projects with another student
> recently- mostly focused around approximate k-means algorithms (since
> that's a problem I've been working on lately). It sounds like you guys
> are already implementing canopy clustering for k-means- Is there any
> interest in developing another approximation algorithm based on
> randomized kd-trees for high-dimensional data? What about mean-shift
> clustering?
>
> Again, I would be glad to help in any way I can.
>
> Matt
>
> On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]>
> wrote:
>
> > On Saturday 01 March 2008, Grant Ingersoll wrote:
> > > Also, any thoughts on what we might want someone to do?  I think it
> > > would be great to have someone implement one of the algorithms on
> > > our wiki.
> >
> > Just as a general note, the deadline for applications:
> >
> > March 12: Mentoring organization application deadline (12 noon
> > PDT/19:00 UTC).
> >
> > I suppose we should identify interesting tasks before that deadline.
> > As a general guideline for mentors and for project proposals:
> >
> > http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
> >
> > Isabel
> >
> > --
> > Better late than never.         -- Titus Livius (Livy)
> >   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
> >  /,`.-'`'    -.  ;-;;,_
> >  |,4-  ) )-,_..;\ (  `'-'
> > '---''(_/--'  `-'\_) (fL)  IM:  <xmpp://[EMAIL PROTECTED]>
> >
>
