[CONF] Apache Lucene Mahout: Canopy Clustering (page edited)

confluence Thu, 07 Feb 2008 18:43:35 -0800

Canopy Clustering (MAHOUT) edited by Jeff Eastman
      Page: http://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
   Changes: 
http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=75998&originalVersion=2&revisedVersion=3

Content:
---------------------------------------------------------------------

h1. Canopy Clustering

[Canopy Clustering|http://www.kamalnigam.com/papers/canopy-kdd00.pdf] is a very
simple, fast and surprisingly accurate method for grouping objects into
clusters. All objects are represented as a point in a multidimensional feature
space. The algorithm uses a fast approximate distance metric and two distance
thresholds T1 > T2 for processing. The basic algorithm is to begin with a set
of points and remove one at random. Create a Canopy containing this point and
iterate through the remainder of the point set. At each point, if its distance
from the first point is < T1, then add the point to the cluster. If, in
addition, the distance is < T2, then remove the point from the set. This way
points that are very close to the original will avoid all further processing.
The algorithm loops until the initial set is empty, accumulating a set of
Canopies, each containing one or more points. A given point may occur in more
than one Canopy.

Canopy Clustering is often used as an initial step in more rigorous clustering
techniques, such as [k-Means]. By starting with an initial clustering the
number of more expensive distance measurements can be significantly reduced by
ignoring points outside of the initial canopies.

h2. Strategy for parallelization

Looking at the sample Hadoop implementation in
[http://code.google.com/p/canopy-clustering/] the processing is done in 2 M/R
steps:
# The data is massaged into suitable input format
# Each mapper performs canopy clustering on the points in its input set and
outputs its canopies
# The reducer clusters the canopy centers and merges the points

Some ideas can be found in [Cluster computing and
MapReduce|http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html]
lecture video series \[by Google(r)\]; Canopy Clustering is discussed in
[lecture #4|http://www.youtube.com/watch?v=1ZDybXl212Q]. Slides can be found
[here|http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt].
Finally here is the [Wikipedia
page|http://en.wikipedia.org/wiki/Canopy_clustering_algorithm].

h2. Design of implementation

---------------------------------------------------------------------
CONFLUENCE INFORMATION
This message is automatically generated by Confluence

Unsubscribe or edit your notifications preferences
http://cwiki.apache.org/confluence/users/viewnotifications.action

If you think it was sent incorrectly contact one of the administrators
http://cwiki.apache.org/confluence/administrators.action

If you want more information on Confluence, or have a bug to report see
http://www.atlassian.com/software/confluence

[CONF] Apache Lucene Mahout: Canopy Clustering (page edited)

Reply via email to