[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

Shashikant Kore (JIRA) Wed, 10 Feb 2010 02:05:02 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831926#action_12831926
 ]


Shashikant Kore commented on MAHOUT-153:
----------------------------------------

Pallavi,

I can see two potential improvements. (I haven't really tried this patch, 
though.)

1. For the distance measure, if you use the distance() overload that accepts 
one additional parameter (named, rather inaccurately, "centroid length square"), 
distance calculations can be much faster. These changes were made in MAHOUT-121.

2. In the map phase, the distance to every centroid is calculated. In each 
iteration one new seed is selected. Ideally, only the distance to this new seed 
needs to be calculated, as the distances from a vector to the other seeds remain 
unchanged. But this requires maintaining state (the distance from each vector to 
its nearest centroid) across runs. That stored distance then needs to be compared 
with the distance to the newly selected centroid.
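
To make the two points above concrete, here is a minimal, self-contained 
sketch. It is not Mahout code and does not use Mahout's API: the class and 
method names (SeedDistanceSketch, squaredDistance, newCache, addSeed) are 
hypothetical, vectors are plain dense arrays, and the per-vector state is kept 
in memory, whereas a MapReduce job would have to persist it between runs. It 
works with squared Euclidean distances throughout.

```java
import java.util.Arrays;

/** Illustrative only; not Mahout code. All names here are hypothetical. */
final class SeedDistanceSketch {

    /** Dot product of two dense vectors of equal length. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /**
     * Point 1: squared Euclidean distance using a precomputed squared
     * length of the centroid, via |c|^2 - 2 c.v + |v|^2.
     */
    static double squaredDistance(double[] centroid, double centroidLengthSquare, double[] v) {
        double d = centroidLengthSquare - 2.0 * dot(centroid, v) + dot(v, v);
        return Math.max(0.0, d); // rounding can push the result slightly below zero
    }

    /** Creates the per-vector cache; every entry starts at infinity (no seeds yet). */
    static double[] newCache(int numPoints) {
        double[] nearest = new double[numPoints];
        Arrays.fill(nearest, Double.POSITIVE_INFINITY);
        return nearest;
    }

    /** Point 2: compare each vector only with the newly selected seed. */
    static void addSeed(double[][] points, double[] nearest, double[] newSeed) {
        double seedLengthSquare = dot(newSeed, newSeed); // computed once per seed
        for (int i = 0; i < points.length; i++) {
            double d = squaredDistance(newSeed, seedLengthSquare, points[i]);
            if (d < nearest[i]) {
                nearest[i] = d; // the new seed is closer than any earlier one
            }
        }
    }
}
```

With this layout, each new seed costs one distance computation per vector 
instead of one per (vector, seed) pair.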

A minor correction in map(): floating-point values shouldn't be compared with 
!=. So you may have to change that to something like min < 0.0001. 
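
A small sketch of the tolerance-based comparison suggested above. The class 
and method names are hypothetical, and the 0.0001 threshold is only the value 
mentioned in this comment; a suitable tolerance depends on the scale of the 
data.

```java
/** Illustrative only; names and tolerance are assumptions, not taken from the patch. */
final class ApproxCompare {

    /** Tolerance from the comment above; adjust to the scale of the data. */
    static final double EPSILON = 0.0001;

    /** True when x is close enough to zero to be treated as zero. */
    static boolean isNearZero(double x) {
        return Math.abs(x) < EPSILON;
    }

    /** True when a and b differ by less than the tolerance. */
    static boolean approxEquals(double a, double b) {
        return Math.abs(a - b) < EPSILON;
    }
}
```

For example, 0.1 + 0.2 != 0.3 evaluates to true in Java because of rounding, 
while approxEquals(0.1 + 0.2, 0.3) treats the two values as equal.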

> Implement kmeans++ for initial cluster selection in kmeans
> ----------------------------------------------------------
>
>                 Key: MAHOUT-153
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-153
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.2
>         Environment: OS Independent
>            Reporter: Panagiotis Papadimitriou
>            Assignee: Ted Dunning
>             Fix For: 0.4
>
>         Attachments: Mahout-153.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The current implementation of k-means includes the following algorithms for 
> initial cluster selection (seed selection): 1) random selection of k points, 
> 2) use of canopy clusters.
> I plan to implement k-means++. The details of the algorithm are available 
> here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
> Design Outline: I will create an abstract class SeedGenerator and a subclass 
> KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will 
> become a subclass of SeedGenerator.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
