[ https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng updated SPARK-4510: --------------------------------- Labels: clustering features (was: features) > Add k-medoids Partitioning Around Medoids (PAM) algorithm > --------------------------------------------------------- > > Key: SPARK-4510 > URL: https://issues.apache.org/jira/browse/SPARK-4510 > Project: Spark > Issue Type: New Feature > Components: MLlib > Reporter: Fan Jiang > Assignee: Fan Jiang > Labels: clustering, features > Original Estimate: 0h > Remaining Estimate: 0h > > PAM (k-medoids) is more robust to noise and outliers as compared to k-means > because it minimizes a sum of pairwise dissimilarities instead of a sum of > squared Euclidean distances. A medoid can be defined as the object of a > cluster, whose average dissimilarity to all the objects in the cluster is > minimal i.e. it is a most centrally located point in the cluster. > The most common realisation of k-medoid clustering is the Partitioning Around > Medoids (PAM) algorithm and is as follows: > Initialize: randomly select (without replacement) k of the n data points as > the medoids > Associate each data point to the closest medoid. ("closest" here is defined > using any valid distance metric, most commonly Euclidean distance, Manhattan > distance or Minkowski distance) > For each medoid m > For each non-medoid data point o > Swap m and o and compute the total cost of the configuration > Select the configuration with the lowest cost. > Repeat steps 2 to 4 until there is no change in the medoid. > The new feature for MLlib will contain 5 new files > /main/scala/org/apache/spark/mllib/clustering/PAM.scala > /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala > /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala > /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala > /main/scala/org/apache/spark/examples/mllib/KMedoids.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org