[ https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun resolved SPARK-27896.
-----------------------------------
    Resolution: Fixed
    Fix Version/s: 3.0.0
                   2.4.4

This is resolved via https://github.com/apache/spark/pull/24756

> Fix definition of clustering silhouette coefficient for 1-element clusters
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27896
>                 URL: https://issues.apache.org/jira/browse/SPARK-27896
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.3
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 2.4.4, 3.0.0
>
> Reported by Samuel Kubler via email:
> In the code at
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala,
> I think there is a small mistake in the class "Silhouette" where the
> silhouette coefficient is calculated for a point. According to the reference
> paper, "Silhouettes: a graphical aid to the interpretation and validation of
> cluster analysis" (Peter J. Rousseeuw, 1987), for a point that is alone in
> its cluster it is not currentClusterDissimilarity that should be set to 0,
> as the code does ("val currentClusterDissimilarity = if
> (pointClusterNumOfPoints == 1) {0.0}"), but the silhouette coefficient
> itself. As the paper puts it: "When cluster A contains only a single object
> it is unclear how a(i) should be defined, and then we simply set s(i) equal
> to zero."
> The problem with setting currentClusterDissimilarity to zero is that the
> silhouette coefficient can no longer be used as a criterion to choose the
> optimal number of clusters, because the metric will report that the more
> clusters you have, the better the clustering. In that case, as the number of
> clusters increases, s(i) converges toward 1 (so the clustering appears to
> become more efficient). I have, besides, checked this result on a clustering
> example of my own.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
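The bias described above can be demonstrated with a minimal sketch. This is not Spark's Scala implementation: it is a hypothetical 1-D silhouette with absolute-difference distances, and the function name and `rousseeuw_singletons` flag are illustrative. It contrasts Rousseeuw's convention (s(i) = 0 for singleton clusters) with the buggy convention (a(i) = 0, i.e. currentClusterDissimilarity = 0):

```python
from collections import defaultdict

def silhouette(points, labels, rousseeuw_singletons=True):
    """Mean silhouette over 1-D points with |x - y| as the distance.

    rousseeuw_singletons=True follows Rousseeuw (1987): s(i) = 0 when
    point i's cluster has a single member. False mimics the reported
    bug, which sets the within-cluster dissimilarity a(i) to 0 instead.
    """
    d = lambda i, j: abs(points[i] - points[j])
    clusters = defaultdict(list)
    for i, l in enumerate(labels):
        clusters[l].append(i)
    scores = []
    for i, l in enumerate(labels):
        own = clusters[l]
        if len(own) == 1:
            if rousseeuw_singletons:
                scores.append(0.0)  # paper: s(i) = 0 for singletons
                continue
            a = 0.0                 # bug: a(i) = 0, so s(i) = b/b = 1
        else:
            # a(i): mean distance to the other members of i's cluster
            a = sum(d(i, j) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean distance to any other cluster
        b = min(sum(d(i, j) for j in other) / len(other)
                for k, other in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [0.0, 0.1, 5.0]
# With every point in its own cluster, the buggy convention reports a
# "perfect" clustering, while Rousseeuw's convention does not:
print(silhouette(pts, [0, 1, 2], rousseeuw_singletons=False))  # 1.0
print(silhouette(pts, [0, 1, 2], rousseeuw_singletons=True))   # 0.0
```

Under the buggy convention every singleton gets s(i) = (b - 0)/max(0, b) = 1, so splitting data into ever more clusters drives the mean silhouette toward 1, which is exactly why the metric stops being usable for selecting the number of clusters.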