[ https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-27896.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0
                   2.4.4

This is resolved via https://github.com/apache/spark/pull/24756

> Fix definition of clustering silhouette coefficient for 1-element clusters
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27896
>                 URL: https://issues.apache.org/jira/browse/SPARK-27896
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.3
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 2.4.4, 3.0.0
>
>
> Reported by Samuel Kubler via email:
> In the code 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala,
> I think there is a small mistake in the class “Silhouette” when the 
> silhouette coefficient is computed for a point. According to the reference 
> paper “Silhouettes: a graphical aid to the interpretation and validation of 
> cluster analysis” (Peter J. Rousseeuw, 1987), for a point that is alone in 
> its cluster it is not currentClusterDissimilarity that should be set to 0, 
> as the code does (“val currentClusterDissimilarity = if 
> (pointClusterNumOfPoints == 1) {0.0}”), but the silhouette coefficient 
> itself: “When cluster A contains only a single object it is unclear how a(i) 
> should be defined, and then we simply set s(i) equal to zero”.
> The problem with setting currentClusterDissimilarity to zero is that the 
> silhouette coefficient can no longer be used as a criterion to determine the 
> optimal number of clusters in a clustering process: the evaluation will 
> always favor more clusters, because as the number of clusters grows, s(i) 
> converges toward 1, so the clustering merely looks more effective. I have, 
> moreover, verified this behavior on my own clustering example.
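For illustration, here is a minimal standalone sketch of the corrected per-point definition. This is not the code from the PR above; the names (pointSilhouette, dist) and the plain Euclidean distance are assumptions made for the example. What matters is the order of the checks: the singleton test short-circuits to s(i) = 0 before a(i) is ever computed.

import scala.math.{max, sqrt}

object SilhouetteSketch {
  // Euclidean distance between two points (illustrative choice; Spark's
  // evaluator actually works with squared-Euclidean or cosine distances).
  private def dist(x: Seq[Double], y: Seq[Double]): Double =
    sqrt(x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum)

  // Silhouette of one point, following Rousseeuw's definition.
  def pointSilhouette(
      point: Seq[Double],
      clusterId: Int,
      clusters: Map[Int, Seq[Seq[Double]]]): Double = {
    val own = clusters(clusterId)
    if (own.size == 1) {
      // Singleton cluster: a(i) is undefined, so s(i) itself is set to 0.
      0.0
    } else {
      // a(i): mean distance to the other members of the same cluster
      // (the point's zero distance to itself drops out of the sum).
      val a = own.map(dist(point, _)).sum / (own.size - 1)
      // b(i): smallest mean distance to the members of any other cluster.
      val b = clusters.collect { case (id, pts) if id != clusterId =>
        pts.map(dist(point, _)).sum / pts.size
      }.min
      (b - a) / max(a, b)
    }
  }
}

With this ordering, assigning every point to its own cluster yields a mean silhouette of 0 rather than 1, so the score no longer rewards degenerate clusterings.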



