Sean Owen created SPARK-27896:
---------------------------------

             Summary: Fix definition of clustering silhouette coefficient for 1-element clusters
                 Key: SPARK-27896
                 URL: https://issues.apache.org/jira/browse/SPARK-27896
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.4.3
            Reporter: Sean Owen
            Assignee: Sean Owen


Reported by Samuel Kubler via email:

In the code at
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala,
I think there is a small mistake in the class “Silhouette” where the silhouette
coefficient for a point is computed. According to the reference paper,
“Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis” (Peter J. Rousseeuw, 1987), for a point that is alone in its cluster
it is not currentClusterDissimilarity that should be set to 0, as the code
currently does (“val currentClusterDissimilarity = if (pointClusterNumOfPoints
== 1) { 0.0 }”), but the silhouette coefficient itself: “When cluster A
contains only a single object it is unclear how a(i) should be defined, and
then we simply set s(i) equal to zero”.
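
For concreteness, here is a minimal sketch of the per-point computation with
the fix applied, i.e. returning s(i) = 0 directly for a singleton cluster. The
method signature and variable names are illustrative, loosely following the
issue text; this is not the exact upstream Spark source.

    // Minimal sketch of the per-point silhouette, following Rousseeuw (1987).
    // Names loosely follow the issue text; NOT the exact Spark source.
    object SilhouetteSketch {
      def silhouetteCoefficient(
          pointClusterNumOfPoints: Long,
          totalDistanceToOwnClusterPoints: Double,
          neighboringClusterDissimilarity: Double): Double = {
        if (pointClusterNumOfPoints == 1) {
          // Per the paper: a(i) is undefined for a 1-element cluster,
          // so s(i) itself is set to zero (not a(i)).
          0.0
        } else {
          // a(i): average distance to the other points of the same cluster.
          val currentClusterDissimilarity =
            totalDistanceToOwnClusterPoints / (pointClusterNumOfPoints - 1)
          // s(i) = (b(i) - a(i)) / max(a(i), b(i))
          (neighboringClusterDissimilarity - currentClusterDissimilarity) /
            math.max(currentClusterDissimilarity, neighboringClusterDissimilarity)
        }
      }
    }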

The problem with setting currentClusterDissimilarity to zero in this way is
that the silhouette coefficient can no longer be used as a criterion for
choosing the optimal number of clusters, because the metric will report that
the more clusters you have, the better your clustering is. In that case, as
the number of clusters increases, s(i) converges toward 1, so the clustering
merely appears more efficient. I have, moreover, verified this result on my
own clustering example.
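
To make the divergence concrete: with the current definition, a singleton
point whose mean distance to its nearest other cluster is b(i) > 0 gets
s(i) = (b(i) - 0) / max(0, b(i)) = 1, the maximum possible score. A tiny
numeric illustration, with an assumed value for b(i):

    // Assumed value for illustration: mean distance from a singleton point
    // to its nearest neighboring cluster.
    val b = 2.5
    // Current (buggy) behavior: a(i) is forced to 0, so
    // s(i) = (b - 0) / max(0, b) = 1.0, regardless of b.
    val buggyScore = (b - 0.0) / math.max(0.0, b)   // 1.0
    // Paper's definition: s(i) is simply 0 for a singleton cluster.
    val correctScore = 0.0
    // Splitting data into ever more singleton clusters therefore drives the
    // mean silhouette toward 1 under the buggy definition.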


