Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20629

@holdenk I am not sure about requiring cluster centers for this metric. On one hand, since the `ClusteringEvaluator` should be a general interface for all clustering models and some of them don't provide cluster centers, it may be a good idea to compute them when necessary. On the other hand, does this metric make sense for any model other than KMeans? And computing the centers of the test dataset would lead to different results than the old API we are replacing. So I am not sure it is the right thing to do.

Honestly, the more we go on, the more I feel that we don't really need to move that metric here. We can just deprecate it, pointing out that better metrics for evaluating a clustering are available in the `ClusteringEvaluator` (namely the silhouette). In this way people can move away from using this metric. Moreover, sklearn - which is one of the most widespread tools - doesn't offer the ability to compute such a cost (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). The only thing sklearn offers is what it calls `inertia` (https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/cluster/k_means_.py#L265), i.e. the cost computed on the training set.

So, I think the best option would be to follow what sklearn does:

1 - introduce in the `KMeansSummary` (or `KMeansModel` if you prefer) the cost attribute on the training set
2 - deprecate this method, redirecting to `ClusteringEvaluator` for better metrics and/or to the cost attribute introduced in step 1

What do you think?
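For reference, the "inertia" discussed above is just the sum of squared distances from each training point to its nearest cluster center. A minimal plain-Python sketch (not Spark's or sklearn's actual implementation; the toy points and centers below are made up for illustration):

```python
def inertia(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        # squared distance to the closest cluster center
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

# Toy example: two well-separated clusters of two points each.
points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0)]
centers = [(0.5, 0.0), (9.5, 9.0)]
print(inertia(points, centers))  # each point is 0.5 away from its center: 4 * 0.25 = 1.0
```

Note this is exactly the quantity that depends on having cluster centers at all, which is why it fits a KMeans-specific summary better than a model-agnostic evaluator.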