Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    @holdenk I am not sure whether we should require cluster centers for this 
metric. On the one hand, since the `ClusteringEvaluator` should be a general 
interface for all clustering models and some of them don't provide cluster 
centers, it may be a good idea to compute them when necessary. On the other 
hand, does this metric make sense for any model other than KMeans? And 
computing the centers on the test dataset would lead to different results than 
the old API we are replacing. So I am not sure it is the right thing to do.
    
    Honestly, the further we go, the more I feel that we don't really need 
to move that metric here. We can just deprecate it, pointing out that better 
metrics for evaluating a clustering are available in the 
`ClusteringEvaluator` (namely the silhouette). In this way people can move 
away from using this metric.
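    For reference, evaluating a clustering with the silhouette can be sketched 
like this (a minimal sketch: it assumes a DataFrame `dataset` with a 
`features` vector column, and `k = 3` is just an example value):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Fit a KMeans model and get cluster assignments for the data.
val model = new KMeans().setK(3).setSeed(1L).fit(dataset)
val predictions = model.transform(dataset)

// ClusteringEvaluator uses the silhouette metric by default; higher is better,
// unlike the cost, which is a sum of squared distances to be minimized.
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
```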
    
    Moreover, sklearn - which is one of the most widespread tools - doesn't 
offer the ability to compute such a cost 
(http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation).
 The only thing sklearn offers is what it calls `inertia` 
(https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/cluster/k_means_.py#L265),
 i.e. the cost computed on the training set.
    
    So, I think the best option would be to follow what sklearn does:
    
     1 - Introduce in the `KMeansSummary` (or `KMeansModel` if you prefer) 
a cost attribute computed on the training set
     2 - Deprecate this method, redirecting to `ClusteringEvaluator` for 
better metrics and/or to the newly introduced cost attribute
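    A rough sketch of what (1) and (2) might look like (all names, messages 
and the version string are hypothetical, not a final API):

```scala
class KMeansSummary private[clustering] (/* existing fields */) {
  /** Sum of squared distances of points to their nearest center, computed
   *  on the training set: the analogue of sklearn's `inertia_`. */
  lazy val trainingCost: Double = ???  // hypothetical attribute name
}

class KMeansModel private[ml] (/* existing fields */) {
  // Hypothetical deprecation: keep the old behavior, but steer users toward
  // ClusteringEvaluator or the training-set cost on the summary.
  @deprecated("Use ClusteringEvaluator instead, or the training cost " +
    "available on the model summary.", "x.y.z")
  def computeCost(dataset: Dataset[_]): Double = ???
}
```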
    
    What do you think?

