[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605639#comment-14605639 ]
Rakesh Chalasani commented on SPARK-8587:
-----------------------------------------

Hi Sam,

"computeCost" now returns the cumulative cost over a dataset, rather than the cost per sample, which I think is what this JIRA is asking for. Internally, predict does compute the distance to the nearest cluster center but returns only the predicted center. So adding a separate method that returns distances would do the same work twice, which is what was pointed out above for Bradley.

In Pipelines, on the other hand, this can be handled more gracefully and efficiently by adding a column to the returned DataFrame. If that works for you, can you close this JIRA? I will create another one for adding distances to the KMeans pipeline, once that is merged.

Thanks.

> Return cost and cluster index KMeansModel.predict
> -------------------------------------------------
>
>                 Key: SPARK-8587
>                 URL: https://issues.apache.org/jira/browse/SPARK-8587
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Sam Stoelinga
>            Priority: Minor
>
> Looking at PySpark the implementation of KMeansModel.predict
> https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102 :
> Currently:
> it calculates the cost of the closest cluster and returns the index only.
> My expectation:
> Easy way to let the same function or a new function to return the cost with
> the index.
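
To illustrate the point in the comment about not doing the job twice, here is a minimal sketch in plain Python of a single pass that yields both the index of the nearest center and the cost of that assignment. It is not the MLlib implementation: the name predict_with_cost is hypothetical and not part of PySpark, and the cost is assumed to be the squared Euclidean distance to the nearest center.

```python
from typing import Sequence, Tuple


def predict_with_cost(
    point: Sequence[float], centers: Sequence[Sequence[float]]
) -> Tuple[int, float]:
    """Return (index of the nearest center, squared distance to it).

    Hypothetical helper for illustration only; not a PySpark API.
    """
    if not centers:
        raise ValueError("at least one center is required")

    best_index = -1
    best_cost = float("inf")
    for i, center in enumerate(centers):
        # Assumed cost: squared Euclidean distance from the point to this center.
        cost = sum((p - c) ** 2 for p, c in zip(point, center))
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost


if __name__ == "__main__":
    centers = [[0.0, 0.0], [10.0, 10.0]]
    points = [[1.0, 2.0], [9.0, 9.0]]

    results = [predict_with_cost(p, centers) for p in points]
    print(results)

    # Summing the per-sample costs gives a cumulative cost over the dataset,
    # which is the kind of quantity the comment attributes to computeCost.
    print(sum(cost for _, cost in results))
```

Because the distance is already computed while searching for the nearest center, returning it alongside the index adds no extra pass over the centers.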