[ 
https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605639#comment-14605639
 ] 

Rakesh Chalasani commented on SPARK-8587:
-----------------------------------------

Hi Sam,

"computeCost" currently returns the cumulative cost over a dataset rather than 
the cost per sample, which I think is what this JIRA is asking for. Internally, 
predict does compute the distance to the nearest center but returns only the 
predicted center. So adding a separate method that returns distances would do 
the same work twice, which is what was pointed out above by Bradley. In 
Pipelines, on the other hand, this can be handled more gracefully and 
efficiently by adding a column to the returned DataFrame.
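
To illustrate the point about doing the job twice, here is a minimal 
pure-Python sketch, not MLlib code. It assumes the cost is the squared 
Euclidean distance to the nearest center, and the function name 
predict_with_cost is hypothetical. A single pass over the centers yields both 
the index and the per-sample cost, and the cumulative cost is just the sum:

```python
from math import fsum


def predict_with_cost(point, centers):
    """Return (index, squared_distance) for the center nearest to `point`.

    Hypothetical helper for illustration only; it is not part of MLlib.
    """
    best_index, best_cost = -1, float("inf")
    for i, center in enumerate(centers):
        # Squared Euclidean distance between the point and this center.
        cost = fsum((p - c) ** 2 for p, c in zip(point, center))
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost


centers = [[0.0, 0.0], [10.0, 10.0]]
points = [[1.0, 1.0], [9.0, 8.0]]

# One pass per point gives both the cluster index and the per-sample cost.
results = [predict_with_cost(p, centers) for p in points]

# Summing the per-sample costs gives the cumulative cost over the dataset.
total_cost = sum(cost for _, cost in results)
```

Returning the pair from one call avoids recomputing the nearest-center search 
that predict already performs.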

If that works for you, could you close this JIRA? I will create another one for 
adding distances to the KMeans pipeline once that is merged. Thanks.

> Return cost and cluster index KMeansModel.predict
> -------------------------------------------------
>
>                 Key: SPARK-8587
>                 URL: https://issues.apache.org/jira/browse/SPARK-8587
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Sam Stoelinga
>            Priority: Minor
>
> Looking at the PySpark implementation of KMeansModel.predict 
> https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102
>  : 
> Currently:
> it calculates the cost of the closest cluster and returns the index only.
> My expectation:
> An easy way for the same function, or a new function, to return the cost 
> along with the index.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

