[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-07-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20629


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-07-13 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20629#discussion_r202436972
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala ---
@@ -37,7 +37,7 @@ import org.apache.spark.sql.{Row, SparkSession}
  */
 @Since("0.8.0")
 class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: 
Array[Vector],
-  @Since("2.4.0") val distanceMeasure: String)
+  @Since("2.4.0") val distanceMeasure: String, @Since("2.4.0") val 
trainingCost: Double)
--- End diff --

That's awesome, lets try and get that fixed up before 2.4 goes so we don't 
have to any workarounds :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-07-13 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20629#discussion_r202425169
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala ---
@@ -37,7 +37,7 @@ import org.apache.spark.sql.{Row, SparkSession}
  */
 @Since("0.8.0")
 class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: 
Array[Vector],
-  @Since("2.4.0") val distanceMeasure: String)
+  @Since("2.4.0") val distanceMeasure: String, @Since("2.4.0") val 
trainingCost: Double)
--- End diff --

This constructor was introduced by a previous PR for 2.4, so this was never 
out (therefore I think we don't need to keep it).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-07-13 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20629#discussion_r202423675
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala ---
@@ -37,7 +37,7 @@ import org.apache.spark.sql.{Row, SparkSession}
  */
 @Since("0.8.0")
 class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: 
Array[Vector],
-  @Since("2.4.0") val distanceMeasure: String)
+  @Since("2.4.0") val distanceMeasure: String, @Since("2.4.0") val 
trainingCost: Double)
--- End diff --

Since we changed the constructor here, and since it is not private, we 
should provide a similar (and deprecated) constructor without training cost 
which calls this with the default value.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-06-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20629#discussion_r195340437
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") 
(@Since("2.3.0") override val uid: Str
 
   /**
* param for metric name in evaluation
-   * (supports `"silhouette"` (default))
+   * (supports `"silhouette"` (default), `"kmeansCost"`)
--- End diff --

ok, I agree. Let's go on this way then, thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-06-13 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20629#discussion_r195153337
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") 
(@Since("2.3.0") override val uid: Str
 
   /**
* param for metric name in evaluation
-   * (supports `"silhouette"` (default))
+   * (supports `"silhouette"` (default), `"kmeansCost"`)
--- End diff --

Generally speaking I think it would make sense to maintain the fall-back 
metric until at least Spark 3.0 at which point I think it would make sense to 
ask on the user and dev lists and see if anyone is hard dependencies on it or 
if it is safe to remove.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-04-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20629#discussion_r181547119
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") 
(@Since("2.3.0") override val uid: Str
 
   /**
* param for metric name in evaluation
-   * (supports `"silhouette"` (default))
+   * (supports `"silhouette"` (default), `"kmeansCost"`)
--- End diff --

but does it make sense to introduce something which is already considered 
legacy when introduced?

I think this brigs up again the question: shall we maintain a metric which 
was introduced only temporary as a fallback due to the lack of better metrics?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-04-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20629#discussion_r181547107
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -322,7 +323,11 @@ def computeCost(self, dataset):
 """
 Return the K-means cost (sum of squared distances of points to 
their nearest center)
 for this model on the given data.
+
+..note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use 
ClusteringEvaluator instead.
 """
+warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. 
Use ClusteringEvaluator"
--- End diff --

yes, or I can also update it here once we establish for sure what the new 
API has to look like, as you prefer.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-04-13 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20629#discussion_r181471237
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -322,7 +323,11 @@ def computeCost(self, dataset):
 """
 Return the K-means cost (sum of squared distances of points to 
their nearest center)
 for this model on the given data.
+
+..note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use 
ClusteringEvaluator instead.
 """
+warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. 
Use ClusteringEvaluator"
--- End diff --

If we do go this path we need to file a follow up JIRA to update Python 
ClusteringEvaluator.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-04-13 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20629#discussion_r181470189
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") 
(@Since("2.3.0") override val uid: Str
 
   /**
* param for metric name in evaluation
-   * (supports `"silhouette"` (default))
+   * (supports `"silhouette"` (default), `"kmeansCost"`)
--- End diff --

If we want to consider kmeansCost a legacy function lets call it out as 
such so new people don't start adding a hard dependency to it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-04-13 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20629#discussion_r181470948
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -84,12 +85,12 @@ class ClusteringEvaluator @Since("2.3.0") 
(@Since("2.3.0") override val uid: Str
 
   /**
* param for distance measure to be used in evaluation
-   * (supports `"squaredEuclidean"` (default), `"cosine"`)
+   * (supports `"squaredEuclidean"` (default), `"cosine"`, `"euclidean"`)
--- End diff --

If some models only support some ditance measures we should make that clear 
in the docs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-02-16 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/20629

[SPARK-23451][ML] Deprecate KMeans.computeCost

## What changes were proposed in this pull request?

Deprecate `KMeans.computeCost` which was introduced as a temp fix and now 
it is not needed anymore, since we introduced `ClusteringEvaluator`.

## How was this patch tested?

manual test (deprecation warning displayed)
Scala
```
...
scala> model.computeCost(dataset)
warning: there was one deprecation warning; re-run with -deprecation for 
details
res1: Double = 0.0
```

Python
```
>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
...
>>> model.computeCost(df)
/Users/mgaido/apache/spark/python/pyspark/ml/clustering.py:330: 
DeprecationWarning: Deprecated in 2.4.0. It will be removed in 3.0.0. Use 
ClusteringEvaluator instead.
  " instead.", DeprecationWarning)
```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-23451

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20629.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20629


commit 2f79bb2d5c7e29e85a4a7abe63254d392a49fe53
Author: Marco Gaido 
Date:   2018-02-16T16:03:09Z

[SPARK-23451][ML] Deprecate KMeans.computeCost




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org