[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20629

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20629#discussion_r202436972

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala ---
@@ -37,7 +37,7 @@ import org.apache.spark.sql.{Row, SparkSession}
  */
 @Since("0.8.0")
 class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector],
-    @Since("2.4.0") val distanceMeasure: String)
+    @Since("2.4.0") val distanceMeasure: String, @Since("2.4.0") val trainingCost: Double)
--- End diff --

That's awesome, let's try to get that fixed up before 2.4 goes out so we don't have to add any workarounds :)
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20629#discussion_r202425169

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala ---
@@ -37,7 +37,7 @@ import org.apache.spark.sql.{Row, SparkSession}
  */
 @Since("0.8.0")
 class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector],
-    @Since("2.4.0") val distanceMeasure: String)
+    @Since("2.4.0") val distanceMeasure: String, @Since("2.4.0") val trainingCost: Double)
--- End diff --

This constructor was introduced by a previous PR for 2.4, so it has never been in a release (therefore I think we don't need to keep the old signature).
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20629#discussion_r202423675

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala ---
@@ -37,7 +37,7 @@ import org.apache.spark.sql.{Row, SparkSession}
  */
 @Since("0.8.0")
 class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector],
-    @Since("2.4.0") val distanceMeasure: String)
+    @Since("2.4.0") val distanceMeasure: String, @Since("2.4.0") val trainingCost: Double)
--- End diff --

Since we changed the constructor here, and since it is not private, we should provide a similar (and deprecated) constructor without the training cost, which calls this one with a default value.
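[Editor's note: the backward-compatible pattern suggested above (keep the old constructor shape, delegate to the new primary constructor with a default training cost, and mark it deprecated) can be sketched in plain Python. The class below is a simplified stand-in mirroring the Scala `KMeansModel` in this diff, not Spark's actual API:]

```python
import warnings


class KMeansModel:
    """Simplified stand-in for the Scala KMeansModel discussed in this diff."""

    def __init__(self, cluster_centers, distance_measure, training_cost=float("nan")):
        # New primary constructor: takes the training cost explicitly.
        self.cluster_centers = cluster_centers
        self.distance_measure = distance_measure
        self.training_cost = training_cost

    @classmethod
    def without_training_cost(cls, cluster_centers, distance_measure):
        # Deprecated secondary "constructor": preserves the pre-2.4 signature
        # and delegates to the primary one with a default training cost.
        warnings.warn(
            "This constructor is deprecated; use the one that takes a "
            "trainingCost instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return cls(cluster_centers, distance_measure)


# Old call sites keep working, but now raise a DeprecationWarning.
model = KMeansModel.without_training_cost([[0.0, 0.0]], "euclidean")
```

In Scala the same effect would be achieved with an auxiliary `this(...)` constructor annotated `@deprecated`; the Python classmethod here is just an illustration of the delegation idea.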
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20629#discussion_r195340437

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
@@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
 /**
  * param for metric name in evaluation
- * (supports `"silhouette"` (default))
+ * (supports `"silhouette"` (default), `"kmeansCost"`)
--- End diff --

Ok, I agree. Let's go with this then, thanks.
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20629#discussion_r195153337

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
@@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
 /**
  * param for metric name in evaluation
- * (supports `"silhouette"` (default))
+ * (supports `"silhouette"` (default), `"kmeansCost"`)
--- End diff --

Generally speaking I think it would make sense to maintain the fallback metric until at least Spark 3.0, at which point I think it would make sense to ask on the user and dev lists and see if anyone has hard dependencies on it or if it is safe to remove.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20629#discussion_r181547119

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
@@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
 /**
  * param for metric name in evaluation
- * (supports `"silhouette"` (default))
+ * (supports `"silhouette"` (default), `"kmeansCost"`)
--- End diff --

But does it make sense to introduce something which is already considered legacy when introduced? I think this brings up the question again: shall we maintain a metric which was introduced only temporarily as a fallback, due to the lack of better metrics?
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20629#discussion_r181547107

--- Diff: python/pyspark/ml/clustering.py ---
@@ -322,7 +323,11 @@ def computeCost(self, dataset):
         """
         Return the K-means cost (sum of squared distances of points to their
         nearest center) for this model on the given data.
+
+        .. note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
         """
+        warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator"
--- End diff --

Yes, or I can also update it here once we establish for sure what the new API should look like, whichever you prefer.
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20629#discussion_r181471237

--- Diff: python/pyspark/ml/clustering.py ---
@@ -322,7 +323,11 @@ def computeCost(self, dataset):
         """
         Return the K-means cost (sum of squared distances of points to their
         nearest center) for this model on the given data.
+
+        .. note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
         """
+        warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator"
--- End diff --

If we do go down this path we need to file a follow-up JIRA to update the Python ClusteringEvaluator.
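[Editor's note: the diff above shows the two pieces of the deprecation: a `.. note::` in the docstring and a `warnings.warn(...)` at call time, on a method that computes the K-means cost (sum of squared distances of points to their nearest center). A self-contained toy version of that pattern, with a made-up class name and a pure-Python cost computation rather than Spark's distributed one:]

```python
import warnings


class KMeansModelSketch:
    """Toy stand-in illustrating the deprecation pattern in the diff above."""

    def __init__(self, centers):
        self.centers = centers  # list of cluster centers, each a list of floats

    def computeCost(self, dataset):
        """Return the K-means cost (sum of squared distances of points to
        their nearest center) for this model on the given data.

        .. note:: Deprecated; use a proper evaluator instead.
        """
        warnings.warn(
            "Deprecated in 2.4.0. It will be removed in 3.0.0. "
            "Use ClusteringEvaluator instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # For each point, squared distance to its nearest center; then sum.
        return sum(
            min(
                sum((x - c) ** 2 for x, c in zip(point, center))
                for center in self.centers
            )
            for point in dataset
        )
```

Calling `computeCost` still returns the cost, so existing code keeps working while the warning nudges users toward the replacement.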
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20629#discussion_r181470189

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
@@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
 /**
  * param for metric name in evaluation
- * (supports `"silhouette"` (default))
+ * (supports `"silhouette"` (default), `"kmeansCost"`)
--- End diff --

If we want to consider kmeansCost a legacy metric, let's call it out as such so new users don't start adding a hard dependency on it.
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20629#discussion_r181470948

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
@@ -84,12 +85,12 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
 /**
  * param for distance measure to be used in evaluation
- * (supports `"squaredEuclidean"` (default), `"cosine"`)
+ * (supports `"squaredEuclidean"` (default), `"cosine"`, `"euclidean"`)
--- End diff --

If some models only support some distance measures we should make that clear in the docs.
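[Editor's note: for context on the three measure names in the doc line above, here is a minimal pure-Python sketch of how they differ. This is an illustration of the underlying math only, not Spark's implementation:]

```python
import math


def squared_euclidean(a, b):
    # Sum of squared coordinate differences (no square root taken).
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    # Square root of the squared Euclidean distance.
    return math.sqrt(squared_euclidean(a, b))


def cosine(a, b):
    # Cosine *distance*: 1 minus cosine similarity. Zero for parallel
    # vectors, 1 for orthogonal ones; ignores vector magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

The distinction matters for the doc comment because an evaluator configured with `squaredEuclidean` and a model trained with plain `euclidean` are measuring related but numerically different quantities.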
GitHub user mgaido91 opened a pull request:
https://github.com/apache/spark/pull/20629

[SPARK-23451][ML] Deprecate KMeans.computeCost

## What changes were proposed in this pull request?

Deprecate `KMeans.computeCost`, which was introduced as a temporary fix and is no longer needed now that we have `ClusteringEvaluator`.

## How was this patch tested?

Manual test (deprecation warning displayed).

Scala:
```
scala> model.computeCost(dataset)
warning: there was one deprecation warning; re-run with -deprecation for details
res1: Double = 0.0
```

Python:
```
>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
...
>>> model.computeCost(df)
/Users/mgaido/apache/spark/python/pyspark/ml/clustering.py:330: DeprecationWarning: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
  " instead.", DeprecationWarning)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23451

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20629.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20629

commit 2f79bb2d5c7e29e85a4a7abe63254d392a49fe53
Author: Marco Gaido
Date: 2018-02-16T16:03:09Z

    [SPARK-23451][ML] Deprecate KMeans.computeCost
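[Editor's note: the Python manual test in the PR description calls `warnings.simplefilter('always', DeprecationWarning)` first because Python's default warning filters often suppress `DeprecationWarning` outside of `__main__` and test runners. A standalone illustration of that capture pattern, using a made-up `old_api` function rather than the actual PySpark method:]

```python
import warnings


def old_api():
    # Stand-in for a deprecated method like KMeansModel.computeCost.
    warnings.warn(
        "Deprecated in 2.4.0. It will be removed in 3.0.0. "
        "Use ClusteringEvaluator instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return 0.0


# Without adjusting the filters, the warning may be silently dropped.
# catch_warnings(record=True) plus simplefilter makes it observable.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    result = old_api()

assert caught and issubclass(caught[0].category, DeprecationWarning)
```

This is the same mechanism the PR's manual test relies on to prove the warning actually fires.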