[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19204

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19204#discussion_r139718610

Diff: python/pyspark/ml/evaluation.py

```diff
@@ -328,6 +329,77 @@ def setParams(self, predictionCol="prediction", labelCol="label",
         kwargs = self._input_kwargs
         return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+                          JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Evaluator for Clustering results, which expects two input
+    columns: prediction and features.
+
+    >>> from pyspark.ml.linalg import Vectors
+    >>> scoreAndLabels = map(lambda x: (Vectors.dense(x[0]), x[1]),
```

```scoreAndLabels``` -> ```featureAndPredictions```: the dataset here is different from the other evaluators', so we should use a more accurate name. Thanks.
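For illustration only, a minimal sketch of the doctest line with the suggested rename applied; plain lists stand in for `Vectors.dense` so the snippet runs without Spark, and the sample values are made up:

```python
# Each row pairs a feature vector with its predicted cluster id, so
# featureAndPredictions describes the contents better than scoreAndLabels
# (which fits classification evaluators, not clustering).
rows = [([0.0, 0.5], 0.0), ([9.0, 8.0], 1.0)]
featureAndPredictions = list(map(lambda x: (x[0], x[1]), rows))
print(featureAndPredictions[0])  # ([0.0, 0.5], 0.0)
```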
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19204#discussion_r139718610

Diff: python/pyspark/ml/evaluation.py

```diff
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
         kwargs = self._input_kwargs
         return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+                          JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Evaluator for Clustering results, which expects two input
+    columns: prediction and features.
+
+    >>> from sklearn import datasets
+    >>> from pyspark.sql.types import *
+    >>> from pyspark.ml.linalg import Vectors, VectorUDT
+    >>> from pyspark.ml.evaluation import ClusteringEvaluator
+    ...
+    >>> iris = datasets.load_iris()
+    >>> iris_rows = [(Vectors.dense(x), int(iris.target[i]))
+    ...              for i, x in enumerate(iris.data)]
+    >>> schema = StructType([
+    ...     StructField("features", VectorUDT(), True),
+    ...     StructField("cluster_id", IntegerType(), True)])
+    >>> rdd = spark.sparkContext.parallelize(iris_rows)
+    >>> dataset = spark.createDataFrame(rdd, schema)
+    ...
+    >>> evaluator = ClusteringEvaluator(predictionCol="cluster_id")
+    >>> evaluator.evaluate(dataset)
+    0.656...
+    >>> ce_path = temp_path + "/ce"
+    >>> evaluator.save(ce_path)
+    >>> evaluator2 = ClusteringEvaluator.load(ce_path)
+    >>> str(evaluator2.getPredictionCol())
+    'cluster_id'
+
+    .. versionadded:: 2.3.0
+    """
+    metricName = Param(Params._dummy(), "metricName",
+                       "metric name in evaluation (silhouette)",
+                       typeConverter=TypeConverters.toString)
+
+    @keyword_only
+    def __init__(self, predictionCol="prediction", featuresCol="features",
+                 metricName="silhouette"):
+        """
+        __init__(self, predictionCol="prediction", featuresCol="features", \
+                 metricName="silhouette")
+        """
+        super(ClusteringEvaluator, self).__init__()
+        self._java_obj = self._new_java_obj(
+            "org.apache.spark.ml.evaluation.ClusteringEvaluator", self.uid)
+        self._setDefault(predictionCol="prediction", featuresCol="features",
```

I sent #19262 to fix the same issue for the other evaluators; please feel free to comment. Thanks.
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19204#discussion_r139312388

Diff: python/pyspark/ml/evaluation.py

```diff
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
         kwargs = self._input_kwargs
         return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+                          JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Evaluator for Clustering results, which expects two input
+    columns: prediction and features.
+
+    >>> from sklearn import datasets
+    >>> from pyspark.sql.types import *
+    >>> from pyspark.ml.linalg import Vectors, VectorUDT
+    >>> from pyspark.ml.evaluation import ClusteringEvaluator
+    ...
+    >>> iris = datasets.load_iris()
+    >>> iris_rows = [(Vectors.dense(x), int(iris.target[i]))
+    ...              for i, x in enumerate(iris.data)]
+    >>> schema = StructType([
+    ...     StructField("features", VectorUDT(), True),
+    ...     StructField("cluster_id", IntegerType(), True)])
```

```cluster_id``` -> ```prediction```, to emphasize that this is the predicted value, not the ground truth.
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19204#discussion_r139312199

Diff: python/pyspark/ml/evaluation.py

```diff
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
         kwargs = self._input_kwargs
         return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+                          JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Evaluator for Clustering results, which expects two input
+    columns: prediction and features.
+
+    >>> from sklearn import datasets
+    >>> from pyspark.sql.types import *
+    >>> from pyspark.ml.linalg import Vectors, VectorUDT
+    >>> from pyspark.ml.evaluation import ClusteringEvaluator
+    ...
+    >>> iris = datasets.load_iris()
+    >>> iris_rows = [(Vectors.dense(x), int(iris.target[i]))
+    ...              for i, x in enumerate(iris.data)]
+    >>> schema = StructType([
+    ...     StructField("features", VectorUDT(), True),
+    ...     StructField("cluster_id", IntegerType(), True)])
+    >>> rdd = spark.sparkContext.parallelize(iris_rows)
+    >>> dataset = spark.createDataFrame(rdd, schema)
+    ...
+    >>> evaluator = ClusteringEvaluator(predictionCol="cluster_id")
+    >>> evaluator.evaluate(dataset)
+    0.656...
+    >>> ce_path = temp_path + "/ce"
+    >>> evaluator.save(ce_path)
+    >>> evaluator2 = ClusteringEvaluator.load(ce_path)
+    >>> str(evaluator2.getPredictionCol())
+    'cluster_id'
+
+    .. versionadded:: 2.3.0
+    """
+    metricName = Param(Params._dummy(), "metricName",
+                       "metric name in evaluation (silhouette)",
+                       typeConverter=TypeConverters.toString)
+
+    @keyword_only
+    def __init__(self, predictionCol="prediction", featuresCol="features",
+                 metricName="silhouette"):
+        """
+        __init__(self, predictionCol="prediction", featuresCol="features", \
+                 metricName="silhouette")
+        """
+        super(ClusteringEvaluator, self).__init__()
+        self._java_obj = self._new_java_obj(
+            "org.apache.spark.ml.evaluation.ClusteringEvaluator", self.uid)
+        self._setDefault(predictionCol="prediction", featuresCol="features",
```

Remove the default values for ```predictionCol``` and ```featuresCol```, as they are already set in ```HasPredictionCol``` and ```HasFeaturesCol```.
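The redundancy this comment points at can be sketched with a plain-Python stand-in for Spark's shared-param mixins (class and attribute names here are illustrative, not Spark's actual internals): once the mixin registers the default, repeating it in the subclass's `__init__` is dead code.

```python
class HasPredictionCol(object):
    """Stand-in for Spark's mixin: registers the param default once."""
    def __init__(self):
        super(HasPredictionCol, self).__init__()
        self._defaults = {"predictionCol": "prediction"}


class ClusteringEvaluator(HasPredictionCol):
    def __init__(self):
        super(ClusteringEvaluator, self).__init__()
        # No need to re-register predictionCol="prediction" here:
        # the mixin's __init__ above already did it.

    def getOrDefault(self, name):
        return self._defaults[name]


evaluator = ClusteringEvaluator()
print(evaluator.getOrDefault("predictionCol"))  # prediction
```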
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19204#discussion_r139312034

Diff: python/pyspark/ml/evaluation.py

```diff
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
         kwargs = self._input_kwargs
         return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+                          JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Evaluator for Clustering results, which expects two input
+    columns: prediction and features.
+
+    >>> from sklearn import datasets
+    >>> from pyspark.sql.types import *
+    >>> from pyspark.ml.linalg import Vectors, VectorUDT
+    >>> from pyspark.ml.evaluation import ClusteringEvaluator
+    ...
+    >>> iris = datasets.load_iris()
```

Please don't involve other libraries unless necessary. The doctest here shows fresh users how to use ```ClusteringEvaluator```, so we should focus on the evaluator and keep it as simple as possible. You can refer to the other evaluators for how to construct a simple dataset.
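As a sketch of what a self-contained doctest dataset could look like without sklearn (the values are made up; the real doctest would wrap each feature list in `Vectors.dense` and call `spark.createDataFrame`):

```python
# Two small, well-separated clusters, compact enough to inline in a doctest.
data = [([0.0, 0.5], 0.0), ([0.5, 0.0], 0.0),
        ([9.0, 8.0], 1.0), ([8.0, 9.0], 1.0)]
# In the doctest this would become something like:
# dataset = spark.createDataFrame(
#     [(Vectors.dense(f), p) for f, p in data], ["features", "prediction"])
clusters = sorted(set(p for _, p in data))
print(clusters)  # [0.0, 1.0]
```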
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19204#discussion_r139312046

Diff: python/pyspark/ml/evaluation.py

```diff
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
         kwargs = self._input_kwargs
         return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+                          JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Evaluator for Clustering results, which expects two input
+    columns: prediction and features.
+
+    >>> from sklearn import datasets
+    >>> from pyspark.sql.types import *
+    >>> from pyspark.ml.linalg import Vectors, VectorUDT
+    >>> from pyspark.ml.evaluation import ClusteringEvaluator
```

Remove this; it's not necessary.
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19204#discussion_r138763970

Diff: python/pyspark/ml/evaluation.py

```diff
@@ -328,6 +329,87 @@ def setParams(self, predictionCol="prediction", labelCol="label",
         kwargs = self._input_kwargs
         return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+                          JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Evaluator for Clustering results, which expects two input
+    columns: prediction and features.
+
+    >>> from sklearn import datasets
+    >>> from pyspark.sql.types import *
+    >>> from pyspark.ml.linalg import Vectors, VectorUDT
+    >>> from pyspark.ml.evaluation import ClusteringEvaluator
+    ...
+    >>> iris = datasets.load_iris()
+    >>> iris_rows = [(Vectors.dense(x), int(iris.target[i]))
+    ...              for i, x in enumerate(iris.data)]
+    >>> schema = StructType([
+    ...     StructField("features", VectorUDT(), True),
+    ...     StructField("cluster_id", IntegerType(), True)])
+    >>> rdd = spark.sparkContext.parallelize(iris_rows)
+    >>> dataset = spark.createDataFrame(rdd, schema)
+    ...
+    >>> evaluator = ClusteringEvaluator(predictionCol="cluster_id")
+    >>> evaluator.evaluate(dataset)
+    0.656...
+    >>> ce_path = temp_path + "/ce"
+    >>> evaluator.save(ce_path)
+    >>> evaluator2 = ClusteringEvaluator.load(ce_path)
+    >>> str(evaluator2.getPredictionCol())
+    'cluster_id'
+
+    .. versionadded:: 2.3.0
+    """
+    metricName = Param(Params._dummy(), "metricName",
+                       "metric name in evaluation "
+                       "(silhouette)",
```

The string spans multiple lines; we should use """ instead of "". Otherwise, move it onto a single line.
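The underlying Python behavior is implicit concatenation of adjacent string literals, which is what makes the split `"..."` `"..."` form work at all; a quick sketch of the two alternatives the comment contrasts:

```python
# Adjacent string literals across lines are concatenated at compile time;
# note the trailing space kept inside the first literal.
doc_split = ("metric name in evaluation "
             "(silhouette)")
# The same text as a single literal on one line.
doc_single = "metric name in evaluation (silhouette)"
print(doc_split == doc_single)  # True
```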
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/19204 [SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator

## What changes were proposed in this pull request?

Added a Python interface for ClusteringEvaluator.

## How was this patch tested?

Manual test, e.g. the example Python code in the comments.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-21981

Alternatively, you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19204.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19204

commit 31b3c6c7e1298a1b4bf1fc969cee50534970ab0a
Author: Marco Gaido
Date: 2017-09-05T17:22:21Z

    Added python interface for ClusteringEvaluator
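For background on the metric this evaluator exposes, here is a pure-Python sketch of the classic silhouette score on a tiny, made-up 1-D dataset. This is the textbook Euclidean formula, not Spark's implementation (which uses a squared-distance variant for scalability):

```python
def silhouette_score(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points, where a is the
    mean intra-cluster distance and b is the mean distance to the nearest
    other cluster. Assumes every cluster has at least two points."""
    def dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of p's own cluster.
        same = [q for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        a = sum(dist(p, q) for q in same) / len(same)
        # b: mean distance to the closest other cluster.
        b = min(sum(dist(p, q) for q, l in zip(points, labels) if l == other)
                / labels.count(other)
                for other in set(labels) if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two tight, well-separated clusters give a score close to 1.
points = [(0.0,), (0.5,), (9.0,), (9.5,)]
labels = [0, 0, 1, 1]
print(round(silhouette_score(points, labels), 3))  # 0.944
```

A score near 1 means points sit well inside their own cluster; values near 0 indicate overlapping clusters, and negative values suggest misassigned points.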