[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...

2017-09-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19204


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...

2017-09-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19204#discussion_r139718610
  
--- Diff: python/pyspark/ml/evaluation.py ---
@@ -328,6 +329,77 @@ def setParams(self, predictionCol="prediction", labelCol="label",
 kwargs = self._input_kwargs
 return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+Evaluator for Clustering results, which expects two input
+columns: prediction and features.
+
+>>> from pyspark.ml.linalg import Vectors
+>>> scoreAndLabels = map(lambda x: (Vectors.dense(x[0]), x[1]),
--- End diff --

```scoreAndLabels``` -> ```featureAndPredictions```: the dataset here is different from the other evaluators', so we should use a more accurate name. Thanks.
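For illustration, a minimal pure-Python sketch of the suggested naming (plain tuples stand in for `pyspark.ml.linalg.Vectors` so this runs without Spark; the names follow the reviewer's suggestion and are not the final API):

```python
# Each row pairs a feature vector with a *predicted* cluster id, so a name
# like featureAndPredictions describes this data better than scoreAndLabels.
raw = [(0.25, 0), (0.75, 1)]
featureAndPredictions = [((x,), p) for x, p in raw]  # ((features,), prediction)
assert featureAndPredictions == [((0.25,), 0), ((0.75,), 1)]
```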


---




[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...

2017-09-17 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19204#discussion_r139312695
  
--- Diff: python/pyspark/ml/evaluation.py ---
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
 kwargs = self._input_kwargs
 return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+Evaluator for Clustering results, which expects two input
+columns: prediction and features.
+
+>>> from sklearn import datasets
+>>> from pyspark.sql.types import *
+>>> from pyspark.ml.linalg import Vectors, VectorUDT
+>>> from pyspark.ml.evaluation import ClusteringEvaluator
+...
+>>> iris = datasets.load_iris()
+>>> iris_rows = [(Vectors.dense(x), int(iris.target[i]))
+... for i, x in enumerate(iris.data)]
+>>> schema = StructType([
+...StructField("features", VectorUDT(), True),
+...StructField("cluster_id", IntegerType(), True)])
+>>> rdd = spark.sparkContext.parallelize(iris_rows)
+>>> dataset = spark.createDataFrame(rdd, schema)
+...
+>>> evaluator = ClusteringEvaluator(predictionCol="cluster_id")
+>>> evaluator.evaluate(dataset)
+0.656...
+>>> ce_path = temp_path + "/ce"
+>>> evaluator.save(ce_path)
+>>> evaluator2 = ClusteringEvaluator.load(ce_path)
+>>> str(evaluator2.getPredictionCol())
+'cluster_id'
+
+.. versionadded:: 2.3.0
+"""
+metricName = Param(Params._dummy(), "metricName",
+   "metric name in evaluation (silhouette)",
+   typeConverter=TypeConverters.toString)
+
+@keyword_only
+def __init__(self, predictionCol="prediction", featuresCol="features",
+ metricName="silhouette"):
+"""
__init__(self, predictionCol="prediction", featuresCol="features", \
 metricName="silhouette")
+"""
+super(ClusteringEvaluator, self).__init__()
+self._java_obj = self._new_java_obj(
+"org.apache.spark.ml.evaluation.ClusteringEvaluator", self.uid)
+self._setDefault(predictionCol="prediction", featuresCol="features",
--- End diff --

I sent #19262 to fix the same issue for the other evaluators; please feel free to comment. Thanks.


---




[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...

2017-09-17 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19204#discussion_r139312388
  
--- Diff: python/pyspark/ml/evaluation.py ---
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
 kwargs = self._input_kwargs
 return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+Evaluator for Clustering results, which expects two input
+columns: prediction and features.
+
+>>> from sklearn import datasets
+>>> from pyspark.sql.types import *
+>>> from pyspark.ml.linalg import Vectors, VectorUDT
+>>> from pyspark.ml.evaluation import ClusteringEvaluator
+...
+>>> iris = datasets.load_iris()
+>>> iris_rows = [(Vectors.dense(x), int(iris.target[i]))
+... for i, x in enumerate(iris.data)]
+>>> schema = StructType([
+...StructField("features", VectorUDT(), True),
+...StructField("cluster_id", IntegerType(), True)])
--- End diff --

```cluster_id``` -> ```prediction``` to emphasize that this column holds the predicted value, not the ground truth.
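The point generalizes: the silhouette metric needs only the features and the *predicted* cluster ids, never ground-truth labels. A minimal pure-Python sketch of the idea (an illustration with plain Euclidean distance, not the Spark implementation, and it assumes no singleton clusters):

```python
import math

def silhouette(points, clusters):
    """Mean silhouette over all points, using Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def mean_dist(p, members):
        return sum(dist(p, q) for q in members) / len(members)

    scores = []
    for i, (p, c) in enumerate(zip(points, clusters)):
        # a: mean distance to the other members of p's own cluster
        own = [q for j, q in enumerate(points) if clusters[j] == c and j != i]
        a = mean_dist(p, own)
        # b: mean distance to the nearest other cluster
        b = min(mean_dist(p, [q for j, q in enumerate(points) if clusters[j] == o])
                for o in set(clusters) if o != c)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters score close to the maximum of 1.0.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assert 0.8 < silhouette(points, [0, 0, 1, 1]) <= 1.0
```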


---




[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...

2017-09-17 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19204#discussion_r139312199
  
--- Diff: python/pyspark/ml/evaluation.py ---
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
 kwargs = self._input_kwargs
 return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+Evaluator for Clustering results, which expects two input
+columns: prediction and features.
+
+>>> from sklearn import datasets
+>>> from pyspark.sql.types import *
+>>> from pyspark.ml.linalg import Vectors, VectorUDT
+>>> from pyspark.ml.evaluation import ClusteringEvaluator
+...
+>>> iris = datasets.load_iris()
+>>> iris_rows = [(Vectors.dense(x), int(iris.target[i]))
+... for i, x in enumerate(iris.data)]
+>>> schema = StructType([
+...StructField("features", VectorUDT(), True),
+...StructField("cluster_id", IntegerType(), True)])
+>>> rdd = spark.sparkContext.parallelize(iris_rows)
+>>> dataset = spark.createDataFrame(rdd, schema)
+...
+>>> evaluator = ClusteringEvaluator(predictionCol="cluster_id")
+>>> evaluator.evaluate(dataset)
+0.656...
+>>> ce_path = temp_path + "/ce"
+>>> evaluator.save(ce_path)
+>>> evaluator2 = ClusteringEvaluator.load(ce_path)
+>>> str(evaluator2.getPredictionCol())
+'cluster_id'
+
+.. versionadded:: 2.3.0
+"""
+metricName = Param(Params._dummy(), "metricName",
+   "metric name in evaluation (silhouette)",
+   typeConverter=TypeConverters.toString)
+
+@keyword_only
+def __init__(self, predictionCol="prediction", featuresCol="features",
+ metricName="silhouette"):
+"""
+__init__(self, predictionCol="prediction", featuresCol="features", \
+ metricName="silhouette")
+"""
+super(ClusteringEvaluator, self).__init__()
+self._java_obj = self._new_java_obj(
+"org.apache.spark.ml.evaluation.ClusteringEvaluator", self.uid)
+self._setDefault(predictionCol="prediction", featuresCol="features",
--- End diff --

Remove the default-value settings for ```predictionCol``` and ```featuresCol```, as they are already set in ```HasPredictionCol``` and ```HasFeaturesCol```.
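A toy stand-in (not the real pyspark classes) for the shared-param mixin pattern being referenced: the mixin's `__init__` registers the default, so the evaluator's `__init__` need not repeat it.

```python
# Sketch of the HasPredictionCol idea: the mixin owns the default.
class HasPredictionColSketch(object):
    def __init__(self):
        super(HasPredictionColSketch, self).__init__()
        self._defaults = getattr(self, "_defaults", {})
        self._defaults["predictionCol"] = "prediction"

class ClusteringEvaluatorSketch(HasPredictionColSketch):
    def __init__(self):
        super(ClusteringEvaluatorSketch, self).__init__()
        # No repeated _setDefault(predictionCol=...) here: the mixin did it.

evaluator = ClusteringEvaluatorSketch()
assert evaluator._defaults["predictionCol"] == "prediction"
```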


---




[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...

2017-09-17 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19204#discussion_r139312034
  
--- Diff: python/pyspark/ml/evaluation.py ---
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
 kwargs = self._input_kwargs
 return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+Evaluator for Clustering results, which expects two input
+columns: prediction and features.
+
+>>> from sklearn import datasets
+>>> from pyspark.sql.types import *
+>>> from pyspark.ml.linalg import Vectors, VectorUDT
+>>> from pyspark.ml.evaluation import ClusteringEvaluator
+...
+>>> iris = datasets.load_iris()
--- End diff --

Please don't involve other libraries unless necessary. The doctest here is meant to show new users how to use ```ClusteringEvaluator```, so we should focus on the evaluator and keep it as simple as possible. You can refer to the other evaluators for how to construct a simple dataset.
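In that spirit, a hedged sketch of a self-contained toy dataset (plain tuples in place of pyspark `Vectors`; the values are illustrative, not from the final doctest):

```python
# Two well-separated clusters as (features, prediction) rows -- small enough
# to inline in a doctest, with no external library such as sklearn needed.
rows = [((0.0, 0.0), 0), ((1.0, 1.0), 0),
        ((9.0, 8.0), 1), ((8.0, 9.0), 1)]
predictions = [p for _, p in rows]
assert set(predictions) == {0, 1}
```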


---




[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...

2017-09-17 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19204#discussion_r139312046
  
--- Diff: python/pyspark/ml/evaluation.py ---
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
 kwargs = self._input_kwargs
 return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+Evaluator for Clustering results, which expects two input
+columns: prediction and features.
+
+>>> from sklearn import datasets
+>>> from pyspark.sql.types import *
+>>> from pyspark.ml.linalg import Vectors, VectorUDT
+>>> from pyspark.ml.evaluation import ClusteringEvaluator
--- End diff --

Remove this; it's not necessary.


---




[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...

2017-09-13 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19204#discussion_r138763970
  
--- Diff: python/pyspark/ml/evaluation.py ---
@@ -328,6 +329,87 @@ def setParams(self, predictionCol="prediction", labelCol="label",
 kwargs = self._input_kwargs
 return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+Evaluator for Clustering results, which expects two input
+columns: prediction and features.
+
+>>> from sklearn import datasets
+>>> from pyspark.sql.types import *
+>>> from pyspark.ml.linalg import Vectors, VectorUDT
+>>> from pyspark.ml.evaluation import ClusteringEvaluator
+...
+>>> iris = datasets.load_iris()
+>>> iris_rows = [(Vectors.dense(x), int(iris.target[i]))
+... for i, x in enumerate(iris.data)]
+>>> schema = StructType([
+...StructField("features", VectorUDT(), True),
+...StructField("cluster_id", IntegerType(), True)])
+>>> rdd = spark.sparkContext.parallelize(iris_rows)
+>>> dataset = spark.createDataFrame(rdd, schema)
+...
+>>> evaluator = ClusteringEvaluator(predictionCol="cluster_id")
+>>> evaluator.evaluate(dataset)
+0.656...
+>>> ce_path = temp_path + "/ce"
+>>> evaluator.save(ce_path)
+>>> evaluator2 = ClusteringEvaluator.load(ce_path)
+>>> str(evaluator2.getPredictionCol())
+'cluster_id'
+
+.. versionadded:: 2.3.0
+"""
+metricName = Param(Params._dummy(), "metricName",
+   "metric name in evaluation "
+   "(silhouette)",
--- End diff --

Since the string spans multiple lines, we should use """ instead of adjacent "" literals; otherwise, move it onto a single line.
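The concern above stems from Python's implicit concatenation of adjacent string literals; a small sketch of the equivalent forms (the strings are illustrative):

```python
# Adjacent literals inside parentheses are joined at compile time, so a
# Param description split across lines works, but only if done deliberately.
desc_split = ("metric name in evaluation "
              "(silhouette)")
desc_single = "metric name in evaluation (silhouette)"
assert desc_split == desc_single
```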


---




[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...

2017-09-12 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/19204

[SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator

## What changes were proposed in this pull request?

Added Python interface for ClusteringEvaluator

## How was this patch tested?

Manual test, e.g. the example Python code in the comments.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-21981

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19204.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19204


commit 31b3c6c7e1298a1b4bf1fc969cee50534970ab0a
Author: Marco Gaido 
Date:   2017-09-05T17:22:21Z

Added python interface for ClusteringEvaluator




---
