[GitHub] spark pull request #21563: [SPARK-24557][ML] ClusteringEvaluator support arr...

2018-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21563


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org




2018-06-23 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21563#discussion_r197611456
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
@@ -107,15 +106,18 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
 
   @Since("2.3.0")
   override def evaluate(dataset: Dataset[_]): Double = {
-    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
+    SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol))
     SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
 
+    val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol))
+    val df = dataset.select(col($(predictionCol)),
--- End diff --

we can propose that



2018-06-22 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/21563#discussion_r197600500
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
@@ -107,15 +106,18 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
 
   @Since("2.3.0")
   override def evaluate(dataset: Dataset[_]): Double = {
-    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
+    SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol))
     SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
 
+    val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol))
+    val df = dataset.select(col($(predictionCol)),
--- End diff --

@mgaido91 I think it may be nice to first add a name getter for `Column`.



2018-06-15 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21563#discussion_r195664838
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
@@ -107,15 +106,18 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
 
   @Since("2.3.0")
   override def evaluate(dataset: Dataset[_]): Double = {
-    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
+    SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol))
     SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
 
+    val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol))
+    val df = dataset.select(col($(predictionCol)),
--- End diff --

We have the new column we are returning, so we can easily get its name with `.name`.



2018-06-14 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/21563#discussion_r195618344
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
@@ -107,15 +106,18 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
 
   @Since("2.3.0")
   override def evaluate(dataset: Dataset[_]): Double = {
-    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
+    SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol))
     SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
 
+    val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol))
+    val df = dataset.select(col($(predictionCol)),
--- End diff --

@mgaido91 Thanks for reviewing!
I have considered this, but there is a problem: if we want to attach 
metadata to the transformed column inside `DatasetUtils.columnToVector` 
(for example with `.as(alias: String, metadata: Metadata)`), how can we get 
the name of the transformed column?
The only way I know to do this is:
```
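val metadata = ...   // metadata to attach to the transformed column
val vectorCol = ...  // the transformed column
// Resolve the generated name by selecting the column and reading the schema.
val vectorName = dataset.select(vectorCol).schema.head.name
vectorCol.as(vectorName, metadata)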
```
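
The workaround above can be written out as a small self-contained sketch. The object name `NamedColumnWithMetadata`, the helper `withMetadata`, and the sample data are hypothetical and exist only for illustration; only `Dataset.select`, `Column.as(alias, metadata)`, and `MetadataBuilder` are existing Spark APIs.

```scala
import org.apache.spark.sql.{Column, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

object NamedColumnWithMetadata {
  // Hypothetical helper (not part of Spark): resolve the name Spark assigns to `c`
  // when it is selected from `dataset`, then re-alias `c` under that same name
  // with `metadata` attached.
  def withMetadata(dataset: Dataset[_], c: Column, metadata: Metadata): Column = {
    val resolvedName = dataset.select(c).schema.head.name
    c.as(resolvedName, metadata)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Any derived column works here; upper(...) just stands in for a transformation.
    val df = Seq("a", "b").toDF("word")
    val metadata = new MetadataBuilder().putString("origin", "sketch").build()
    val transformed = withMetadata(df, upper(col("word")), metadata)

    val field = df.select(transformed).schema.head
    println(s"${field.name} -> ${field.metadata.json}")
    spark.stop()
  }
}
```

The extra `select` only analyzes the plan to read the schema; it does not run a job, but it does mean the name is resolved against a specific dataset rather than from the `Column` alone.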



2018-06-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21563#discussion_r195389157
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
@@ -107,15 +106,18 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
 
   @Since("2.3.0")
   override def evaluate(dataset: Dataset[_]): Double = {
-    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
+    SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol))
     SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
 
+    val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol))
+    val df = dataset.select(col($(predictionCol)),
--- End diff --

I am not sure this is the right way. We will probably face the same issue 
everywhere we use `DatasetUtils.columnToVector`, so it is probably better to 
fix the problem there.
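
One way to read "fix the problem there" is a conversion helper that itself returns the column under its original name and metadata, so callers need no alias step. The sketch below is a hypothetical stand-in and not Spark's actual `DatasetUtils.columnToVector`; the object and method names are invented for illustration, and it assumes the input column is either an ML vector or an array of doubles or floats.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.{Column, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType}

object ColumnToVectorSketch {
  // Hypothetical helper (an assumption, not Spark's implementation): convert
  // `colName` to an ML vector column and keep the original name and metadata.
  def columnToVectorKeepingName(dataset: Dataset[_], colName: String): Column = {
    val field = dataset.schema(colName)
    val converted: Column = field.dataType match {
      case VectorType =>
        // Already a vector column: nothing to convert.
        col(colName)
      case ArrayType(DoubleType, _) =>
        udf((xs: scala.collection.Seq[Double]) => Vectors.dense(xs.toArray))
          .apply(col(colName))
      case ArrayType(FloatType, _) =>
        udf((xs: scala.collection.Seq[Float]) => Vectors.dense(xs.map(_.toDouble).toArray))
          .apply(col(colName))
      case other =>
        throw new IllegalArgumentException(s"Unsupported type for column $colName: $other")
    }
    // Alias back to the input name so the caller never sees a generated UDF name.
    converted.as(colName, field.metadata)
  }
}
```

With a helper shaped like this, a caller such as `evaluate` could select the returned column directly and still refer to it by `$(featuresCol)` afterwards.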



2018-06-14 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request:

https://github.com/apache/spark/pull/21563

[SPARK-24557][ML] ClusteringEvaluator support array input

## What changes were proposed in this pull request?
Make `ClusteringEvaluator` accept an array-typed features column in addition to a vector column.

## How was this patch tested?
added tests
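
As an illustration of the intended behavior, the following sketch evaluates a clustering whose features are stored as `array<double>` rather than as an ML vector. It assumes a Spark build that includes this change; on a build without it, the schema check on the features column is expected to reject the array type. The object name, method name, and sample data are hypothetical.

```scala
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.SparkSession

object ArrayInputSketch {
  // Hypothetical example: compute the silhouette score on array-typed features.
  def silhouetteOnArrays(spark: SparkSession): Double = {
    import spark.implicits._

    // Two well-separated groups; `prediction` holds the assigned cluster ids.
    val df = Seq(
      (Array(0.0, 0.1), 0),
      (Array(0.2, 0.0), 0),
      (Array(9.0, 9.1), 1),
      (Array(9.2, 9.0), 1)
    ).toDF("features", "prediction")

    val evaluator = new ClusteringEvaluator()
      .setFeaturesCol("features")
      .setPredictionCol("prediction")

    evaluator.evaluate(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("array-input").getOrCreate()
    println(s"silhouette = ${silhouetteOnArrays(spark)}")
    spark.stop()
  }
}
```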

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhengruifeng/spark clu_eval_support_array

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21563.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21563


commit b126bd4f410ab4a01bbe7a980042704ea7420c6f
Author: 郑瑞峰 
Date:   2018-06-14T08:15:43Z

init pr



