[GitHub] spark pull request #21563: [SPARK-24557][ML] ClusteringEvaluator support arr...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21563 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21563: [SPARK-24557][ML] ClusteringEvaluator support arr...
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21563#discussion_r197611456

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -107,15 +106,18 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
       @Since("2.3.0")
       override def evaluate(dataset: Dataset[_]): Double = {
    -    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol))
         SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
    +    val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol))
    +    val df = dataset.select(col($(predictionCol)),
    --- End diff --

we can propose that
[GitHub] spark pull request #21563: [SPARK-24557][ML] ClusteringEvaluator support arr...
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21563#discussion_r197600500

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -107,15 +106,18 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
       @Since("2.3.0")
       override def evaluate(dataset: Dataset[_]): Double = {
    -    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol))
         SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
    +    val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol))
    +    val df = dataset.select(col($(predictionCol)),
    --- End diff --

@mgaido91 I think it may be nice to first add a name getter for `Column`.
[GitHub] spark pull request #21563: [SPARK-24557][ML] ClusteringEvaluator support arr...
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21563#discussion_r195664838

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -107,15 +106,18 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
       @Since("2.3.0")
       override def evaluate(dataset: Dataset[_]): Double = {
    -    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol))
         SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
    +    val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol))
    +    val df = dataset.select(col($(predictionCol)),
    --- End diff --

We have the new column we are returning, so we can easily get its name with `.name`.
[GitHub] spark pull request #21563: [SPARK-24557][ML] ClusteringEvaluator support arr...
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21563#discussion_r195618344

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -107,15 +106,18 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
       @Since("2.3.0")
       override def evaluate(dataset: Dataset[_]): Double = {
    -    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol))
         SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
    +    val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol))
    +    val df = dataset.select(col($(predictionCol)),
    --- End diff --

@mgaido91 Thanks for reviewing! I have considered this, but there is a problem: if we want to attach metadata to the transformed column inside `DatasetUtils.columnToVector` (for example with the method `.as(alias: String, metadata: Metadata)`), how can we get the name of the transformed column? The only way I know to do this is:

```
val metadata = ...
val vectorCol = ..
val vectorName = dataset.select(vectorCol)
  .schema.head.name
vectorCol.as(vectorName, metadata)
```
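The workaround described in this comment can be tried outside of Spark's internals. The following is a minimal sketch, not the code from the pull request: `ColumnNameSketch`, `arrayToVector`, `withMetadata`, `demo`, and the `"source"` metadata key are hypothetical names introduced for illustration, and the UDF is only a stand-in for whatever `DatasetUtils.columnToVector` does internally. It relies on the public `Column.as(alias, metadata)` and `MetadataBuilder` APIs.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{Metadata, MetadataBuilder, StructField}

object ColumnNameSketch {
  // Hypothetical stand-in for an array-to-vector conversion (assumption,
  // not the implementation of DatasetUtils.columnToVector).
  private val arrayToVector = udf { (xs: Seq[Double]) => Vectors.dense(xs.toArray) }

  // Resolve the name Spark assigns to the derived column by selecting it,
  // then re-alias the column under that same name with the metadata attached.
  def withMetadata(dataset: DataFrame, vectorCol: Column, metadata: Metadata): Column = {
    val vectorName = dataset.select(vectorCol).schema.head.name
    vectorCol.as(vectorName, metadata)
  }

  // Builds a one-row dataset and returns the schema field of the derived column.
  def demo(spark: SparkSession): StructField = {
    import spark.implicits._
    val dataset = Seq((Array(1.0, 2.0), 0)).toDF("features", "prediction")
    val metadata = new MetadataBuilder().putString("source", "array").build()
    val named = withMetadata(dataset, arrayToVector(col("features")), metadata)
    dataset.select(named).schema.head
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("column-name-sketch")
      .getOrCreate()
    val field = demo(spark)
    println(s"${field.name} -> ${field.metadata}")
    spark.stop()
  }
}
```

The extra `select` only resolves the schema and does not run a job, but it is still an indirect way to learn a column's name, which is the cost the comment points out.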
[GitHub] spark pull request #21563: [SPARK-24557][ML] ClusteringEvaluator support arr...
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21563#discussion_r195389157

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -107,15 +106,18 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
       @Since("2.3.0")
       override def evaluate(dataset: Dataset[_]): Double = {
    -    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol))
         SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
    +    val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol))
    +    val df = dataset.select(col($(predictionCol)),
    --- End diff --

I'm not sure this is the right way. We will probably face the same issue everywhere we use `DatasetUtils.columnToVector`, so it is probably better to fix the problem there.
[GitHub] spark pull request #21563: [SPARK-24557][ML] ClusteringEvaluator support arr...
GitHub user zhengruifeng opened a pull request:

    https://github.com/apache/spark/pull/21563

    [SPARK-24557][ML] ClusteringEvaluator support array input

    ## What changes were proposed in this pull request?
    Make ClusteringEvaluator support array input.

    ## How was this patch tested?
    Added tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark clu_eval_support_array

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21563.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21563

commit b126bd4f410ab4a01bbe7a980042704ea7420c6f
Author: 郑瑞峰
Date:   2018-06-14T08:15:43Z

    init pr
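To illustrate what "array input" means for a caller, here is a minimal usage sketch. It assumes a Spark build that includes this change; `ArrayInputSketch`, `silhouette`, and the toy data are hypothetical and chosen only for illustration. The features column is a plain `array<double>` rather than an ML `Vector`.

```scala
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.SparkSession

object ArrayInputSketch {
  // Evaluates a tiny, hand-labelled clustering whose features are arrays.
  def silhouette(spark: SparkSession): Double = {
    import spark.implicits._

    // Two well-separated groups; "features" is array<double>, not a Vector.
    val df = Seq(
      (Array(0.0, 0.1), 0),
      (Array(0.2, 0.0), 0),
      (Array(9.0, 9.1), 1),
      (Array(9.2, 9.0), 1)
    ).toDF("features", "prediction")

    val evaluator = new ClusteringEvaluator()
      .setFeaturesCol("features")
      .setPredictionCol("prediction")

    // Assumption: on a build without this change, the schema check in
    // evaluate would reject the array column instead of returning a score.
    evaluator.evaluate(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("clu-eval-array-sketch")
      .getOrCreate()
    println(s"silhouette = ${silhouette(spark)}")
    spark.stop()
  }
}
```

Without this support, callers would have to convert the array column to a `Vector` column themselves before calling `evaluate`.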