[GitHub] spark pull request: [SPARK-5957][ML] better handling of parameters
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92594483 Merged into master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5431
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92565573 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30212/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92565550 [Test build #30212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30212/consoleFull) for PR 5431 at commit [`d19236d`](https://github.com/apache/spark/commit/d19236dd12d57ced30374e2b1ac22b720b5c42d8).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92552866 [Test build #30210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30210/consoleFull) for PR 5431 at commit [`26ae2d7`](https://github.com/apache/spark/commit/26ae2d7898fa598e6fef8a91721992f1d7c793e5).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92552876 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30210/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92549813 [Test build #30212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30212/consoleFull) for PR 5431 at commit [`d19236d`](https://github.com/apache/spark/commit/d19236dd12d57ced30374e2b1ac22b720b5c42d8).
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92548691 LGTM once tests pass
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92544002 [Test build #30210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30210/consoleFull) for PR 5431 at commit [`26ae2d7`](https://github.com/apache/spark/commit/26ae2d7898fa598e6fef8a91721992f1d7c793e5).
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28291065

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {

   /**
    * Sets a parameter (by name) in the embedded param map.
    */
-  private[ml] def set(param: String, value: Any): this.type = {
+  protected final def set(param: String, value: Any): this.type = {
     set(getParam(param), value)
   }

   /**
-   * Gets the value of a parameter in the embedded param map.
+   * Optionally returns the user-supplied value of a param.
+   */
+  final def get[T](param: Param[T]): Option[T] = {
--- End diff --

OK, that sounds fine, including making clear() non-public
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28290995

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -55,58 +49,42 @@ class Param[T] (
    */
   def ->(value: T): ParamPair[T] = ParamPair(this, value)

-  override def toString: String = {
-    if (defaultValue.isDefined) {
-      s"$name: $doc (default: ${defaultValue.get})"
-    } else {
-      s"$name: $doc"
-    }
-  }
+  override def toString: String = s"$name: $doc"
--- End diff --

I like having the default. Your call on the current value; that might actually be confusing.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28290778

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -325,7 +379,7 @@ class ParamMap private[ml] (private val map: mutable.Map[Param[Any], Any]) exten
   }

   /**
-   * Make a copy of this param map.
+   * Creates a copy of this param map.
    */
   def copy: ParamMap = new ParamMap(map.clone())
--- End diff --

`()` is used if a method changes the content of the instance or triggers IO. The other exception is when there are overridden methods and we have to use `()` in Scala. For this case, I think `copy` without `()` is correct.
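The convention mengxr describes can be sketched outside Spark. The `Counter` class below is hypothetical (not Spark code): a state-mutating method is declared with `()`, while a pure accessor, like `ParamMap.copy`, is declared without.

```scala
import scala.collection.mutable

class Counter {
  private val counts = mutable.Map.empty[String, Int]

  // Mutates internal state, so it is declared (and called) with ().
  def increment(key: String): Unit = {
    counts(key) = counts.getOrElse(key, 0) + 1
  }

  // Pure accessor that only reads state: no parentheses, like ParamMap.copy.
  def snapshot: Map[String, Int] = counts.toMap
}
```

Callers then read `c.snapshot` like a field, which signals that the call is side-effect free.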
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28290818

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala ---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.param.shared
+
+import java.io.PrintWriter
+
+import scala.reflect.ClassTag
+
+/**
+ * Code generator for shared params (sharedParams.scala). Run under the Spark folder with
+ * {{{
+ *   build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen"
+ * }}}
+ */
+private[shared] object SharedParamsCodeGen {
+
+  def main(args: Array[String]): Unit = {
+    val params = Seq(
+      ParamDesc[Double]("regParam", "regularization parameter"),
+      ParamDesc[Int]("maxIter", "max number of iterations"),
+      ParamDesc[String]("featuresCol", "features column name", Some("\"features\"")),
+      ParamDesc[String]("labelCol", "label column name", Some("\"label\"")),
+      ParamDesc[String]("predictionCol", "prediction column name", Some("\"prediction\"")),
+      ParamDesc[String]("rawPredictionCol", "raw prediction (a.k.a. confidence) column name",
+        Some("\"rawPrediction\"")),
+      ParamDesc[String]("probabilityCol",
+        "column name for predicted class conditional probabilities", Some("\"probability\"")),
+      ParamDesc[Double]("threshold", "threshold in prediction"),
--- End diff --

done
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28290695

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {

   /**
    * Sets a parameter (by name) in the embedded param map.
    */
-  private[ml] def set(param: String, value: Any): this.type = {
+  protected final def set(param: String, value: Any): this.type = {
     set(getParam(param), value)
   }

   /**
-   * Gets the value of a parameter in the embedded param map.
+   * Optionally returns the user-supplied value of a param.
+   */
+  final def get[T](param: Param[T]): Option[T] = {
--- End diff --

I made this change in my last commit. `set` cannot be public because `Model` also extends `Params` and some params should be immutable in models. Maybe `clear` shouldn't be public either. But for get, getDefault, and getOrDefault, I think those could be public.
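The visibility split mengxr argues for can be illustrated with a stripped-down sketch. `MiniParams`, `MiniEstimator`, and `MiniModel` below are hypothetical stand-ins (string-keyed, untyped), not the actual Spark classes: `set` stays protected so a model subclass can simply not expose a setter, while `get` is safely public.

```scala
import scala.collection.mutable

class MiniParams {
  private val paramMap = mutable.Map.empty[String, Any]

  // Protected: each subclass decides which typed setters (if any) to expose.
  protected final def set(name: String, value: Any): this.type = {
    paramMap(name) = value
    this
  }

  // Public: reading a param value is always safe.
  final def get(name: String): Option[Any] = paramMap.get(name)
}

class MiniEstimator extends MiniParams {
  // Typed public setter delegating to the protected generic set.
  def setMaxIter(value: Int): this.type = set("maxIter", value)
}

// Exposes no setter at all, so its params are effectively immutable.
class MiniModel extends MiniParams
```

With this shape, callers can always inspect `get("maxIter")` on either class, but only `MiniEstimator` offers a way to change it.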
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28290532

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {

   /**
    * Sets a parameter (by name) in the embedded param map.
    */
-  private[ml] def set(param: String, value: Any): this.type = {
+  protected final def set(param: String, value: Any): this.type = {
     set(getParam(param), value)
   }

   /**
-   * Gets the value of a parameter in the embedded param map.
+   * Optionally returns the user-supplied value of a param.
+   */
+  final def get[T](param: Param[T]): Option[T] = {
+    shouldOwn(param)
+    paramMap.get(param)
+  }
+
+  /**
+   * Clears the user-supplied value for the input param.
+   */
+  final def clear(param: Param[_]): this.type = {
+    shouldOwn(param)
+    paramMap.remove(param)
+    this
+  }
+
+  /**
+   * Gets the value of a param in the embedded param map or its default value. Throws an exception
+   * if neither is set.
    */
-  protected def get[T](param: Param[T]): T = {
-    require(param.parent.eq(this))
-    paramMap(param)
+  final def getOrDefault[T](param: Param[T]): T = {
+    shouldOwn(param)
+    get(param).orElse(getDefault(param)).get
   }

   /**
-   * Internal param map.
+   * Sets a default value. Make sure that the input param is initialized before this gets called.
    */
-  protected val paramMap: ParamMap = ParamMap.empty
+  protected final def setDefault[T](param: Param[T], value: T): this.type = {
+    shouldOwn(param)
+    defaultParamMap.put(param, value)
+    this
+  }

   /**
-   * Check whether the given schema contains an input column.
-   * @param colName  Input column name
-   * @param dataType  Input column DataType
+   * Sets default values. Make sure that the input params are initialized before this gets called.
    */
-  protected def checkInputColumn(schema: StructType, colName: String, dataType: DataType): Unit = {
-    val actualDataType = schema(colName).dataType
-    require(actualDataType.equals(dataType), s"Input column $colName must be of type $dataType" +
-      s" but was actually $actualDataType. Column param description: ${getParam(colName)}")
+  protected final def setDefault(paramPairs: ParamPair[_]*): this.type = {
+    paramPairs.foreach { p =>
+      setDefault(p.param.asInstanceOf[Param[Any]], p.value)
+    }
+    this
   }

   /**
-   * Add an output column to the given schema.
-   * This fails if the given output column already exists.
-   * @param schema  Initial schema (not modified)
-   * @param colName  Output column name. If this column name is an empy String "", this method
-   *                 returns the initial schema, unchanged. This allows users to disable output
-   *                 columns.
-   * @param dataType  Output column DataType
-   */
-  protected def addOutputColumn(
-      schema: StructType,
-      colName: String,
-      dataType: DataType): StructType = {
-    if (colName.length == 0) return schema
-    val fieldNames = schema.fieldNames
-    require(!fieldNames.contains(colName), s"Output column $colName already exists.")
-    val outputFields = schema.fields ++ Seq(StructField(colName, dataType, nullable = false))
-    StructType(outputFields)
+   * Gets the default value of a parameter.
+   */
+  final def getDefault[T](param: Param[T]): Option[T] = {
+    shouldOwn(param)
+    defaultParamMap.get(param)
+  }
+
+  /**
+   * Tests whether the input param has a default value set.
+   */
+  final def hasDefault[T](param: Param[T]): Boolean = {
+    shouldOwn(param)
+    defaultParamMap.contains(param)
+  }
+
+  /**
+   * Extracts the embedded default param values and user-supplied values, and then merges them with
+   * extra values from input into a flat param map, where the latter value is used if there exist
+   * conflicts.
--- End diff --

done
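The two-map lookup behind `getOrDefault` in the diff above can be illustrated with a stripped-down sketch. `DefaultingParams` below is hypothetical (string-keyed maps instead of typed `Param` objects, no ownership checks), not the Spark implementation: a user-supplied value shadows the default, and a value missing from both maps is an error.

```scala
import scala.collection.mutable

class DefaultingParams {
  private val userMap = mutable.Map.empty[String, Any]
  private val defaultParamMap = mutable.Map.empty[String, Any]

  def set(name: String, value: Any): this.type = { userMap(name) = value; this }

  def setDefault(name: String, value: Any): this.type = {
    defaultParamMap(name) = value
    this
  }

  // User-supplied value wins; fall back to the default; fail if neither is set.
  def getOrDefault(name: String): Any =
    userMap.get(name).orElse(defaultParamMap.get(name)).getOrElse(
      throw new NoSuchElementException(s"Param $name has no value and no default"))
}
```

Clearing a user value (as `clear` does in the real code) would simply remove the entry from `userMap`, making the default visible again.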
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28290414

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {

   /**
    * Sets a parameter (by name) in the embedded param map.
    */
-  private[ml] def set(param: String, value: Any): this.type = {
+  protected final def set(param: String, value: Any): this.type = {
     set(getParam(param), value)
   }

   /**
-   * Gets the value of a parameter in the embedded param map.
+   * Optionally returns the user-supplied value of a param.
+   */
+  final def get[T](param: Param[T]): Option[T] = {
+    shouldOwn(param)
+    paramMap.get(param)
+  }
+
+  /**
+   * Clears the user-supplied value for the input param.
+   */
+  final def clear(param: Param[_]): this.type = {
+    shouldOwn(param)
+    paramMap.remove(param)
+    this
+  }
+
+  /**
+   * Gets the value of a param in the embedded param map or its default value. Throws an exception
+   * if neither is set.
    */
-  protected def get[T](param: Param[T]): T = {
-    require(param.parent.eq(this))
-    paramMap(param)
+  final def getOrDefault[T](param: Param[T]): T = {
+    shouldOwn(param)
+    get(param).orElse(getDefault(param)).get
   }

   /**
-   * Internal param map.
+   * Sets a default value. Make sure that the input param is initialized before this gets called.
--- End diff --

done
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28290419

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {

   /**
    * Sets a parameter (by name) in the embedded param map.
    */
-  private[ml] def set(param: String, value: Any): this.type = {
+  protected final def set(param: String, value: Any): this.type = {
     set(getParam(param), value)
   }

   /**
-   * Gets the value of a parameter in the embedded param map.
+   * Optionally returns the user-supplied value of a param.
+   */
+  final def get[T](param: Param[T]): Option[T] = {
+    shouldOwn(param)
+    paramMap.get(param)
+  }
+
+  /**
+   * Clears the user-supplied value for the input param.
+   */
+  final def clear(param: Param[_]): this.type = {
+    shouldOwn(param)
+    paramMap.remove(param)
+    this
+  }
+
+  /**
+   * Gets the value of a param in the embedded param map or its default value. Throws an exception
+   * if neither is set.
    */
-  protected def get[T](param: Param[T]): T = {
-    require(param.parent.eq(this))
-    paramMap(param)
+  final def getOrDefault[T](param: Param[T]): T = {
+    shouldOwn(param)
+    get(param).orElse(getDefault(param)).get
   }

   /**
-   * Internal param map.
+   * Sets a default value. Make sure that the input param is initialized before this gets called.
    */
-  protected val paramMap: ParamMap = ParamMap.empty
+  protected final def setDefault[T](param: Param[T], value: T): this.type = {
+    shouldOwn(param)
+    defaultParamMap.put(param, value)
+    this
+  }

   /**
-   * Check whether the given schema contains an input column.
-   * @param colName  Input column name
-   * @param dataType  Input column DataType
+   * Sets default values. Make sure that the input params are initialized before this gets called.
--- End diff --

done
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28289791

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -55,58 +49,42 @@ class Param[T] (
    */
   def ->(value: T): ParamPair[T] = ParamPair(this, value)

-  override def toString: String = {
-    if (defaultValue.isDefined) {
-      s"$name: $doc (default: ${defaultValue.get})"
-    } else {
-      s"$name: $doc"
-    }
-  }
+  override def toString: String = s"$name: $doc"
--- End diff --

I thought about this option but I'm not sure whether we should put default value and current value in `Param.toString` because they are stored in the parent. But no strong preference.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28289673

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -78,23 +81,42 @@ class ParamsSuite extends FunSuite {
   }

   test("params") {
--- End diff --

done
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28289697

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -17,15 +17,16 @@

 package org.apache.spark.ml.classification

-import org.apache.spark.annotation.{DeveloperApi, AlphaComponent}
+import org.apache.spark.annotation.{AlphaComponent, DeveloperApi}
 import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor, PredictorParams}
-import org.apache.spark.ml.param.{Params, ParamMap, HasRawPredictionCol}
+import org.apache.spark.ml.param.{ParamMap, Params}
+import org.apache.spark.ml.param.shared.HasRawPredictionCol
+import org.apache.spark.ml.util.SchemaUtils
 import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
-import org.apache.spark.sql.functions._
 import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types.{DataType, DoubleType, StructType}
-
--- End diff --

No strong preference on this one :)
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92513221 @mengxr I added minor comments but don't have major ones. That's it for now!
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28283755

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala ---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.param.shared
+
+import java.io.PrintWriter
+
+import scala.reflect.ClassTag
+
+/**
+ * Code generator for shared params (sharedParams.scala). Run under the Spark folder with
+ * {{{
+ *   build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen"
+ * }}}
+ */
+private[shared] object SharedParamsCodeGen {
+
+  def main(args: Array[String]): Unit = {
+    val params = Seq(
+      ParamDesc[Double]("regParam", "regularization parameter"),
+      ParamDesc[Int]("maxIter", "max number of iterations"),
+      ParamDesc[String]("featuresCol", "features column name", Some("\"features\"")),
+      ParamDesc[String]("labelCol", "label column name", Some("\"label\"")),
+      ParamDesc[String]("predictionCol", "prediction column name", Some("\"prediction\"")),
+      ParamDesc[String]("rawPredictionCol", "raw prediction (a.k.a. confidence) column name",
+        Some("\"rawPrediction\"")),
+      ParamDesc[String]("probabilityCol",
+        "column name for predicted class conditional probabilities", Some("\"probability\"")),
+      ParamDesc[Double]("threshold", "threshold in prediction"),
--- End diff --

"threshold in prediction" --> "threshold in binary classification"
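The generator in the diff above works by string templating: each `ParamDesc` is rendered into the source of one shared-param trait. A toy version shows the idea; `ParamDescSketch`, `CodeGenSketch`, and the emitted trait body are hypothetical simplifications (the real SharedParamsCodeGen derives typed `Param` subclasses via ClassTag and writes the result to sharedParams.scala).

```scala
case class ParamDescSketch(name: String, doc: String, defaultSrc: Option[String] = None)

object CodeGenSketch {
  // Render the source code of one shared-param trait from its description.
  def genTrait(p: ParamDescSketch): String = {
    val Name = p.name.capitalize
    // Only emit a setDefault line when the description carries a default.
    val setDefaultLine =
      p.defaultSrc.map(d => s"  setDefault(${p.name}, $d)\n").getOrElse("")
    s"""trait Has$Name extends Params {
       |  final val ${p.name}: Param[_] = new Param(this, "${p.name}", "${p.doc}")
       |$setDefaultLine  final def get$Name: Any = getOrDefault(${p.name})
       |}""".stripMargin
  }
}
```

Running the generator over the full `Seq` of descriptions and concatenating the results would produce a single generated source file, which is exactly why hand-editing sharedParams.scala is discouraged.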
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28283689

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -325,7 +379,7 @@ class ParamMap private[ml] (private val map: mutable.Map[Param[Any], Any]) exten
   }

   /**
-   * Make a copy of this param map.
+   * Creates a copy of this param map.
    */
   def copy: ParamMap = new ParamMap(map.clone())
--- End diff --

Do we like copy methods to use parentheses or not? I think we're not completely consistent.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28283381

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable {

   /**
    * Sets a parameter (by name) in the embedded param map.
    */
-  private[ml] def set(param: String, value: Any): this.type = {
+  protected final def set(param: String, value: Any): this.type = {
     set(getParam(param), value)
   }

   /**
-   * Gets the value of a parameter in the embedded param map.
+   * Optionally returns the user-supplied value of a param.
+   */
+  final def get[T](param: Param[T]): Option[T] = {
--- End diff --

Why is get public and set private? Shouldn't both be public (which might be handy for programmatic setting of parameters, such as from config files) or both be private (since implementing classes will provide getters/setters anyway)?
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28282819 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala --- @@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable { /** * Sets a parameter (by name) in the embedded param map. */ - private[ml] def set(param: String, value: Any): this.type = { + protected final def set(param: String, value: Any): this.type = { set(getParam(param), value) } /** - * Gets the value of a parameter in the embedded param map. + * Optionally returns the user-supplied value of a param. + */ + final def get[T](param: Param[T]): Option[T] = { +shouldOwn(param) +paramMap.get(param) + } + + /** + * Clears the user-supplied value for the input param. + */ + final def clear(param: Param[_]): this.type = { +shouldOwn(param) +paramMap.remove(param) +this + } + + /** + * Gets the value of a param in the embedded param map or its default value. Throws an exception + * if neither is set. */ - protected def get[T](param: Param[T]): T = { -require(param.parent.eq(this)) -paramMap(param) + final def getOrDefault[T](param: Param[T]): T = { +shouldOwn(param) +get(param).orElse(getDefault(param)).get } /** - * Internal param map. + * Sets a default value. Make sure that the input param is initialized before this gets called. */ - protected val paramMap: ParamMap = ParamMap.empty + protected final def setDefault[T](param: Param[T], value: T): this.type = { +shouldOwn(param) +defaultParamMap.put(param, value) +this + } /** - * Check whether the given schema contains an input column. - * @param colName Input column name - * @param dataType Input column DataType + * Sets default values. Make sure that the input params are initialized before this gets called. 
*/ - protected def checkInputColumn(schema: StructType, colName: String, dataType: DataType): Unit = { -val actualDataType = schema(colName).dataType -require(actualDataType.equals(dataType), s"Input column $colName must be of type $dataType" + - s" but was actually $actualDataType. Column param description: ${getParam(colName)}") + protected final def setDefault(paramPairs: ParamPair[_]*): this.type = { +paramPairs.foreach { p => + setDefault(p.param.asInstanceOf[Param[Any]], p.value) +} +this } /** - * Add an output column to the given schema. - * This fails if the given output column already exists. - * @param schema Initial schema (not modified) - * @param colName Output column name. If this column name is an empy String "", this method - * returns the initial schema, unchanged. This allows users to disable output - * columns. - * @param dataType Output column DataType - */ - protected def addOutputColumn( - schema: StructType, - colName: String, - dataType: DataType): StructType = { -if (colName.length == 0) return schema -val fieldNames = schema.fieldNames -require(!fieldNames.contains(colName), s"Output column $colName already exists.") -val outputFields = schema.fields ++ Seq(StructField(colName, dataType, nullable = false)) -StructType(outputFields) + * Gets the default value of a parameter. + */ + final def getDefault[T](param: Param[T]): Option[T] = { +shouldOwn(param) +defaultParamMap.get(param) + } + + /** + * Tests whether the input param has a default value set. + */ + final def hasDefault[T](param: Param[T]): Boolean = { +shouldOwn(param) +defaultParamMap.contains(param) + } + + /** + * Extracts the embedded default param values and user-supplied values, and then merges them with + * extra values from input into a flat param map, where the latter value is used if there exist + * conflicts. --- End diff -- This is a bit confusing. 
I like specifying the priority ordering:

```
Priority ordering: default params < user-supplied params < extraParamMap
```
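The ordering suggested above maps directly onto how Scala's immutable `Map ++` resolves duplicate keys (the right-hand side wins). A minimal sketch, with a hypothetical `ParamMerge` helper and plain Maps standing in for Spark's ParamMap:

```scala
// Sketch of the merge priority: defaults < user-supplied < extra.
// Map ++ keeps the right-hand value on duplicate keys, which encodes
// exactly the "latter value wins" rule the doc describes.
object ParamMerge {
  def extract(
      defaults: Map[String, Any],
      userSupplied: Map[String, Any],
      extra: Map[String, Any]): Map[String, Any] =
    defaults ++ userSupplied ++ extra
}
```

For example, with a default `maxIter -> 100` and a user-supplied `maxIter -> 50`, the merged map carries 50; an extra map entry would override both.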
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28282538 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala --- @@ -55,58 +49,42 @@ class Param[T] ( */ def ->(value: T): ParamPair[T] = ParamPair(this, value) - override def toString: String = { -if (defaultValue.isDefined) { - s"$name: $doc (default: ${defaultValue.get})" -} else { - s"$name: $doc" -} - } + override def toString: String = s"$name: $doc" --- End diff -- Why not print default using parent?
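The reviewer's suggestion, having `toString` look the default up through the parent, can be sketched with simplified stand-in types. Both classes below are hypothetical, not the Spark API; a plain Map plays the parent's default param map:

```scala
// Sketch: a Param that asks its parent for the default when rendering toString.
class Parent(defaults: Map[String, Any]) {
  def getDefault(name: String): Option[Any] = defaults.get(name)
}

class Param(parent: Parent, name: String, doc: String) {
  // render the default only when the parent actually has one set
  override def toString: String = parent.getDefault(name) match {
    case Some(d) => s"$name: $doc (default: $d)"
    case None    => s"$name: $doc"
  }
}
```

This keeps the behavior of the deleted `defaultValue`-based version while sourcing the default from the parent, where the PR moved default storage.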
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28282519 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala --- @@ -17,15 +17,16 @@ package org.apache.spark.ml.classification -import org.apache.spark.annotation.{DeveloperApi, AlphaComponent} +import org.apache.spark.annotation.{AlphaComponent, DeveloperApi} import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor, PredictorParams} -import org.apache.spark.ml.param.{Params, ParamMap, HasRawPredictionCol} +import org.apache.spark.ml.param.{ParamMap, Params} +import org.apache.spark.ml.param.shared.HasRawPredictionCol +import org.apache.spark.ml.util.SchemaUtils import org.apache.spark.mllib.linalg.{Vector, VectorUDT} -import org.apache.spark.sql.functions._ import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ import org.apache.spark.sql.types.{DataType, DoubleType, StructType} - --- End diff -- People's opinions about this differ : ) (See the DataFrames PR)
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28282554 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala --- @@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable { /** * Sets a parameter (by name) in the embedded param map. */ - private[ml] def set(param: String, value: Any): this.type = { + protected final def set(param: String, value: Any): this.type = { set(getParam(param), value) } /** - * Gets the value of a parameter in the embedded param map. + * Optionally returns the user-supplied value of a param. + */ + final def get[T](param: Param[T]): Option[T] = { +shouldOwn(param) +paramMap.get(param) + } + + /** + * Clears the user-supplied value for the input param. + */ + final def clear(param: Param[_]): this.type = { +shouldOwn(param) +paramMap.remove(param) +this + } + + /** + * Gets the value of a param in the embedded param map or its default value. Throws an exception + * if neither is set. */ - protected def get[T](param: Param[T]): T = { -require(param.parent.eq(this)) -paramMap(param) + final def getOrDefault[T](param: Param[T]): T = { +shouldOwn(param) +get(param).orElse(getDefault(param)).get } /** - * Internal param map. + * Sets a default value. Make sure that the input param is initialized before this gets called. */ - protected val paramMap: ParamMap = ParamMap.empty + protected final def setDefault[T](param: Param[T], value: T): this.type = { +shouldOwn(param) +defaultParamMap.put(param, value) +this + } /** - * Check whether the given schema contains an input column. - * @param colName Input column name - * @param dataType Input column DataType + * Sets default values. Make sure that the input params are initialized before this gets called. --- End diff -- ditto: move to parameter-specific doc
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28282550 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala --- @@ -179,52 +179,96 @@ trait Params extends Identifiable with Serializable { /** * Sets a parameter (by name) in the embedded param map. */ - private[ml] def set(param: String, value: Any): this.type = { + protected final def set(param: String, value: Any): this.type = { set(getParam(param), value) } /** - * Gets the value of a parameter in the embedded param map. + * Optionally returns the user-supplied value of a param. + */ + final def get[T](param: Param[T]): Option[T] = { +shouldOwn(param) +paramMap.get(param) + } + + /** + * Clears the user-supplied value for the input param. + */ + final def clear(param: Param[_]): this.type = { +shouldOwn(param) +paramMap.remove(param) +this + } + + /** + * Gets the value of a param in the embedded param map or its default value. Throws an exception + * if neither is set. */ - protected def get[T](param: Param[T]): T = { -require(param.parent.eq(this)) -paramMap(param) + final def getOrDefault[T](param: Param[T]): T = { +shouldOwn(param) +get(param).orElse(getDefault(param)).get } /** - * Internal param map. + * Sets a default value. Make sure that the input param is initialized before this gets called. --- End diff -- "Make sure that the input param is initialized before this gets called." will be unclear to new developers. Maybe move this to parameter-specific doc:

```
@param param: Make sure that this param is initialized before this method gets called.
```
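The `getOrDefault` contract shown in the diff (user-supplied value first, then the default, otherwise fail) reduces to two chained Option lookups. A self-contained sketch with hypothetical names, passing the two values in directly rather than reading them from param maps:

```scala
// Sketch of getOrDefault: user value wins, default is the fallback,
// and the lookup fails loudly when neither is set.
object GetOrDefault {
  def getOrDefault[T](user: Option[T], default: Option[T], name: String): T =
    user.orElse(default).getOrElse(
      throw new NoSuchElementException(s"Param $name has no value or default set."))
}
```

Note that the diff's `get(param).orElse(getDefault(param)).get` throws a bare `NoSuchElementException` on an empty Option; spelling the failure out with `getOrElse` gives the caller a message naming the offending param.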
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28282562 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala --- @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.util + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.sql.types.{DataType, StructField, StructType} + +/** + * :: DeveloperApi :: + * Utils for handling schemas. + */ +@DeveloperApi +object SchemaUtils { + + // TODO: Move the utility methods to SQL. + + /** + * Check whether the given schema contains an column of the required data type. + * @param colName column name + * @param dataType required column data type + */ + def checkColumnType(schema: StructType, colName: String, dataType: DataType): Unit = { +val actualDataType = schema(colName).dataType +require(actualDataType.equals(dataType), + s"Column $colName must be of type $dataType but was actually $actualDataType.") + } + + /** + * Appends a new column to the input schema. This fails if the given output column already exists. + * @param schema input schema + * @param colName new column name. 
If this column name is en empty string "", this method returns --- End diff -- typo: "en" -> "an"
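The `checkColumnType` contract under review can be approximated without Spark's StructType. In this sketch a plain Map from column name to type name stands in for the schema, and `SchemaCheck` is a hypothetical helper, not the Spark class:

```scala
// Sketch of SchemaUtils.checkColumnType with a Map standing in for StructType.
object SchemaCheck {
  /** Fails with a descriptive message unless colName exists with the required type. */
  def checkColumnType(
      schema: Map[String, String], colName: String, dataType: String): Unit = {
    val actual = schema.getOrElse(colName,
      throw new IllegalArgumentException(s"Column $colName does not exist."))
    // require throws IllegalArgumentException with the message on mismatch
    require(actual == dataType,
      s"Column $colName must be of type $dataType but was actually $actual.")
  }
}
```

As in the real utility, a matching column passes silently and a mismatch fails with a message naming both the required and the actual type.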
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28282559 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala --- @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.util + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.sql.types.{DataType, StructField, StructType} + +/** + * :: DeveloperApi :: + * Utils for handling schemas. + */ +@DeveloperApi +object SchemaUtils { + + // TODO: Move the utility methods to SQL. + + /** + * Check whether the given schema contains an column of the required data type. --- End diff -- typo: "an" --> "a"
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/5431#discussion_r28282508 --- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala --- @@ -78,23 +81,42 @@ class ParamsSuite extends FunSuite { } test("params") { --- End diff -- Add test for "params.clear"
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92509721 Should this method in Params be made abstract?

```
def validate(paramMap: ParamMap): Unit = {}
```

I just realized we haven't been using it, and making it abstract would force users to think about it.
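Making `validate` abstract, as proposed, would look roughly like the simplified sketch below. The trait and class names are illustrative, and a plain Map stands in for ParamMap; the point is that every implementer is forced to write at least one cross-parameter check:

```scala
// Sketch of the suggestion: an abstract validate forces implementers
// to consider their param constraints instead of inheriting a no-op.
trait ValidatedParams {
  def validate(params: Map[String, Any]): Unit
}

class ElasticNetParams extends ValidatedParams {
  override def validate(params: Map[String, Any]): Unit = {
    // hypothetical constraint: elastic-net mixing parameter must lie in [0, 1]
    val alpha = params.getOrElse("elasticNetParam", 0.0).asInstanceOf[Double]
    require(alpha >= 0.0 && alpha <= 1.0,
      s"elasticNetParam must be in [0, 1] but got $alpha")
  }
}
```

With the no-op default (as in the PR), a class that forgets to override `validate` silently accepts any combination of params; the abstract form turns that omission into a compile error.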
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92255656 [Test build #30150 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30150/consoleFull) for PR 5431 at commit [`eec2264`](https://github.com/apache/spark/commit/eec2264ef8e49a0be5f2b7539c8bba6e4279c01a).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92255674 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30150/ Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5431#issuecomment-92230105 [Test build #30150 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30150/consoleFull) for PR 5431 at commit [`eec2264`](https://github.com/apache/spark/commit/eec2264ef8e49a0be5f2b7539c8bba6e4279c01a).