[GitHub] spark pull request #19993: [SPARK-22799][ML] Bucketizer should throw excepti...
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19993
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r163562784

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
@@ -401,15 +390,24 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     }
   }

-  test("Both inputCol and inputCols are set") {
-    val bucket = new Bucketizer()
-      .setInputCol("feature1")
-      .setOutputCol("result")
-      .setSplits(Array(-0.5, 0.0, 0.5))
-      .setInputCols(Array("feature1", "feature2"))
-
-    // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
-    assert(bucket.isBucketizeMultipleColumns() == false)
+  test("assert exception is thrown if both multi-column and single-column params are set") {
+    val df = Seq((0.5, 0.3), (0.5, -0.4)).toDF("feature1", "feature2")
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+      ("inputCols", Array("feature1", "feature2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+      ("outputCol", "result1"), ("splits", Array(-0.5, 0.0, 0.5)),
+      ("outputCols", Array("result1", "result2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+      ("outputCol", "result1"), ("splits", Array(-0.5, 0.0, 0.5)),
+      ("splitsArray", Array(Array(-0.5, 0.0, 0.5), Array(-0.5, 0.0, 0.5))))
+
+    // this should fail because at least one of inputCol and inputCols must be set
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "feature1"),
+      ("splits", Array(-0.5, 0.0, 0.5)))
+
+    // the following should fail because not all the params are set
--- End diff --

Technically here we should probably also test the `inputCols` + `outputCols` case (i.e. that not setting `splitsArray` also throws an exception).
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r163561075

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -20,8 +20,11 @@ package org.apache.spark.ml.param

 import java.io.{ByteArrayOutputStream, ObjectOutputStream}

 import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.{Estimator, Transformer}
 import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasInputCols, HasOutputCol, HasOutputCols}
--- End diff --

I don't think these are used any longer?
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162955686

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
@@ -401,15 +390,14 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     }
   }

-  test("Both inputCol and inputCols are set") {
-    val bucket = new Bucketizer()
-      .setInputCol("feature1")
-      .setOutputCol("result")
-      .setSplits(Array(-0.5, 0.0, 0.5))
-      .setInputCols(Array("feature1", "feature2"))
-
-    // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
-    assert(bucket.isBucketizeMultipleColumns() == false)
+  test("assert exception is thrown if both multi-column and single-column params are set") {
+    val df = Seq((0.5, 0.3), (0.5, -0.4)).toDF("feature1", "feature2")
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+      ("inputCols", Array("feature1", "feature2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "result1"),
+      ("outputCols", Array("result1", "result2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("splits", Array(-0.5, 0.0, 0.5)),
--- End diff --

@MLnick actually it will fail for both reasons. We can add more test cases to check each of these two cases if you think it is needed.
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162940665

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
@@ -401,15 +390,14 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     }
   }

-  test("Both inputCol and inputCols are set") {
-    val bucket = new Bucketizer()
-      .setInputCol("feature1")
-      .setOutputCol("result")
-      .setSplits(Array(-0.5, 0.0, 0.5))
-      .setInputCols(Array("feature1", "feature2"))
-
-    // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
-    assert(bucket.isBucketizeMultipleColumns() == false)
+  test("assert exception is thrown if both multi-column and single-column params are set") {
+    val df = Seq((0.5, 0.3), (0.5, -0.4)).toDF("feature1", "feature2")
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+      ("inputCols", Array("feature1", "feature2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "result1"),
+      ("outputCols", Array("result1", "result2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("splits", Array(-0.5, 0.0, 0.5)),
--- End diff --

Only comment I have is that I believe this line is not testing what you may think. As I read the [checkSingleVsMultiColumnParams](https://github.com/apache/spark/pull/19993/files#diff-72f95a0938e5a140d5126f06bdc381a6R266) method, in this test case it will throw the error, _not_ because both `splits` and `splitsArray` are set, but rather because both `inputCol` & `inputCols` are _unset_. Actually it applies to the line above too. @jkbradley
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162747183

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -201,9 +184,13 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   override def transformSchema(schema: StructType): StructType = {
-    if (isBucketizeMultipleColumns()) {
+    ParamValidators.checkExclusiveParams(this, "inputCol", "inputCols")
--- End diff --

I see. I'll see if I can come up with something which is generic but handles these other checks.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162719519

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -201,9 +184,13 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   override def transformSchema(schema: StructType): StructType = {
-    if (isBucketizeMultipleColumns()) {
+    ParamValidators.checkExclusiveParams(this, "inputCol", "inputCols")
--- End diff --

my initial implementation (with @hhbyyh's comments) was more generic and checked what you said. Afterwards, @MLnick and @viirya asked to switch to a more generic approach, which is the current one you see. I'm fine with either of those, but I think we need to choose one way and go in that direction, otherwise we just lose time.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162717263

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -166,6 +167,8 @@ private[ml] object Param {
 @DeveloperApi
 object ParamValidators {
+
+  private val LOGGER = LoggerFactory.getLogger(ParamValidators.getClass)
--- End diff --

Let's switch this to use the Logging trait, to match other MLlib patterns.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162717142

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -201,9 +184,13 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   override def transformSchema(schema: StructType): StructType = {
-    if (isBucketizeMultipleColumns()) {
+    ParamValidators.checkExclusiveParams(this, "inputCol", "inputCols")
--- End diff --

The problem with trying to use a general method like this is that it's hard to capture model-specific requirements. This currently misses checking to make sure that exactly one (not just <= 1) of each pair is available, plus that all of the single-column OR all of the multi-column Params are available. (The same issue occurs in https://github.com/apache/spark/pull/20146 ) It will also be hard to check these items and account for defaults. I'd argue that it's not worth trying to use generic checking functions here.
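The rule jkbradley describes — exactly one of each exclusive pair set, and the set params all belonging to the same single-column or multi-column group — can be sketched outside of Spark. Below is a minimal, hypothetical Python rendering of that stricter check (the function name and the dict-based param representation are illustrative, not Spark API):

```python
def check_single_vs_multi_column(params, single_group, multi_group):
    """Validate that exactly one of the two param groups is fully set.

    `params` maps param name -> value; a missing key or None means unset.
    Raises ValueError when both groups are (even partially) set, when
    neither group is set, or when the chosen group is only partially set.
    """
    single_set = [p for p in single_group if params.get(p) is not None]
    multi_set = [p for p in multi_group if params.get(p) is not None]
    if single_set and multi_set:
        raise ValueError(
            f"Single-column params {single_set} and multi-column params "
            f"{multi_set} are mutually exclusive; set only one group.")
    if not single_set and not multi_set:
        raise ValueError(
            "Either the single-column or the multi-column params must be set.")
    # Whichever group was chosen must be fully set.
    group = single_group if single_set else multi_group
    chosen = single_set if single_set else multi_set
    missing = [p for p in group if p not in chosen]
    if missing:
        raise ValueError(f"Params {missing} must also be set.")
```

With Bucketizer-style groups (`inputCol`/`outputCol`/`splits` vs `inputCols`/`outputCols`/`splitsArray`) this rejects the mixed and partially-set configurations discussed in this thread; as noted in the comment above, accounting for params with defaults would still need extra, model-specific care.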
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162044954

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

@MLnick Yes. I didn't test the method posted here. The model possibly doesn't have the params, so we need to check it with `model.hasParam`. Please use the method in #20146.
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162043704

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

@viirya your actual method in #20146 is slightly different (see [here](https://github.com/apache/spark/pull/20146/files#diff-72f95a0938e5a140d5126f06bdc381a6R254)). Is that the best version to use?
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162042318

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

I think @viirya's method is simpler and more general, so why not use it?
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r161699705

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

@MLnick @viirya in order to address https://github.com/apache/spark/pull/19993/files#r161682506, I was thinking of leaving this method as it is (just renaming it, as per @viirya's suggestion) and only adding an `additionalExclusiveParams: (String, String)*` argument to the function. WDYT?
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r161685200

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

If this method looks good to you, maybe you can just copy it from #20146 to use here.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r161684970

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

Based on https://github.com/apache/spark/pull/20146#issuecomment-357068054 from @WeichenXu123, I think #20146 cannot get merged for 2.3.
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r161681586

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

I am not sure if #20146 will get merged for `2.3` - but I think we must merge this PR for `2.3` because I'd prefer not to have this inconsistency in param error handling between `QuantileDiscretizer` and `Bucketizer`. This is a relatively small change, so we can merge it into the branch if we move quickly.
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r161682506

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
@@ -401,15 +390,9 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     }
   }

-  test("Both inputCol and inputCols are set") {
-    val bucket = new Bucketizer()
-      .setInputCol("feature1")
-      .setOutputCol("result")
-      .setSplits(Array(-0.5, 0.0, 0.5))
-      .setInputCols(Array("feature1", "feature2"))
-
-    // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
-    assert(bucket.isBucketizeMultipleColumns() == false)
+  test("assert exception is thrown if both multi-column and single-column params are set") {
--- End diff --

We should also test the other exclusive params (input cols and splits params) as per https://github.com/apache/spark/pull/19993/files#r159133936
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r160143590

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

I think we can use that method once merged, thanks.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r159579102

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

I added this method too in #20146.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r159133936

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

If we need to check other exclusive params, e.g., `inputCol` and `splitsArray` or `inputCols` and `splits`, why not just have a method like:

```scala
def checkExclusiveParams(model: Params, params: String*): Unit = {
  if (params.filter(model.isSet(_)).size > 1) {
    val paramString = params.mkString("`", "`, `", "`")
    throw new IllegalArgumentException(
      s"$paramString are exclusive, but more than one among them are set.")
  }
}

ParamValidators.checkExclusiveParams(this, "inputCol", "inputCols")
ParamValidators.checkExclusiveParams(this, "outputCol", "outputCols")
ParamValidators.checkExclusiveParams(this, "inputCol", "splitsArray")
ParamValidators.checkExclusiveParams(this, "inputCols", "splits")
```
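Stripped of the Spark types, the proposal above is simply an "at most one param in this group may be set" guard. A hypothetical Python rendering of the same check, for illustration only (the names are not Spark API, and the real method would resolve `Param` objects rather than strings):

```python
def check_exclusive_params(set_params, *names):
    """Raise if more than one name in `names` appears in `set_params`,
    the collection of param names the caller has explicitly set."""
    conflicting = [n for n in names if n in set_params]
    if len(conflicting) > 1:
        joined = ", ".join(f"`{n}`" for n in conflicting)
        raise ValueError(
            f"{joined} are exclusive, but more than one among them are set.")
```

Calling it once per exclusive pair keeps the validator generic; the trade-off raised elsewhere in the thread is that a pairwise check like this cannot express "exactly one must be set" or "all params of one group must be set together".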
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r159100390

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -430,4 +433,45 @@ object ParamsSuite extends SparkFunSuite {
     require(copyReturnType === obj.getClass,
       s"${clazz.getName}.copy should return ${clazz.getName} instead of ${copyReturnType.getName}.")
   }
+
+  /**
+   * Checks that the class throws an exception in case both `inputCols` and `inputCol` are set and
+   * in case both `outputCols` and `outputCol` are set.
+   * These checks are performed only whether the class extends respectively both `HasInputCols` and
--- End diff --

"only whether" -> "only when"
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r159100191

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
+  }
+
+  private[spark] def raiseIncompatibleParamsException(
+      paramName1: String,
+      paramName2: String): Unit = {
+    throw new IllegalArgumentException(s"`$paramName1` and `$paramName2` cannot be both set.")
--- End diff --

"cannot be both" -> "cannot both be"
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r159099688

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -137,18 +137,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   /**
    * Determines whether this `Bucketizer` is going to map multiple columns. If and only if
    * `inputCols` is set, it will map multiple columns. Otherwise, it just maps a column specified
-   * by `inputCol`. A warning will be printed if both are set.
+   * by `inputCol`.
    */
   private[feature] def isBucketizeMultipleColumns(): Boolean = {
-    if (isSet(inputCols) && isSet(inputCol)) {
-      logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
-        "`Bucketizer` only map one column specified by `inputCol`")
-      false
-    } else if (isSet(inputCols)) {
-      true
-    } else {
-      false
-    }
+    isSet(inputCols)
--- End diff --

Seems superfluous to now have a separate method for this
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r159100299

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
@@ -401,15 +401,9 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     }
   }

-  test("Both inputCol and inputCols are set") {
-    val bucket = new Bucketizer()
-      .setInputCol("feature1")
-      .setOutputCol("result")
-      .setSplits(Array(-0.5, 0.0, 0.5))
-      .setInputCols(Array("feature1", "feature2"))
-
-    // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
-    assert(bucket.isBucketizeMultipleColumns() == false)
+  test("assert exception is thrown is both multi-column and single-column params are set") {
--- End diff --

"is thrown is both" -> "is thrown if both"
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r158344862

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,31 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasInputCol if m.isSet(m.outputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("outputCols", "inputCol")
+      case m: HasInputCols with HasOutputCol if m.isSet(m.inputCols) && m.isSet(m.outputCol) =>
--- End diff --

Sorry to miss it, but I just found that FeatureHasher has both InputCols and OutputCol. I think we can remove the case and the one above since they can be too strict.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r158170154

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -430,4 +433,49 @@ object ParamsSuite extends SparkFunSuite {
     require(copyReturnType === obj.getClass,
       s"${clazz.getName}.copy should return ${clazz.getName} instead of ${copyReturnType.getName}.")
   }
+
+  /**
+   * Checks that the class throws an exception in case both `inputCols` and `inputCol` are set and
+   * in case both `outputCols` and `outputCol` are set.
+   * These checks are performed only whether the class extends respectively both `HasInputCols` and
+   * `HasInputCol` and both `HasOutputCols` and `HasOutputCol`.
+   *
+   * @param paramsClass The Class to be checked
+   * @param spark A `SparkSession` instance to use
+   */
+  def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = {
+    import spark.implicits._
+    // create fake input Dataset
+    val feature1 = Array(-1.0, 0.0, 1.0)
+    val feature2 = Array(1.0, 0.0, -1.0)
+    val df = feature1.zip(feature2).toSeq.toDF("feature1", "feature2")
--- End diff --

Send column names as parameter if necessary. We need to ensure the test utility can be used by most transformers with multiple column support.

> I think that the type check is performed later, thus it is not a problem here.

I don't quite get it; in either `transform` or `fit` the data type will be checked and they will trigger exceptions.

> I preferred to use paramsClass: Class[_ <: Params]

I'm only thinking about the case that the default constructor is not sufficient to create a working Estimator/Transformer. If that's not a concern, then reflection is OK.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r158158375

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -430,4 +433,49 @@ object ParamsSuite extends SparkFunSuite {
     require(copyReturnType === obj.getClass,
       s"${clazz.getName}.copy should return ${clazz.getName} instead of ${copyReturnType.getName}.")
   }
+
+  /**
+   * Checks that the class throws an exception in case both `inputCols` and `inputCol` are set and
+   * in case both `outputCols` and `outputCol` are set.
+   * These checks are performed only whether the class extends respectively both `HasInputCols` and
+   * `HasInputCol` and both `HasOutputCols` and `HasOutputCol`.
+   *
+   * @param paramsClass The Class to be checked
+   * @param spark A `SparkSession` instance to use
+   */
+  def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = {
+    import spark.implicits._
+    // create fake input Dataset
+    val feature1 = Array(-1.0, 0.0, 1.0)
+    val feature2 = Array(1.0, 0.0, -1.0)
+    val df = feature1.zip(feature2).toSeq.toDF("feature1", "feature2")
--- End diff --

The reason why I created the DataFrame inside the method was to control the names of its columns; otherwise we can't ensure that those columns exist. I think that the type check is performed later, thus it is not a problem here. What do you think?

I preferred to use `paramsClass: Class[_ <: Params]` because I need a clean instance for each of the two checks: if an instance is passed in, I cannot enforce that it is clean, i.e. that no parameters were already set, and I would need to copy it to create new instances anyway, since otherwise the second check would be influenced by the first one. What do you think? Thanks.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r158154050

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -430,4 +433,49 @@ object ParamsSuite extends SparkFunSuite {
     require(copyReturnType === obj.getClass,
       s"${clazz.getName}.copy should return ${clazz.getName} instead of ${copyReturnType.getName}.")
   }
+
+  /**
+   * Checks that the class throws an exception in case both `inputCols` and `inputCol` are set and
+   * in case both `outputCols` and `outputCol` are set.
+   * These checks are performed only whether the class extends respectively both `HasInputCols` and
+   * `HasInputCol` and both `HasOutputCols` and `HasOutputCol`.
+   *
+   * @param paramsClass The Class to be checked
+   * @param spark A `SparkSession` instance to use
+   */
+  def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = {
+    import spark.implicits._
+    // create fake input Dataset
+    val feature1 = Array(-1.0, 0.0, 1.0)
+    val feature2 = Array(1.0, 0.0, -1.0)
+    val df = feature1.zip(feature2).toSeq.toDF("feature1", "feature2")
--- End diff --

I don't think the DataFrame here can be used for other transformers like StringIndexer. How about adding a Dataset as a function parameter? And is it possible to use an instance `obj: Params` rather than `paramsClass: Class[_ <: Params]` as the parameter, just to be more flexible?
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r158153277

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -430,4 +433,49 @@ object ParamsSuite extends SparkFunSuite {
     require(copyReturnType === obj.getClass,
       s"${clazz.getName}.copy should return ${clazz.getName} instead of ${copyReturnType.getName}.")
   }
+
+  /**
+   * Checks that the class throws an exception in case both `inputCols` and `inputCol` are set and
+   * in case both `outputCols` and `outputCol` are set.
+   * These checks are performed only whether the class extends respectively both `HasInputCols` and
+   * `HasInputCol` and both `HasOutputCols` and `HasOutputCol`.
+   *
+   * @param paramsClass The Class to be checked
+   * @param spark A `SparkSession` instance to use
+   */
+  def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = {
--- End diff --

Suggestion for the function name: checkMultiColumnParams --> testMultiColumnParams
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r158153048

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,31 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def assertColOrCols(model: Params): Unit = {
--- End diff --

Suggestion for the function name: assertColOrCols --> checkMultiColumnParams
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157870176

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,29 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  def assertColOrCols(model: Params): Unit = {
--- End diff --

This should be `private[spark]`.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157871042

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,29 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  def assertColOrCols(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasInputCol if m.isSet(m.outputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("outputCols", "inputCol")
+      case m: HasInputCols with HasOutputCol if m.isSet(m.inputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("inputCols", "outputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
+  }
+
+  def raiseIncompatibleParamsException(paramName1: String, paramName2: String): Unit = {
+    throw new IllegalArgumentException(s"Both `$paramName1` and `$paramName2` are set.")
--- End diff --

The error message could be more straightforward, e.g. "`$paramName1` and `$paramName2` cannot be set simultaneously."
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157867496

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -137,18 +137,17 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   /**
    * Determines whether this `Bucketizer` is going to map multiple columns. If and only if
    * `inputCols` is set, it will map multiple columns. Otherwise, it just maps a column specified
-   * by `inputCol`. A warning will be printed if both are set.
+   * by `inputCol`. An exception will be thrown if both are set.
    */
   private[feature] def isBucketizeMultipleColumns(): Boolean = {
-    if (isSet(inputCols) && isSet(inputCol)) {
-      logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
-        "`Bucketizer` only map one column specified by `inputCol`")
-      false
-    } else if (isSet(inputCols)) {
-      true
-    } else {
-      false
+    ParamValidators.assertColOrCols(this)
+    if (isSet(inputCol) && isSet(splitsArray)) {
--- End diff --

I noticed `isBucketizeMultipleColumns` is invoked in many places; maybe we can put the checks elsewhere, e.g. in `transformSchema`. That would also keep the code consistent with the function name.
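The shape hhbyyh suggests — keep `isBucketizeMultipleColumns` a pure predicate and validate once at the schema-checking entry point — is a common pattern. A hedged Python sketch (class and method names are hypothetical illustrations, not the actual Spark API):

```python
class ToyBucketizer:
    """Toy model of a transformer with single- and multi-column modes."""

    def __init__(self, **params):
        self.params = params

    def _is_multi_column(self):
        # Pure predicate with no side effects, safe to call from many places.
        return "inputCols" in self.params

    def transform_schema(self, schema):
        # Single validation chokepoint, mirroring transformSchema: every
        # transform/fit path goes through here exactly once.
        if "inputCol" in self.params and "inputCols" in self.params:
            raise ValueError(
                "`inputCols` and `inputCol` cannot be set simultaneously.")
        if self._is_multi_column():
            return schema + self.params.get("outputCols", [])
        return schema + [self.params.get("outputCol", "output")]
```

Centralizing the check means the predicate can be queried freely during setup without triggering exceptions halfway through configuring the params.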
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157870214

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,29 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  def assertColOrCols(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasInputCol if m.isSet(m.outputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("outputCols", "inputCol")
+      case m: HasInputCols with HasOutputCol if m.isSet(m.inputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("inputCols", "outputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
+  }
+
+  def raiseIncompatibleParamsException(paramName1: String, paramName2: String): Unit = {
--- End diff --

This should be `private[spark]`.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157869596

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,29 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  def assertColOrCols(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasInputCol if m.isSet(m.outputCols) && m.isSet(m.inputCol) =>
--- End diff --

This may not necessarily be an error for some classes, but we can keep it for now.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157684803

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
    * by `inputCol`. A warning will be printed if both are set.
--- End diff --

We should update the documentation above too.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157414340

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
    * by `inputCol`. A warning will be printed if both are set.
    */
   private[feature] def isBucketizeMultipleColumns(): Boolean = {
-    if (isSet(inputCols) && isSet(inputCol)) {
-      logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
-        "`Bucketizer` only map one column specified by `inputCol`")
-      false
+    if (isSet(inputCols) && isSet(inputCol) || isSet(inputCols) && isSet(outputCol) ||
+        isSet(inputCol) && isSet(outputCols)) {
+      throw new IllegalArgumentException("Both `inputCol` and `inputCols` are set, `Bucketizer` " +
+        "only supports setting either `inputCol` or `inputCols`.")
--- End diff --

Thanks, I will add these checks while doing the changes requested by @hhbyyh.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157393913

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
    * by `inputCol`. A warning will be printed if both are set.
    */
   private[feature] def isBucketizeMultipleColumns(): Boolean = {
-    if (isSet(inputCols) && isSet(inputCol)) {
-      logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
-        "`Bucketizer` only map one column specified by `inputCol`")
-      false
+    if (isSet(inputCols) && isSet(inputCol) || isSet(inputCols) && isSet(outputCol) ||
+        isSet(inputCol) && isSet(outputCols)) {
+      throw new IllegalArgumentException("Both `inputCol` and `inputCols` are set, `Bucketizer` " +
+        "only supports setting either `inputCol` or `inputCols`.")
--- End diff --

It would be better to add one more check here: in single-column mode, only the `splits` param should be set; in multi-column mode, only the `splitsArray` param should be set.
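The extra check WeichenXu123 asks for pairs each column mode with its matching splits parameter. A minimal Python sketch of the intended rule (a hypothetical dict-based param map, not the real Bucketizer API):

```python
def check_bucketizer_params(params):
    """Single-column mode (`inputCol`) must use `splits`; multi-column mode
    (`inputCols`) must use `splitsArray`. Mixing the modes raises."""
    if "inputCol" in params and "splitsArray" in params:
        raise ValueError("`inputCol` requires `splits`, not `splitsArray`.")
    if "inputCols" in params and "splits" in params:
        raise ValueError("`inputCols` requires `splitsArray`, not `splits`.")

# A consistent single-column configuration passes silently.
check_bucketizer_params({"inputCol": "f", "splits": [-0.5, 0.0, 0.5]})
```

This is the same shape the final test names in the review thread (`testExclusiveParams` with `splits` vs. `splitsArray` pairs) exercise.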
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/19993

[SPARK-22799][ML] Bucketizer should throw exception if single- and multi-column params are both set

## What changes were proposed in this pull request?

Currently there is a mixed situation where both single- and multi-column params are supported: in some cases exceptions are thrown, in others only a warning log is emitted. In the discussion at https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049, the decision was to throw an exception. This PR throws an exception in `Bucketizer` instead of logging a warning.

## How was this patch tested?

Modified UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22799

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19993.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19993

commit 8f3581cf780dabdbf27ffd7924edf8c344d69b30
Author: Marco Gaido
Date: 2017-12-15T18:11:18Z

    [SPARK-22799][ML] Bucketizer should throw exception if single- and multi-column params are both set
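The behavioral change the PR description summarizes — fail fast instead of warning and silently ignoring `inputCols` — can be contrasted in a few lines. A Python sketch with hypothetical helper names (a dict stands in for the param map):

```python
import warnings

def resolve_columns_old(params):
    # Pre-PR behaviour: warn and silently prefer inputCol over inputCols.
    if "inputCol" in params and "inputCols" in params:
        warnings.warn("Both `inputCol` and `inputCols` are set; "
                      "ignoring `inputCols`.")
        return [params["inputCol"]]
    return params.get("inputCols", [params.get("inputCol")])

def resolve_columns_new(params):
    # Post-PR behaviour: an ambiguous configuration is an error.
    if "inputCol" in params and "inputCols" in params:
        raise ValueError(
            "`inputCol` and `inputCols` cannot be set simultaneously.")
    return params.get("inputCols", [params.get("inputCol")])
```

The silent-preference variant can mask user mistakes deep inside a pipeline, which is why the linked SPARK-8418 discussion settled on raising an exception.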