[GitHub] spark pull request #19993: [SPARK-22799][ML] Bucketizer should throw excepti...
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19993
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r163562784

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
@@ -401,15 +390,24 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     }
   }

-  test("Both inputCol and inputCols are set") {
-    val bucket = new Bucketizer()
-      .setInputCol("feature1")
-      .setOutputCol("result")
-      .setSplits(Array(-0.5, 0.0, 0.5))
-      .setInputCols(Array("feature1", "feature2"))
-
-    // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
-    assert(bucket.isBucketizeMultipleColumns() == false)
+  test("assert exception is thrown if both multi-column and single-column params are set") {
+    val df = Seq((0.5, 0.3), (0.5, -0.4)).toDF("feature1", "feature2")
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+      ("inputCols", Array("feature1", "feature2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+      ("outputCol", "result1"), ("splits", Array(-0.5, 0.0, 0.5)),
+      ("outputCols", Array("result1", "result2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+      ("outputCol", "result1"), ("splits", Array(-0.5, 0.0, 0.5)),
+      ("splitsArray", Array(Array(-0.5, 0.0, 0.5), Array(-0.5, 0.0, 0.5))))
+
+    // this should fail because at least one of inputCol and inputCols must be set
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "feature1"),
+      ("splits", Array(-0.5, 0.0, 0.5)))
+
+    // the following should fail because not all the params are set
--- End diff --

Technically here we should probably also test the `inputCols` + `outputCols` case (i.e. that not setting `splitsArray` also throws an exception).
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r163561075

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -20,8 +20,11 @@ package org.apache.spark.ml.param

 import java.io.{ByteArrayOutputStream, ObjectOutputStream}

 import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.{Estimator, Transformer}
 import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasInputCols, HasOutputCol, HasOutputCols}
--- End diff --

I don't think these are used any longer?
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162955686

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
@@ -401,15 +390,14 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     }
   }

-  test("Both inputCol and inputCols are set") {
-    val bucket = new Bucketizer()
-      .setInputCol("feature1")
-      .setOutputCol("result")
-      .setSplits(Array(-0.5, 0.0, 0.5))
-      .setInputCols(Array("feature1", "feature2"))
-
-    // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
-    assert(bucket.isBucketizeMultipleColumns() == false)
+  test("assert exception is thrown if both multi-column and single-column params are set") {
+    val df = Seq((0.5, 0.3), (0.5, -0.4)).toDF("feature1", "feature2")
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+      ("inputCols", Array("feature1", "feature2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "result1"),
+      ("outputCols", Array("result1", "result2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("splits", Array(-0.5, 0.0, 0.5)),
--- End diff --

@MLnick actually it will fail for both reasons. We can add more test cases to check each of these two cases if you think it is needed.
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162940665

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
@@ -401,15 +390,14 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     }
   }

-  test("Both inputCol and inputCols are set") {
-    val bucket = new Bucketizer()
-      .setInputCol("feature1")
-      .setOutputCol("result")
-      .setSplits(Array(-0.5, 0.0, 0.5))
-      .setInputCols(Array("feature1", "feature2"))
-
-    // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
-    assert(bucket.isBucketizeMultipleColumns() == false)
+  test("assert exception is thrown if both multi-column and single-column params are set") {
+    val df = Seq((0.5, 0.3), (0.5, -0.4)).toDF("feature1", "feature2")
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+      ("inputCols", Array("feature1", "feature2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "result1"),
+      ("outputCols", Array("result1", "result2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("splits", Array(-0.5, 0.0, 0.5)),
--- End diff --

Only comment I have is that I believe this line is not testing what you may think. As I read the [checkSingleVsMultiColumnParams](https://github.com/apache/spark/pull/19993/files#diff-72f95a0938e5a140d5126f06bdc381a6R266) method, in this test case it will throw the error, _not_ because both `splits` and `splitsArray` are set, but rather because both `inputCol` & `inputCols` are _unset_. Actually it applies to the line above too. @jkbradley
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162747183

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -201,9 +184,13 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   override def transformSchema(schema: StructType): StructType = {
-    if (isBucketizeMultipleColumns()) {
+    ParamValidators.checkExclusiveParams(this, "inputCol", "inputCols")
--- End diff --

I see. I'll see if I can come up with something which is generic but handles these other checks.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162719519

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -201,9 +184,13 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   override def transformSchema(schema: StructType): StructType = {
-    if (isBucketizeMultipleColumns()) {
+    ParamValidators.checkExclusiveParams(this, "inputCol", "inputCols")
--- End diff --

my initial implementation (with @hhbyyh's comments) was more generic and checked what you said. Afterwards, @MLnick and @viirya asked to switch to a more generic approach, which is the current one you see. I'm fine with either of those, but I think we need to choose one way and go in that direction, otherwise we just lose time.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162717263

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -166,6 +167,8 @@ private[ml] object Param {
 @DeveloperApi
 object ParamValidators {
+
+  private val LOGGER = LoggerFactory.getLogger(ParamValidators.getClass)
--- End diff --

Let's switch this to use the Logging trait, to match other MLlib patterns.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162717142

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -201,9 +184,13 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   override def transformSchema(schema: StructType): StructType = {
-    if (isBucketizeMultipleColumns()) {
+    ParamValidators.checkExclusiveParams(this, "inputCol", "inputCols")
--- End diff --

The problem with trying to use a general method like this is that it's hard to capture model-specific requirements. This currently misses checking to make sure that exactly one (not just <= 1) of each pair is available, plus that all of the single-column OR all of the multi-column Params are available. (The same issue occurs in https://github.com/apache/spark/pull/20146 ) It will also be hard to check these items and account for defaults. I'd argue that it's not worth trying to use generic checking functions here.
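The rule jkbradley describes — exactly one of each exclusive pair set, and the set params all belonging to the same single-column or multi-column group — can be sketched outside of Spark. Below is a minimal, hypothetical Python rendering of that stricter check (the function name and the dict-based param representation are illustrative, not Spark API):

```python
def check_single_vs_multi_column(params, single_group, multi_group):
    """Validate that exactly one of the two param groups is fully set.

    `params` maps param name -> value; a missing key or None means unset.
    Raises ValueError when both groups are (even partially) set, when
    neither group is set, or when the chosen group is only partially set.
    """
    single_set = [p for p in single_group if params.get(p) is not None]
    multi_set = [p for p in multi_group if params.get(p) is not None]
    if single_set and multi_set:
        raise ValueError(
            f"Single-column params {single_set} and multi-column params "
            f"{multi_set} are mutually exclusive; set only one group.")
    if not single_set and not multi_set:
        raise ValueError(
            "Either the single-column or the multi-column params must be set.")
    # Whichever group was chosen must be fully set.
    group = single_group if single_set else multi_group
    chosen = single_set if single_set else multi_set
    missing = [p for p in group if p not in chosen]
    if missing:
        raise ValueError(f"Params {missing} must also be set.")
```

With Bucketizer-style groups (`inputCol`/`outputCol`/`splits` vs `inputCols`/`outputCols`/`splitsArray`) this rejects the mixed and partially-set configurations discussed in this thread; as noted in the comment above, accounting for params with defaults would still need extra, model-specific care.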
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162044954

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

@MLnick Yes. I didn't test the method posted here. The model possibly doesn't have the params, so we need to check it with `model.hasParam`. Please use the method in #20146.
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162043704

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

@viirya your actual method in #20146 is slightly different (see [here](https://github.com/apache/spark/pull/20146/files#diff-72f95a0938e5a140d5126f06bdc381a6R254)). Is that the best version to use?
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r162042318

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

I think @viirya's method is simpler and more general, so why not use it?
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r161699705

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

@MLnick @viirya in order to address https://github.com/apache/spark/pull/19993/files#r161682506, I was thinking of leaving this method as it is (just renaming it, as per @viirya's suggestion) and only adding an `additionalExclusiveParams: (String, String)*` argument to the function. WDYT?
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r161685200

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

If this method looks good to you, maybe you can just copy it from #20146 to use here.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r161684970

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

Based on https://github.com/apache/spark/pull/20146#issuecomment-357068054 from @WeichenXu123, I think #20146 cannot get merged for 2.3.
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r161681586

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

I am not sure if #20146 will get merged for `2.3` - but I think we must merge this PR for `2.3` because I'd prefer not to have this inconsistency in param error handling between `QuantileDiscretizer` and `Bucketizer`. This is a relatively small change, so we can merge it into the branch if we move quickly.
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r161682506

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
@@ -401,15 +390,9 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     }
   }

-  test("Both inputCol and inputCols are set") {
-    val bucket = new Bucketizer()
-      .setInputCol("feature1")
-      .setOutputCol("result")
-      .setSplits(Array(-0.5, 0.0, 0.5))
-      .setInputCols(Array("feature1", "feature2"))
-
-    // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
-    assert(bucket.isBucketizeMultipleColumns() == false)
+  test("assert exception is thrown if both multi-column and single-column params are set") {
--- End diff --

We should also test the other exclusive params (input cols and splits params) as per https://github.com/apache/spark/pull/19993/files#r159133936
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r160143590

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

I think we can use that method once merged, thanks.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r159579102

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

I added this method too in #20146.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r159133936

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
--- End diff --

If we need to check other exclusive params, e.g., `inputCol` and `splitsArray` or `inputCols` and `splits`, why not just have a method like:

```scala
def checkExclusiveParams(model: Params, params: String*): Unit = {
  if (params.filter(model.isSet(_)).size > 1) {
    val paramString = params.mkString("`", "`, `", "`")
    throw new IllegalArgumentException(
      s"$paramString are exclusive, but more than one among them are set.")
  }
}

ParamValidators.checkExclusiveParams(this, "inputCol", "inputCols")
ParamValidators.checkExclusiveParams(this, "outputCol", "outputCols")
ParamValidators.checkExclusiveParams(this, "inputCol", "splitsArray")
ParamValidators.checkExclusiveParams(this, "inputCols", "splits")
```
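Stripped of the Spark types, the proposal above is simply an "at most one param in this group may be set" guard. A hypothetical Python rendering of the same check, for illustration only (the names are not Spark API, and the real method would resolve `Param` objects rather than strings):

```python
def check_exclusive_params(set_params, *names):
    """Raise if more than one name in `names` appears in `set_params`,
    the collection of param names the caller has explicitly set."""
    conflicting = [n for n in names if n in set_params]
    if len(conflicting) > 1:
        joined = ", ".join(f"`{n}`" for n in conflicting)
        raise ValueError(
            f"{joined} are exclusive, but more than one among them are set.")
```

Calling it once per exclusive pair keeps the validator generic; the trade-off raised elsewhere in the thread is that a pairwise check like this cannot express "exactly one must be set" or "all params of one group must be set together".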
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r159100390

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -430,4 +433,45 @@ object ParamsSuite extends SparkFunSuite {
     require(copyReturnType === obj.getClass,
       s"${clazz.getName}.copy should return ${clazz.getName} instead of ${copyReturnType.getName}.")
   }
+
+  /**
+   * Checks that the class throws an exception in case both `inputCols` and `inputCol` are set and
+   * in case both `outputCols` and `outputCol` are set.
+   * These checks are performed only whether the class extends respectively both `HasInputCols` and
--- End diff --

"only whether" -> "only when"
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r159100191

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,27 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
+  }
+
+  private[spark] def raiseIncompatibleParamsException(
+      paramName1: String,
+      paramName2: String): Unit = {
+    throw new IllegalArgumentException(s"`$paramName1` and `$paramName2` cannot be both set.")
--- End diff --

"cannot be both" -> "cannot both be"
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r159099688

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -137,18 +137,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   /**
    * Determines whether this `Bucketizer` is going to map multiple columns. If and only if
    * `inputCols` is set, it will map multiple columns. Otherwise, it just maps a column specified
-   * by `inputCol`. A warning will be printed if both are set.
+   * by `inputCol`.
    */
   private[feature] def isBucketizeMultipleColumns(): Boolean = {
-    if (isSet(inputCols) && isSet(inputCol)) {
-      logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
-        "`Bucketizer` only map one column specified by `inputCol`")
-      false
-    } else if (isSet(inputCols)) {
-      true
-    } else {
-      false
-    }
+    isSet(inputCols)
--- End diff --

Seems superfluous to now have a separate method for this
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r159100299

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
@@ -401,15 +401,9 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     }
   }

-  test("Both inputCol and inputCols are set") {
-    val bucket = new Bucketizer()
-      .setInputCol("feature1")
-      .setOutputCol("result")
-      .setSplits(Array(-0.5, 0.0, 0.5))
-      .setInputCols(Array("feature1", "feature2"))
-
-    // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
-    assert(bucket.isBucketizeMultipleColumns() == false)
+  test("assert exception is thrown is both multi-column and single-column params are set") {
--- End diff --

"is thrown is both" -> "is thrown if both"
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r158344862

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,31 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def checkMultiColumnParams(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasInputCol if m.isSet(m.outputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("outputCols", "inputCol")
+      case m: HasInputCols with HasOutputCol if m.isSet(m.inputCols) && m.isSet(m.outputCol) =>
--- End diff --

Sorry to miss it, but I just found that FeatureHasher has both InputCols and OutputCol. I think we can remove the case and the one above since they can be too strict.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r158170154

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -430,4 +433,49 @@ object ParamsSuite extends SparkFunSuite {
     require(copyReturnType === obj.getClass,
       s"${clazz.getName}.copy should return ${clazz.getName} instead of ${copyReturnType.getName}.")
   }
+
+  /**
+   * Checks that the class throws an exception in case both `inputCols` and `inputCol` are set and
+   * in case both `outputCols` and `outputCol` are set.
+   * These checks are performed only whether the class extends respectively both `HasInputCols` and
+   * `HasInputCol` and both `HasOutputCols` and `HasOutputCol`.
+   *
+   * @param paramsClass The Class to be checked
+   * @param spark A `SparkSession` instance to use
+   */
+  def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = {
+    import spark.implicits._
+    // create fake input Dataset
+    val feature1 = Array(-1.0, 0.0, 1.0)
+    val feature2 = Array(1.0, 0.0, -1.0)
+    val df = feature1.zip(feature2).toSeq.toDF("feature1", "feature2")
--- End diff --

Send column names as parameter if necessary. We need to ensure the test utility can be used by most transformers with multiple column support.

> I think that the type check is performed later, thus it is not a problem here.

I don't quite get it; in either `transform` or `fit` the data type will be checked and they will trigger exceptions.

> I preferred to use paramsClass: Class[_ <: Params]

I'm only thinking about the case that the default constructor is not sufficient to create a working Estimator/Transformer. If that's not a concern, then reflection is OK.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r158158375

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -430,4 +433,49 @@ object ParamsSuite extends SparkFunSuite {
     require(copyReturnType === obj.getClass,
       s"${clazz.getName}.copy should return ${clazz.getName} instead of ${copyReturnType.getName}.")
   }
+
+  /**
+   * Checks that the class throws an exception in case both `inputCols` and `inputCol` are set and
+   * in case both `outputCols` and `outputCol` are set.
+   * These checks are performed only whether the class extends respectively both `HasInputCols` and
+   * `HasInputCol` and both `HasOutputCols` and `HasOutputCol`.
+   *
+   * @param paramsClass The Class to be checked
+   * @param spark A `SparkSession` instance to use
+   */
+  def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = {
+    import spark.implicits._
+    // create fake input Dataset
+    val feature1 = Array(-1.0, 0.0, 1.0)
+    val feature2 = Array(1.0, 0.0, -1.0)
+    val df = feature1.zip(feature2).toSeq.toDF("feature1", "feature2")
--- End diff --

The reason why I created the DataFrame inside the method was to control the names of its columns; otherwise we can't ensure that those columns exist. I think that the type check is performed later, thus it is not a problem here. What do you think?

I preferred to use `paramsClass: Class[_ <: Params]` because I need a clean instance for each of the two checks: if an instance is passed in, I cannot enforce that it is clean, i.e. that no parameters were already set, and I would need to copy it to create new instances anyway, since otherwise the second check would be influenced by the first one. What do you think? Thanks.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r158154050

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -430,4 +433,49 @@ object ParamsSuite extends SparkFunSuite {
     require(copyReturnType === obj.getClass,
       s"${clazz.getName}.copy should return ${clazz.getName} instead of ${copyReturnType.getName}.")
   }
+
+  /**
+   * Checks that the class throws an exception in case both `inputCols` and `inputCol` are set and
+   * in case both `outputCols` and `outputCol` are set.
+   * These checks are performed only whether the class extends respectively both `HasInputCols` and
+   * `HasInputCol` and both `HasOutputCols` and `HasOutputCol`.
+   *
+   * @param paramsClass The Class to be checked
+   * @param spark A `SparkSession` instance to use
+   */
+  def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = {
+    import spark.implicits._
+    // create fake input Dataset
+    val feature1 = Array(-1.0, 0.0, 1.0)
+    val feature2 = Array(1.0, 0.0, -1.0)
+    val df = feature1.zip(feature2).toSeq.toDF("feature1", "feature2")
--- End diff --

I don't think the DataFrame here can be used for other transformers like StringIndexer. How about adding a Dataset as a function parameter? And is it possible to use an instance `obj: Params` rather than `paramsClass: Class[_ <: Params]` as the parameter, just to be more flexible?
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r158153277

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -430,4 +433,49 @@ object ParamsSuite extends SparkFunSuite {
     require(copyReturnType === obj.getClass,
       s"${clazz.getName}.copy should return ${clazz.getName} instead of ${copyReturnType.getName}.")
   }
+
+  /**
+   * Checks that the class throws an exception in case both `inputCols` and `inputCol` are set and
+   * in case both `outputCols` and `outputCol` are set.
+   * These checks are performed only whether the class extends respectively both `HasInputCols` and
+   * `HasInputCol` and both `HasOutputCols` and `HasOutputCol`.
+   *
+   * @param paramsClass The Class to be checked
+   * @param spark A `SparkSession` instance to use
+   */
+  def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = {
--- End diff --

Suggestion for the function name: checkMultiColumnParams --> testMultiColumnParams
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r158153048

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,31 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def assertColOrCols(model: Params): Unit = {
--- End diff --

Suggestion for the function name: assertColOrCols --> checkMultiColumnParams
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157870176

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,29 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  def assertColOrCols(model: Params): Unit = {
--- End diff --

This should be `private[spark]`.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157871042

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,29 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  def assertColOrCols(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasInputCol if m.isSet(m.outputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("outputCols", "inputCol")
+      case m: HasInputCols with HasOutputCol if m.isSet(m.inputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("inputCols", "outputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
+  }
+
+  def raiseIncompatibleParamsException(paramName1: String, paramName2: String): Unit = {
+    throw new IllegalArgumentException(s"Both `$paramName1` and `$paramName2` are set.")
--- End diff --

The error message could be more straightforward, e.g. "`$paramName1` and `$paramName2` cannot be set simultaneously."
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157867496

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -137,18 +137,17 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   /**
    * Determines whether this `Bucketizer` is going to map multiple columns. If and only if
    * `inputCols` is set, it will map multiple columns. Otherwise, it just maps a column specified
-   * by `inputCol`. A warning will be printed if both are set.
+   * by `inputCol`. An exception will be thrown if both are set.
    */
   private[feature] def isBucketizeMultipleColumns(): Boolean = {
-    if (isSet(inputCols) && isSet(inputCol)) {
-      logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
-        "`Bucketizer` only map one column specified by `inputCol`")
-      false
-    } else if (isSet(inputCols)) {
-      true
-    } else {
-      false
+    ParamValidators.assertColOrCols(this)
+    if (isSet(inputCol) && isSet(splitsArray)) {
--- End diff --

I noticed `isBucketizeMultipleColumns` is invoked in many places; maybe we can put the checks elsewhere, e.g. in `transformSchema`. That would also keep the code consistent with the function name.
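The shape hhbyyh suggests — keep `isBucketizeMultipleColumns` a pure predicate and validate once at the schema-checking entry point — is a common pattern. A hedged Python sketch (class and method names are hypothetical illustrations, not the actual Spark API):

```python
class ToyBucketizer:
    """Toy model of a transformer with single- and multi-column modes."""

    def __init__(self, **params):
        self.params = params

    def _is_multi_column(self):
        # Pure predicate with no side effects, safe to call from many places.
        return "inputCols" in self.params

    def transform_schema(self, schema):
        # Single validation chokepoint, mirroring transformSchema: every
        # transform/fit path goes through here exactly once.
        if "inputCol" in self.params and "inputCols" in self.params:
            raise ValueError(
                "`inputCols` and `inputCol` cannot be set simultaneously.")
        if self._is_multi_column():
            return schema + self.params.get("outputCols", [])
        return schema + [self.params.get("outputCol", "output")]
```

Centralizing the check means the predicate can be queried freely during setup without triggering exceptions halfway through configuring the params.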
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157870214

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,29 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  def assertColOrCols(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasInputCol if m.isSet(m.outputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("outputCols", "inputCol")
+      case m: HasInputCols with HasOutputCol if m.isSet(m.inputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("inputCols", "outputCol")
+      case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) =>
+        raiseIncompatibleParamsException("outputCols", "outputCol")
+      case _ =>
+    }
+  }
+
+  def raiseIncompatibleParamsException(paramName1: String, paramName2: String): Unit = {
--- End diff --

This should be `private[spark]`.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157869596

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +250,29 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /**
+   * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If
+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  def assertColOrCols(model: Params): Unit = {
+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasInputCol if m.isSet(m.outputCols) && m.isSet(m.inputCol) =>
--- End diff --

This may not necessarily be an error for some classes, but we can keep it for now.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157684803

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
    * by `inputCol`. A warning will be printed if both are set.
--- End diff --

We should update the documentation above too.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157414340

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
    * by `inputCol`. A warning will be printed if both are set.
    */
   private[feature] def isBucketizeMultipleColumns(): Boolean = {
-    if (isSet(inputCols) && isSet(inputCol)) {
-      logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
-        "`Bucketizer` only map one column specified by `inputCol`")
-      false
+    if (isSet(inputCols) && isSet(inputCol) || isSet(inputCols) && isSet(outputCol) ||
+        isSet(inputCol) && isSet(outputCols)) {
+      throw new IllegalArgumentException("Both `inputCol` and `inputCols` are set, `Bucketizer` " +
+        "only supports setting either `inputCol` or `inputCols`.")
--- End diff --

Thanks, I will add these checks while doing the changes requested by @hhbyyh.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19993#discussion_r157393913

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
    * by `inputCol`. A warning will be printed if both are set.
    */
   private[feature] def isBucketizeMultipleColumns(): Boolean = {
-    if (isSet(inputCols) && isSet(inputCol)) {
-      logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
-        "`Bucketizer` only map one column specified by `inputCol`")
-      false
+    if (isSet(inputCols) && isSet(inputCol) || isSet(inputCols) && isSet(outputCol) ||
+        isSet(inputCol) && isSet(outputCols)) {
+      throw new IllegalArgumentException("Both `inputCol` and `inputCols` are set, `Bucketizer` " +
+        "only supports setting either `inputCol` or `inputCols`.")
--- End diff --

It would be better to add one more check here: in single-column mode, only the `splits` param should be set; in multi-column mode, only the `splitsArray` param should be set.
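The extra check WeichenXu123 asks for pairs each column mode with its matching splits parameter. A minimal Python sketch of the intended rule (a hypothetical dict-based param map, not the real Bucketizer API):

```python
def check_bucketizer_params(params):
    """Single-column mode (`inputCol`) must use `splits`; multi-column mode
    (`inputCols`) must use `splitsArray`. Mixing the modes raises."""
    if "inputCol" in params and "splitsArray" in params:
        raise ValueError("`inputCol` requires `splits`, not `splitsArray`.")
    if "inputCols" in params and "splits" in params:
        raise ValueError("`inputCols` requires `splitsArray`, not `splits`.")

# A consistent single-column configuration passes silently.
check_bucketizer_params({"inputCol": "f", "splits": [-0.5, 0.0, 0.5]})
```

This is the same shape the final test names in the review thread (`testExclusiveParams` with `splits` vs. `splitsArray` pairs) exercise.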
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/19993

[SPARK-22799][ML] Bucketizer should throw exception if single- and multi-column params are both set

## What changes were proposed in this pull request?

Currently there is a mixed situation where both single- and multi-column params are supported: in some cases exceptions are thrown, in others only a warning log is emitted. In the discussion at https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049, the decision was to throw an exception. This PR throws an exception in `Bucketizer` instead of logging a warning.

## How was this patch tested?

Modified UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22799

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19993.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19993

commit 8f3581cf780dabdbf27ffd7924edf8c344d69b30
Author: Marco Gaido
Date: 2017-12-15T18:11:18Z

    [SPARK-22799][ML] Bucketizer should throw exception if single- and multi-column params are both set
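The behavioral change the PR description summarizes — fail fast instead of warning and silently ignoring `inputCols` — can be contrasted in a few lines. A Python sketch with hypothetical helper names (a dict stands in for the param map):

```python
import warnings

def resolve_columns_old(params):
    # Pre-PR behaviour: warn and silently prefer inputCol over inputCols.
    if "inputCol" in params and "inputCols" in params:
        warnings.warn("Both `inputCol` and `inputCols` are set; "
                      "ignoring `inputCols`.")
        return [params["inputCol"]]
    return params.get("inputCols", [params.get("inputCol")])

def resolve_columns_new(params):
    # Post-PR behaviour: an ambiguous configuration is an error.
    if "inputCol" in params and "inputCols" in params:
        raise ValueError(
            "`inputCol` and `inputCols` cannot be set simultaneously.")
    return params.get("inputCols", [params.get("inputCol")])
```

The silent-preference variant can mask user mistakes deep inside a pipeline, which is why the linked SPARK-8418 discussion settled on raising an exception.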