GitHub user MarcKaminski commented on the issue:

https://github.com/apache/spark/pull/17819

Hello,

I found a bug that occurs when putting the new Bucketizer into a Pipeline and calling `fit` on it.

Calling `fit` on a Pipeline calls `transformSchema` on each PipelineStage in it. The `transformSchema` [method of the Bucketizer](https://github.com/viirya/spark-1/blob/f8dedd1c92a8c48358743626b99c2f2192bc09b1/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala#L146) is therefore called, and it checks the params of the **conventional** Bucketizer (i.e. `inputCol`), which are not set when the multi-column params are used.

Steps to reproduce:

```
import org.apache.spark.ml._
import org.apache.spark.ml.feature.Bucketizer

case class data(f1: Double, f2: Double)
val datArr = Array(data(0.5, 0.3), data(0.5, -0.4))
val df = spark.createDataFrame(datArr)

val bucket = new Bucketizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("f1_bu", "f2_bu"))
  .setSplitsArray(Array(Array(-0.5, 0.0, 0.5), Array(-0.5, 0.0, 0.5)))

// Will work
bucket.transform(df).show()

// Will fail catastrophically
val pl = new Pipeline()
  .setStages(Array(bucket))
  .fit(df)
```

Exception thrown by the last line:

```
java.util.NoSuchElementException: Failed to find a default value for inputCol
  at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:691)
  at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:691)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:690)
  at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
  at org.apache.spark.ml.param.Params$class.$(params.scala:697)
  at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
  at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:147)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
  at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
  ... 55 elided
```

Since this has not yet been merged into master, maybe you would still be able to fix this and add a test for it?

Thanks!
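
P.S. To make the failure mode concrete, here is a rough sketch using a toy model rather than Spark itself. `ToyBucketizer`, `transformSchemaSingleOnly`, `transformSchemaBothModes` and `ToyPipelineDemo` are hypothetical names made up for illustration, and schemas are reduced to plain lists of column names. It only shows the shape of the problem (schema validation reading a param that was never set) and one possible way to branch on the multi-column params. It is not the actual Bucketizer code, and not necessarily how the fix should look.

```scala
import scala.util.Try

// Toy model only: these are NOT Spark's classes, and every name is hypothetical.
final case class ToyBucketizer(
    inputCol: Option[String] = None,
    outputCol: Option[String] = None,
    inputCols: Seq[String] = Seq.empty,
    outputCols: Seq[String] = Seq.empty) {

  // Mimics reading a param that has neither a value nor a default.
  private def getOrFail(name: String, value: Option[String]): String =
    value.getOrElse(
      throw new NoSuchElementException(s"Failed to find a default value for $name"))

  // Reported behaviour: schema validation reads only the single-column params.
  def transformSchemaSingleOnly(schema: Seq[String]): Seq[String] = {
    getOrFail("inputCol", inputCol)
    schema :+ getOrFail("outputCol", outputCol)
  }

  // One possible fix: use the multi-column params whenever they are set.
  def transformSchemaBothModes(schema: Seq[String]): Seq[String] =
    if (inputCols.nonEmpty) {
      require(inputCols.length == outputCols.length,
        "inputCols and outputCols must have the same length")
      schema ++ outputCols
    } else {
      transformSchemaSingleOnly(schema)
    }
}

object ToyPipelineDemo {
  val schema: Seq[String] = Seq("f1", "f2")

  // Configured like the Bucketizer in the reproduction: multi-column params only.
  val bucket: ToyBucketizer =
    ToyBucketizer(inputCols = Seq("f1", "f2"), outputCols = Seq("f1_bu", "f2_bu"))

  // Fails because inputCol was never set, as in the stack trace.
  val before: Try[Seq[String]] = Try(bucket.transformSchemaSingleOnly(schema))

  // Succeeds and appends the two output columns to the schema.
  val after: Try[Seq[String]] = Try(bucket.transformSchemaBothModes(schema))

  def main(args: Array[String]): Unit = {
    println(before)
    println(after)
  }
}
```

A test for the real fix could follow the same idea: build a Pipeline around a Bucketizer configured only with the multi-column params and check that `fit` no longer throws.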