Menglong TAN created SPARK-19781:
------------------------------------

             Summary: Bucketizer's handleInvalid leave null values untouched unlike the NaNs
                 Key: SPARK-19781
                 URL: https://issues.apache.org/jira/browse/SPARK-19781
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 2.1.0
            Reporter: Menglong TAN
            Priority: Minor

Bucketizer can put NaN values into a special bucket when handleInvalid is set to "keep", but it leaves null values untouched.

    import org.apache.spark.ml.feature.Bucketizer

    val data = sc.parallelize(Seq(("crackcell", null.asInstanceOf[java.lang.Double]))).toDF("name", "number")
    val bucketizer = new Bucketizer()
      .setInputCol("number")
      .setOutputCol("number_output")
      .setSplits(Array(Double.NegativeInfinity, 0, 10, Double.PositiveInfinity))
      .setHandleInvalid("keep")
    val res = bucketizer.transform(data)
    res.show(1)

will output:

    +---------+------+-------------+
    |     name|number|number_output|
    +---------+------+-------------+
    |crackcell|  null|         null|
    +---------+------+-------------+

If we change null to NaN:

    val data2 = sc.parallelize(Seq(("crackcell", Double.NaN))).toDF("name", "number")
    // data2: org.apache.spark.sql.DataFrame = [name: string, number: double]
    bucketizer.transform(data2).show(1)

will output:

    +---------+------+-------------+
    |     name|number|number_output|
    +---------+------+-------------+
    |crackcell|   NaN|          3.0|
    +---------+------+-------------+

Maybe we should unify the behaviours? Is it reasonable to process nulls as well? If so, maybe my code can help. :-)
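
Until the two cases are unified, one possible workaround is to turn nulls into NaN before calling Bucketizer, so that handleInvalid = "keep" sends them to the same special bucket. The snippet below is only a sketch of that idea and is not the change proposed for Bucketizer itself. It assumes a local SparkSession, and the second row ("other", 5.0) is made up for illustration.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Assumption: a local session for illustration; in spark-shell this returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("bucketizer-null-sketch")
  .getOrCreate()
import spark.implicits._

// Option[Double] becomes a nullable double column; None is stored as null.
// The "other" row is hypothetical and only shows that regular values are unaffected.
val data = Seq(
  ("crackcell", Option.empty[Double]),
  ("other", Option(5.0))
).toDF("name", "number")

// Workaround: replace null with NaN so that handleInvalid = "keep" can pick it up.
val filled = data.withColumn("number", coalesce(col("number"), lit(Double.NaN)))

val bucketizer = new Bucketizer()
  .setInputCol("number")
  .setOutputCol("number_output")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity))
  .setHandleInvalid("keep")

val res = bucketizer.transform(filled)
res.show()
```

With these splits the former null row should end up in bucket 3.0, the same bucket as the NaN row in the example above. The trade-off is that the input column is overwritten, so real NaNs and former nulls can no longer be told apart afterwards; writing the filled values to a separate column would avoid that.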