[ https://issues.apache.org/jira/browse/SPARK-19781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Menglong TAN updated SPARK-19781:
---------------------------------
Description: Bucketizer can put NaN values into a special bucket when handleInvalid is enabled, but it leaves null values untouched.

```
import org.apache.spark.ml.feature.Bucketizer

val data = sc.parallelize(Seq(("crackcell", null.asInstanceOf[java.lang.Double]))).toDF("name", "number")
val bucketizer = new Bucketizer()
  .setInputCol("number")
  .setOutputCol("number_output")
  .setSplits(Array(Double.NegativeInfinity, 0, 10, Double.PositiveInfinity))
  .setHandleInvalid("keep")
val res = bucketizer.transform(data)
res.show(1)
```

will output:

```
+---------+------+-------------+
|     name|number|number_output|
+---------+------+-------------+
|crackcell|  null|         null|
+---------+------+-------------+
```

If we change null to NaN:

```
val data2 = sc.parallelize(Seq(("crackcell", Double.NaN))).toDF("name", "number")
// data2: org.apache.spark.sql.DataFrame = [name: string, number: double]
bucketizer.transform(data2).show(1)
```

will output:

```
+---------+------+-------------+
|     name|number|number_output|
+---------+------+-------------+
|crackcell|   NaN|          3.0|
+---------+------+-------------+
```

Maybe we should unify the behaviours? Is it reasonable to process nulls as well? If so, maybe my code can help. :-)
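Until Bucketizer handles nulls itself, one possible workaround (a minimal sketch, not part of the original report) is to map nulls to NaN before bucketizing, so that handleInvalid("keep") routes them to the same special bucket. This assumes DataFrameNaFunctions.fill(Double.NaN, Seq("number")) replaces nulls in the named column with NaN:

```
import org.apache.spark.ml.feature.Bucketizer

// Workaround sketch: replace nulls in "number" with NaN so the existing
// handleInvalid("keep") logic can assign them to the extra bucket.
val data = sc.parallelize(Seq(("crackcell", null.asInstanceOf[java.lang.Double]))).toDF("name", "number")
val filled = data.na.fill(Double.NaN, Seq("number")) // null -> NaN

val bucketizer = new Bucketizer()
  .setInputCol("number")
  .setOutputCol("number_output")
  .setSplits(Array(Double.NegativeInfinity, 0, 10, Double.PositiveInfinity))
  .setHandleInvalid("keep")

bucketizer.transform(filled).show(1)
// Expected, matching the NaN example above:
// |crackcell|   NaN|          3.0|
```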
> Bucketizer's handleInvalid leaves null values untouched, unlike NaNs
> ---------------------------------------------------------------------
>
>                 Key: SPARK-19781
>                 URL: https://issues.apache.org/jira/browse/SPARK-19781
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.1.0
>            Reporter: Menglong TAN
>            Priority: Minor
>              Labels: MLlib
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Bucketizer can put NaN values into a special bucket when handleInvalid is enabled, but it leaves null values untouched.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)