[ https://issues.apache.org/jira/browse/SPARK-19781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890373#comment-15890373 ]
Apache Spark commented on SPARK-19781:
--------------------------------------

User 'crackcell' has created a pull request for this issue:
https://github.com/apache/spark/pull/17123

> Bucketizer's handleInvalid leave null values untouched unlike the NaNs
> ----------------------------------------------------------------------
>
>                 Key: SPARK-19781
>                 URL: https://issues.apache.org/jira/browse/SPARK-19781
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.1.0
>            Reporter: Menglong TAN
>            Priority: Minor
>              Labels: MLlib
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Bucketizer can put NaN values into a special bucket when handleInvalid is
> set to "keep", but it leaves null values untouched.
> {code}
> import org.apache.spark.ml.feature.Bucketizer
> // One row whose "number" column is null.
> val data = sc.parallelize(Seq(("crackcell", null.asInstanceOf[java.lang.Double]))).toDF("name", "number")
> val bucketizer = new Bucketizer()
>   .setInputCol("number")
>   .setOutputCol("number_output")
>   .setSplits(Array(Double.NegativeInfinity, 0, 10, Double.PositiveInfinity))
>   .setHandleInvalid("keep")
> val res = bucketizer.transform(data)
> res.show(1)
> {code}
> will output:
> {quote}
> +---------+------+-------------+
> |     name|number|number_output|
> +---------+------+-------------+
> |crackcell|  null|         null|
> +---------+------+-------------+
> {quote}
> If we change null to NaN:
> {code}
> val data2 = sc.parallelize(Seq(("crackcell", Double.NaN))).toDF("name", "number")
> // data2: org.apache.spark.sql.DataFrame = [name: string, number: double]
> bucketizer.transform(data2).show(1)
> {code}
> will output:
> {quote}
> +---------+------+-------------+
> |     name|number|number_output|
> +---------+------+-------------+
> |crackcell|   NaN|          3.0|
> +---------+------+-------------+
> {quote}
> Maybe we should unify the behaviours? Is it reasonable to process nulls as
> well? If so, maybe my code can help.
> :-)
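
To make the proposed unification concrete, here is a minimal plain-Scala sketch of a bucketing rule that sends null to the same extra bucket as NaN. It is only a model of the behaviour the issue asks about, not Spark's Bucketizer code and not the code in the pull request. The object `NullAwareBucketing`, the helper `bucketOf` and the `keepInvalid` flag are hypothetical names, and returning `None` for skipped or out-of-range values is a simplification made for this sketch.

```scala
// Plain-Scala model of the unified rule discussed in the issue.
// NOT Spark's implementation; all names here are hypothetical.
object NullAwareBucketing {

  // Returns the bucket index for `value`, or None when no bucket is assigned.
  // Assumption: null is treated exactly like NaN, which is the proposal.
  def bucketOf(value: java.lang.Double,
               splits: Array[Double],
               keepInvalid: Boolean): Option[Double] = {
    val invalid = value == null || value.isNaN()
    if (invalid) {
      // n splits define n - 1 regular buckets (0 .. n - 2), so the extra
      // bucket for invalid values gets index n - 1.
      if (keepInvalid) Some((splits.length - 1).toDouble) else None
    } else {
      val v = value.doubleValue
      if (v == splits.last) {
        // The last regular bucket also includes its upper bound.
        Some((splits.length - 2).toDouble)
      } else {
        // Regular bucket i covers [splits(i), splits(i + 1)).
        val idx = splits.lastIndexWhere(_ <= v)
        // Simplification: out-of-range values yield None in this sketch.
        if (idx < 0 || idx >= splits.length - 1) None else Some(idx.toDouble)
      }
    }
  }
}
```

With the splits from the report (-Infinity, 0, 10, +Infinity), this model puts NaN in bucket 3.0, which matches the output shown in the issue, and it puts null in the same bucket rather than leaving it null.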