[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15436863#comment-15436863 ]
Sean Owen commented on SPARK-17219:
-----------------------------------

Yes, agree with that. I think it will involve a change to anything that bucketizes, and anything that consumes the buckets, because it will require special handling to put NaN in the right 'bucket'.

> QuantileDiscretizer does strange things with NaN values
> -------------------------------------------------------
>
>                 Key: SPARK-17219
>                 URL: https://issues.apache.org/jira/browse/SPARK-17219
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.2
>            Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that
> contains NaNs, it will create (possibly more than one) NaN split(s) before
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket
> all their own. My suggestion would be to include an initial NaN split value
> that is always there, just like the sentinel Infinities are. If that were the
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I
> do think it should always be there though in case future data has null even
> though the fit data did not. Thoughts?
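
To make the "special handling" concrete, here is a minimal sketch in plain Python (not Spark code, and not the fix that was or will be applied). It assumes one possible design: NaN never appears as a split value, and NaN inputs are routed to one extra bucket after the regular ones. The function names clean_splits and bucketize are hypothetical and exist only for this illustration.

```python
import math
from bisect import bisect_right


def clean_splits(candidates):
    """Drop NaN and infinite candidates, de-duplicate, sort,
    then add the -inf/+inf sentinels back exactly once."""
    finite = sorted(
        {c for c in candidates if not math.isnan(c) and not math.isinf(c)}
    )
    return [-math.inf] + finite + [math.inf]


def bucketize(value, splits):
    """Return the bucket index for value.

    Regular bucket i covers [splits[i], splits[i + 1]); the last regular
    bucket also includes +inf. NaN is sent to one extra bucket whose index
    is len(splits) - 1 (an assumption made for this sketch).
    """
    n_regular = len(splits) - 1
    if math.isnan(value):
        return n_regular
    idx = bisect_right(splits, value) - 1
    # Clamp so that +inf lands in the last regular bucket.
    return min(idx, n_regular - 1)


# Splits as reported in the issue, including the two stray NaN entries.
reported = [-math.inf, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0,
            math.nan, math.nan, math.inf]
splits = clean_splits(reported)
```

With the reported splits, clean_splits leaves nine boundaries (eight regular buckets), an age of 15.0 falls in bucket 1, and NaN falls in bucket 8. Whether that extra bucket should count toward numBins is left open here, as it is in the report.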