[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435651#comment-15435651 ]
Barry Becker commented on SPARK-17219:
--------------------------------------

If the decision is to have an additional null/NaN bucket, then I agree that other choices aren't needed. I also agree that the null/NaN bucket can be separate from maxBins (i.e. request 10 buckets, but get 11). A couple of other things to consider:

- I think there should always be a null/NaN bucket present, for the same reason that the first and last bins are -Inf and +Inf respectively: just because there were no nulls in the training/fitting data does not mean they will not come through later and need to be placed somewhere.
- Currently, validation fails if fewer than 3 splits are specified for a Bucketizer. I actually think that 2 splits should be the minimum, even though that means only 1 bucket! The reason is that some algorithms (like Naive Bayes) may choose to bin features (using MDLP discretization, for example) into just 2 buckets: null and non-null. If we now have a null bucket always present, we may want just a single [-Inf, Inf] bucket for non-nulls, as strange as that sounds.

> QuantileDiscretizer does strange things with NaN values
> -------------------------------------------------------
>
>                 Key: SPARK-17219
>                 URL: https://issues.apache.org/jira/browse/SPARK-17219
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.2
>            Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that
> contains NaNs, it will create (possibly more than one) NaN split(s) before
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket
> all their own. My suggestion would be to include an initial NaN split value
> that is always there, just like the sentinel Infinities are. If that were the
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either, because a bucket that is [NaN, -Inf] doesn't
> make much sense. I'm not sure whether the NaN bucket should count toward numBins, but I
> do think it should always be there in case future data has nulls even
> though the fit data did not. Thoughts?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
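To make the proposal in the comment concrete, here is a minimal, hypothetical sketch (in plain Python, not Spark's actual Bucketizer API) of bucket assignment where NaN values are routed to a dedicated extra bucket appended after the regular [-Inf, ..., +Inf] splits, so the NaN bucket does not count toward the requested number of bins. The function name `bucketize` and its behavior are illustrative assumptions, not the library's implementation:

```python
import bisect
import math

def bucketize(value, splits):
    """Assign value to a bucket index given sorted splits [-inf, s1, ..., +inf].

    Hypothetical behavior sketched from the JIRA comment: NaN gets its own
    extra bucket at index len(splits) - 1, one past the last regular bucket,
    so it is always present even if the fitting data had no NaNs.
    """
    if math.isnan(value):
        return len(splits) - 1          # dedicated null/NaN bucket
    # bisect_right finds the first split strictly greater than value,
    # so regular bucket i covers [splits[i], splits[i+1])
    idx = bisect.bisect_right(splits, value) - 1
    return min(idx, len(splits) - 2)    # clamp +inf into the last real bucket

# Splits from the ticket's "age" example, without the bogus NaN splits:
splits = [-math.inf, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, math.inf]
```

With this scheme the minimal 2-split case the comment argues for also works: `bucketize(5.0, [-math.inf, math.inf])` yields the single non-null bucket 0, while `bucketize(float('nan'), [-math.inf, math.inf])` yields the always-present NaN bucket 1.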