[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435651#comment-15435651
 ] 

Barry Becker commented on SPARK-17219:
--------------------------------------

If the decision is to have an additional null/NaN bucket, then I agree that 
other choices aren't needed. 
I agree that the null/NaN bucket can be separate from maxBins (i.e. 
request 10, but get 11).
A couple of other things to consider:
- I think there should always be a null/NaN bucket present for the same reason 
that the first and last bins are -Inf and +Inf respectively. Just because there 
were no nulls in the training/fitting data does not mean that they will not 
come through later and need to be placed somewhere.
-  Currently validation fails if there are fewer than 3 splits specified for a 
Bucketizer. I actually think that 2 splits should be the minimum - even though 
that means only 1 bucket! The reason is that some algorithms (like Naive Bayes) 
may choose to bin features (using MDLP discretization for example) into just 2 
buckets - null and non-null. If we now have a null bucket always present, we 
may just want a single [-Inf, Inf] bucket for non-nulls - as strange as that 
sounds.
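
To make the two points above concrete, here is a minimal sketch in plain Scala with no Spark dependency. The object NaNBucketSketch and its bucketIndex method are hypothetical names for illustration only; this is not Bucketizer's actual API or behavior, just one way the proposal could work, assuming strictly increasing splits and a NaN bucket placed after the regular ones.

```scala
// Hypothetical illustration of the proposal; not Spark's Bucketizer.
object NaNBucketSketch {

  /** Returns the bucket index for `value`.
    *
    * Assumptions (mine, for the sketch):
    *  - `splits` is strictly increasing and contains no NaN;
    *  - regular bucket i covers [splits(i), splits(i + 1)), and the last
    *    regular bucket also includes its upper bound;
    *  - NaN always maps to one extra bucket, index splits.length - 1,
    *    whether or not NaN appeared in the fitting data.
    */
  def bucketIndex(splits: Array[Double], value: Double): Int = {
    // 2 splits (a single regular bucket) is accepted as the minimum.
    require(splits.length >= 2, "need at least 2 splits")
    val numRegular = splits.length - 1
    if (value.isNaN) {
      numRegular // the always-present NaN bucket
    } else if (value == splits.last) {
      numRegular - 1 // upper bound belongs to the last regular bucket
    } else {
      // Index of the largest split that is <= value.
      val i = splits.lastIndexWhere(_ <= value)
      require(i >= 0 && i < numRegular, s"value $value is outside the splits")
      i
    }
  }
}
```

With splits of just [-Inf, Inf], every non-NaN value lands in bucket 0 and NaN lands in bucket 1, which is the null / non-null binning described above.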


> QuantileDiscretizer does strange things with NaN values
> -------------------------------------------------------
>
>                 Key: SPARK-17219
>                 URL: https://issues.apache.org/jira/browse/SPARK-17219
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.2
>            Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached Titanic CSV data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has nulls even 
> though the fit data did not. Thoughts?


