[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435484#comment-15435484 ]

Barry Becker commented on SPARK-17219:
--------------------------------------

Nulls were not accepted in the column, so I had to change them from null to NaN 
before they would be accepted. If there were a way to keep them as nulls and 
have the discretization still work, that would be preferable. I think there 
should be some strategy for dealing with null (or NaN) values besides just 
ignoring them. They are legitimate values in many cases: they represent missing 
data, which may be very important and useful for some algorithms. How about 
having the ability to set a nullStrategy param that indicates what to do with 
them? For example, throw an error, ignore them, put them in a special bucket, 
etc.
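
Until something like a nullStrategy param exists, the "special bucket" behavior 
can be approximated by fitting on the non-NaN rows and routing the NaN rows to a 
dedicated bucket afterwards. This is only a sketch: the DataFrame `df`, the 
column names ("age", "ageBucket"), and the -1.0 bucket id are illustrative 
assumptions, not part of any Spark API.

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.functions.{col, lit}

// df is assumed to already hold NaN where the source data had nulls.
val discretizer = new QuantileDiscretizer()
  .setInputCol("age")
  .setOutputCol("ageBucket")
  .setNumBuckets(10)

// Fit only on the real values so NaN cannot leak into the computed splits.
val model = discretizer.fit(df.filter(!col("age").isNaN))

// Bucket the clean rows, then send the NaN rows to their own bucket (-1.0),
// which is roughly the "put them in a special bucket" strategy.
val bucketed = model.transform(df.filter(!col("age").isNaN))
  .unionAll(df.filter(col("age").isNaN).withColumn("ageBucket", lit(-1.0)))
{code}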
 

> QuantileDiscretizer does strange things with NaN values
> -------------------------------------------------------
>
>                 Key: SPARK-17219
>                 URL: https://issues.apache.org/jira/browse/SPARK-17219
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.2
>            Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is treated as larger than any 
> positive number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either, because a bucket that is [NaN, -Inf] doesn't 
> make much sense. I'm not sure whether the NaN bucket counts toward numBins or 
> not. I do think it should always be there, though, in case future data has 
> nulls even though the fit data did not. Thoughts?
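
For reference, a minimal sketch that reproduces the behavior described above, 
assuming a DataFrame `titanic` whose "age" column is a nullable DoubleType (the 
column and bucket names are illustrative):

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Replace nulls with NaN, since the discretizer rejects actual nulls.
val withNaN = titanic.withColumn("age", coalesce(col("age"), lit(Double.NaN)))

val model = new QuantileDiscretizer()
  .setInputCol("age")
  .setOutputCol("ageBucket")
  .setNumBuckets(10)
  .fit(withNaN)

// The fitted model is a Bucketizer; inspecting its splits shows the stray NaN
// entries reported above, e.g. -Infinity, 15.0, ..., 48.0, NaN, NaN, Infinity.
println(model.getSplits.mkString(", "))
{code}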



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

