[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15555968#comment-15555968 ]
Barry Becker commented on SPARK-17219:
--------------------------------------

I'll make another attempt to clarify my use case. Nulls are different from normal values, and trying to impute them changes the data and its interpretation, possibly in a misleading way. Throwing out records with null values (or raising an error on them) is worse, because it discards a lot of potentially useful information.

Suppose you have survey results or exam data. If you impute the answer a student should have given on the exam before you do your ML, the results will make it look as if all students answered all questions, when in fact many answers may have been left blank. The situation is similar for survey data: the fact that responses were left blank is itself important. You don't want to discard that information or replace it with an actual value when the answer was left blank. I wish Spark would handle nulls as first-class entities throughout MLlib.

> QuantileDiscretizer does strange things with NaN values
> -------------------------------------------------------
>
>                 Key: SPARK-17219
>                 URL: https://issues.apache.org/jira/browse/SPARK-17219
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.2
>            Reporter: Barry Becker
>            Assignee: Vincent
>             Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that
> contains NaNs, it will create (possibly more than one) NaN split(s) before
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column
> using the QuantileDiscretizer with 10 bins specified. The age column has a
> lot of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket
> all their own. My suggestion would be to include an initial NaN split value
> that is always there, just like the sentinel Infinities are. If that were the
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either, because a bucket that is [NaN, -Inf] doesn't
> make much sense. I'm not sure whether the NaN bucket should count toward
> numBins. I do think it should always be there, though, in case future data
> has nulls even though the fit data did not. Thoughts?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
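The dedicated-NaN-bucket idea from the comment can be sketched without putting NaN into the split list at all: keep the splits purely numeric and route NaN values to one extra bucket past the last regular one (this is roughly the shape of the `handleInvalid="keep"` behavior that later Spark releases added). Below is a minimal pure-Python sketch, not Spark's actual implementation; the `bucketize` function and the example split values are illustrative:

```python
import bisect
import math

def bucketize(values, splits):
    """Assign each value a bucket index given sorted split points.

    `splits` is expected to start with -inf and end with +inf, like the
    sentinel values QuantileDiscretizer produces. A splits list of length
    n defines n - 1 regular buckets (indices 0 .. n-2). NaN values are
    never compared against the splits (comparisons with NaN are
    unreliable); they go into a dedicated extra bucket instead.
    """
    nan_bucket = len(splits) - 1  # one past the last regular bucket
    out = []
    for v in values:
        if math.isnan(v):
            out.append(nan_bucket)
        else:
            # bisect_right finds the first split strictly greater than v;
            # subtracting 1 gives the bucket whose half-open range [a, b)
            # contains v.
            out.append(bisect.bisect_right(splits, v) - 1)
    return out

# Splits similar to the ones in the report, minus the bogus NaN entries.
splits = [float("-inf"), 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, float("inf")]
print(bucketize([10.0, 22.0, float("nan"), 50.0], splits))  # → [0, 2, 8, 7]
```

With this layout the NaN bucket does not count toward the requested number of bins, and it is always present, so data with nulls at transform time still buckets cleanly even if the fit data had none.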