[ 
https://issues.apache.org/jira/browse/SPARK-13639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179447#comment-15179447
 ] 

Nick Pentreath commented on SPARK-13639:
----------------------------------------

For SPARK-13568, we can take one of two approaches:

1. Support imputing numerical DF columns, as well as imputing the individual 
elements (columns) of a vector-typed DF column;
2. Only support imputing numerical DF columns.

For #1, we would need {{Statistics.colStats}} to support ignoring NaN as an 
option (agreed, it should definitely not be the default behaviour). 
Alternatively, we could support it only at a lower level (perhaps within 
{{MultivariateOnlineSummarizer}}).
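
To make the "ignore NaN" option concrete, here is a minimal sketch in plain Scala of a per-column mean that skips NaN entries. It is not Spark code: {{NanAwareStats}} and {{columnMeans}} are hypothetical names, and the sketch works on local arrays rather than an RDD. It only shows the counting logic that such an option would need.

```scala
// Hypothetical helper, not part of Spark: per-column mean that skips NaN entries.
object NanAwareStats {
  /** Returns the mean of each column, counting only non-NaN values.
    * A column with no valid values yields NaN. */
  def columnMeans(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.head.length
    val sums = new Array[Double](n)   // running sum of valid values per column
    val counts = new Array[Long](n)   // number of valid values per column
    for (row <- rows) {
      require(row.length == n, "all rows must have the same length")
      var j = 0
      while (j < n) {
        val v = row(j)
        if (!v.isNaN) { sums(j) += v; counts(j) += 1 }
        j += 1
      }
    }
    Array.tabulate(n)(j => if (counts(j) == 0) Double.NaN else sums(j) / counts(j))
  }
}
```

Note that each column needs its own count, since the number of valid values can differ between columns once NaN entries are skipped.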

scikit-learn's Imputer works on NaN vector elements, but since we are 
working with DataFrames, my initial idea was actually more along the lines 
of #2. The {{Imputer}} would tend to be among the early steps in a pipeline, 
before the relevant numerical columns are assembled into a vector.

So #1 is not an absolute requirement IMO, though obviously it would be more 
efficient to compute all the col stats for a set of columns together, and I do 
think it makes sense to support vector input types in {{Imputer}} if possible.
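
As a rough illustration of approach #2, the sketch below does mean imputation per named numeric column in plain Scala. Everything in it is an assumption made for illustration: {{MeanImputerSketch}}, {{fit}} and {{transform}} are hypothetical names, and a {{Map}} of local arrays stands in for DataFrame columns. It is not a proposed {{Imputer}} API.

```scala
// Hypothetical sketch, not Spark's Imputer: mean imputation per named numeric column.
object MeanImputerSketch {
  /** "Fit": compute the mean of each column over its non-NaN values. */
  def fit(columns: Map[String, Array[Double]]): Map[String, Double] =
    columns.map { case (name, values) =>
      val valid = values.filterNot(_.isNaN)
      name -> (if (valid.isEmpty) Double.NaN else valid.sum / valid.length)
    }

  /** "Transform": replace each NaN with the fitted mean of its column. */
  def transform(columns: Map[String, Array[Double]],
                means: Map[String, Double]): Map[String, Array[Double]] =
    columns.map { case (name, values) =>
      val fill = means(name)
      name -> values.map(v => if (v.isNaN) fill else v)
    }
}
```

Here each column is scanned separately in {{fit}}, which is the inefficiency mentioned above: computing the statistics for all columns in a single pass would need a NaN-aware summarizer.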

Open to ideas on SPARK-13568 also.

> Statistics.colStats(rdd).mean and variance should handle NaN in the input 
> vectors
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-13639
>                 URL: https://issues.apache.org/jira/browse/SPARK-13639
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: yuhao yang
>            Priority: Trivial
>
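>     import org.apache.spark.mllib.linalg.Vectors
>     import org.apache.spark.mllib.stat.Statistics
>
>     // Three rows; the first element of the last row is NaN.
>     val denseData = Array(
>       Vectors.dense(3.8, 0.0, 1.8),
>       Vectors.dense(1.7, 0.9, 0.0),
>       Vectors.dense(Double.NaN, 0.0, 0.0)
>     )
>     val rdd = sc.parallelize(denseData)  // sc: an existing SparkContext
>     println(Statistics.colStats(rdd).mean)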
> [NaN,0.3,0.6]
> This is just a proposal for discussion on how to handle NaN values in the 
> input vectors. We can either ignore NaN values in the computation, or keep 
> the current behaviour and output NaN, which acts as a warning.


