[ https://issues.apache.org/jira/browse/SPARK-13639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179447#comment-15179447 ]
Nick Pentreath commented on SPARK-13639:
----------------------------------------

For SPARK-13568, we can take one of two approaches:

1. Support imputing numerical DF columns, as well as imputing columns within a vector (itself a vector DF column);
2. Only support imputing numerical DF columns.

For #1, we then need {{Statistics.colStats}} to support ignoring NaN as an option (agree it should definitely not be default behaviour). Potentially we could only support it at a lower level (perhaps within {{MultivariateOnlineSummarizer}}).

For scikit-learn's Imputer, obviously it works on NaN vector elements, but since we are working with DataFrames, my initial idea was actually more along the lines of #2. The {{Imputer}} would tend to be among the early steps in a pipeline, before the relevant numerical columns were transformed into a vector.

So #1 is not an absolute requirement IMO, though obviously it would be more efficient to compute all the col stats for a set of columns together, and I do think it makes sense to support vector input types in {{Imputer}} if possible.

Open to ideas on SPARK-13568 also.

> Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-13639
>                 URL: https://issues.apache.org/jira/browse/SPARK-13639
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: yuhao yang
>            Priority: Trivial
>
>     val denseData = Array(
>       Vectors.dense(3.8, 0.0, 1.8),
>       Vectors.dense(1.7, 0.9, 0.0),
>       Vectors.dense(Double.NaN, 0, 0.0)
>     )
>     val rdd = sc.parallelize(denseData)
>     println(Statistics.colStats(rdd).mean)
> [NaN,0.3,0.6]
>
> This is just a proposal for discussion on how to handle the NaN value in the
> vectors. We can ignore the NaN value in the computation or just output NaN as
> it is now as a warning.
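
To make the "ignore NaN as an option" idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. The object {{NaNAwareColStats}} and the method {{colMeansIgnoringNaN}} are hypothetical names used only for illustration; they are not part of {{Statistics.colStats}} or {{MultivariateOnlineSummarizer}}, and an actual implementation inside the summarizer would likely look different. The sketch assumes each column keeps its own count of valid entries, so that a NaN in one column does not affect the others.

```scala
// Hypothetical helper, not part of Spark: per-column mean that skips NaN entries.
object NaNAwareColStats {

  /** Mean of each column computed over its non-NaN entries only.
    * A column with no valid entries yields NaN. */
  def colMeansIgnoringNaN(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.head.length
    val sums = new Array[Double](n)   // running sum of valid entries per column
    val counts = new Array[Long](n)   // number of valid (non-NaN) entries per column
    for (row <- rows) {
      require(row.length == n, "all rows must have the same length")
      var j = 0
      while (j < n) {
        if (!row(j).isNaN) {
          sums(j) += row(j)
          counts(j) += 1
        }
        j += 1
      }
    }
    Array.tabulate(n)(j => if (counts(j) == 0L) Double.NaN else sums(j) / counts(j))
  }
}
```

Applied to the three rows from the issue description, the first column would be averaged over its two valid entries (3.8 and 1.7) rather than coming out as NaN, while the other two columns stay at roughly 0.3 and 0.6. Tracking a count per column is the main design point: a single shared row count, as used when every entry is assumed valid, would no longer give correct means once some entries are skipped.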