[jira] [Commented] (SPARK-13568) Create feature transformer to impute missing values

2016-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186714#comment-15186714
 ] 

Apache Spark commented on SPARK-13568:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/11601

> Create feature transformer to impute missing values
> ---
>
> Key: SPARK-13568
> URL: https://issues.apache.org/jira/browse/SPARK-13568
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Nick Pentreath
> Priority: Minor
>
> It is quite common to encounter missing values in data sets. It would be 
> useful to implement a {{Transformer}} that can impute missing data points, 
> similar to e.g. {{Imputer}} in 
> [scikit-learn|http://scikit-learn.org/dev/modules/preprocessing.html#imputation-of-missing-values].
> Initially, options for imputation could include {{mean}}, {{median}} and 
> {{most frequent}}, but we could add various other approaches. Where possible 
> existing DataFrame code can be used (e.g. for approximate quantiles etc).
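
To make the three proposed strategies concrete, here is a minimal sketch in plain Python rather than Spark. It is not the transformer proposed in this issue: the names is_missing, fit_surrogate and impute are hypothetical, and treating both None and NaN as missing is an assumption made for illustration.

```python
import math
from collections import Counter
from statistics import mean, median


def is_missing(x):
    """Assumption for this sketch: None and NaN both count as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def fit_surrogate(values, strategy="mean"):
    """Compute the replacement value from the observed entries only."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values")
    if strategy == "mean":
        return mean(observed)
    if strategy == "median":
        return median(observed)
    if strategy == "most_frequent":
        # Ties resolve to the value seen first.
        return Counter(observed).most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy}")


def impute(values, strategy="mean"):
    """Replace every missing entry with the fitted surrogate."""
    surrogate = fit_surrogate(values, strategy)
    return [surrogate if is_missing(v) else v for v in values]


column = [1.0, float("nan"), 2.0, None, 2.0, 7.0]
imputed_mean = impute(column, "mean")
imputed_median = impute(column, "median")
imputed_mode = impute(column, "most_frequent")
```

Splitting the work into a fit step (compute the surrogate) and a transform step (apply it) mirrors how an estimator and its fitted model are usually separated, although the issue text itself only asks for a Transformer.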



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13568) Create feature transformer to impute missing values

2016-03-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177368#comment-15177368
 ] 

Nick Pentreath commented on SPARK-13568:


OK, the Imputer will need to compute column stats that ignore NaNs, so 
SPARK-13639 should add that (whether as the default behaviour or as an 
optional argument).
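
As a rough illustration of why this matters, the plain-Python sketch below computes a column count and mean with an optional flag for skipping NaNs. The function column_stats and its ignore_nan parameter are hypothetical names for this example only; they are not the API discussed in SPARK-13639.

```python
import math


def column_stats(values, ignore_nan=True):
    """Single-pass count and mean over a column of floats.

    With ignore_nan=False a single NaN poisons the mean, which is why an
    imputer needs statistics that skip NaNs.
    """
    count = 0
    total = 0.0
    for v in values:
        if ignore_nan and math.isnan(v):
            continue  # skip missing entries entirely
        count += 1
        total += v
    mean = total / count if count else math.nan
    return {"count": count, "mean": mean}


column = [1.0, float("nan"), 3.0]
skipping = column_stats(column)                     # NaN-aware
propagating = column_stats(column, ignore_nan=False)  # NaN propagates
```

Here the NaN-aware call sees two observed values, while the propagating call counts all three entries and returns a NaN mean.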




[jira] [Commented] (SPARK-13568) Create feature transformer to impute missing values

2016-02-29 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15172423#comment-15172423
 ] 

yuhao yang commented on SPARK-13568:


Yes, I'm working on supporting numeric values too. 

And I agree that imputation for a vector column should check the elements in 
the vector. I intend to support the 3 use cases you mentioned.

I'll send a PR today or tomorrow after some refinement and performance 
benchmarking. Thanks




[jira] [Commented] (SPARK-13568) Create feature transformer to impute missing values

2016-02-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15172328#comment-15172328
 ] 

Nick Pentreath commented on SPARK-13568:


Sure, go ahead. However, taking a quick look at your branch, I think the 
approach needs a bit of discussion.

I think the Imputer should handle numeric and/or vector columns. For a vector 
column, the idea is not to impute an entire vector when it is null, but rather 
to impute the missing (null / NaN) values that may be present in each vector.

I guess if a vector column itself has missing values (i.e. entire vector is 
null), then the result would look something like what you have done.

I tend to think that usage within a pipeline is more likely to be imputing 
missing values from a set of numeric columns, before applying further 
transformations into feature vectors. However, we can potentially support all 
three use cases. 
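
The two vector cases can be sketched in plain Python as follows. This is only an illustration under stated assumptions: impute_vectors is a hypothetical name, vectors are modelled as equal-length lists of floats, None stands for a vector that is missing entirely, and the per-dimension mean is used as the surrogate.

```python
import math


def impute_vectors(rows):
    """Impute NaN elements per dimension; fill a missing vector with the means."""
    present = [r for r in rows if r is not None]
    if not present:
        raise ValueError("no observed vectors")
    dim = len(present[0])

    # Per-dimension mean over observed (non-NaN) elements.
    means = []
    for j in range(dim):
        observed = [r[j] for r in present if not math.isnan(r[j])]
        means.append(sum(observed) / len(observed) if observed else math.nan)

    result = []
    for r in rows:
        if r is None:
            # Entire vector is null: substitute the vector of means.
            result.append(list(means))
        else:
            # Vector is present: replace only its NaN elements.
            result.append([means[j] if math.isnan(x) else x
                           for j, x in enumerate(r)])
    return result


nan = float("nan")
rows = [[1.0, nan], [3.0, 4.0], None, [nan, 8.0]]
imputed = impute_vectors(rows)
```

The remaining case, a set of plain numeric columns, reduces to applying the same per-column statistic to each column independently.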




[jira] [Commented] (SPARK-13568) Create feature transformer to impute missing values

2016-02-29 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15172216#comment-15172216
 ] 

yuhao yang commented on SPARK-13568:


Hi Nick, can I work on this, since I kind of already have... 
I have an implementation at 
https://github.com/hhbyyh/spark/blob/imputer/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala
