[ 
https://issues.apache.org/jira/browse/SPARK-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415600#comment-16415600
 ] 

Barry Becker commented on SPARK-6162:
-------------------------------------

If we all agree that this is something that would be very nice to have, why is it 
closed as Won't Fix instead of just being deferred to a future release?

This seems like a big limitation of tree models in Spark.

> Handle missing values in GBM
> ----------------------------
>
>                 Key: SPARK-6162
>                 URL: https://issues.apache.org/jira/browse/SPARK-6162
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.1
>            Reporter: Devesh Parekh
>            Priority: Major
>
> We build a lot of predictive models over data combined from multiple sources, 
> where some entries may not have all sources of data and so some values are 
> missing in each feature vector. Another place this might come up is if you 
> have features from slightly heterogeneous items (or items composed of 
> heterogeneous subcomponents) that share many features in common but may have 
> extra features for different types, and you don't want to manually train 
> models for every different type.
> R's GBM library, which is what we are currently using, deals with this type 
> of data nicely by making "missing" nodes in the decision tree (a surrogate 
> split) for features that can have missing values. We'd like to do the same 
> with MLlib, but LabeledPoint would need to support missing values, and 
> GradientBoostedTrees would need to be modified to deal with them.



