Devesh Parekh created SPARK-6162:
------------------------------------

             Summary: Handle missing values in GBM
                 Key: SPARK-6162
                 URL: https://issues.apache.org/jira/browse/SPARK-6162
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.2.1
            Reporter: Devesh Parekh


We build a lot of predictive models over data combined from multiple sources, 
where some entries may not have all sources of data and so some values are 
missing in each feature vector. Another place this might come up is if you have 
features from slightly heterogeneous items (or items composed of heterogeneous 
subcomponents) that share many features in common but may have extra features 
for different types, and you don't want to manually train models for every 
different type.

R's GBM library, which is what we are currently using, deals with this type of 
data nicely by making "missing" nodes in the decision tree (a separate branch) 
for features that can have missing values. We'd like to do the same with MLlib, 
but LabeledPoint would need to support missing values, and GradientBoostedTrees 
would need to be modified to deal with them.
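
To make the request concrete, here is a minimal sketch of a single regression 
stump with a dedicated "missing" branch. It is plain Python, not MLlib or gbm 
code: the names Stump and fit_stump are hypothetical, and encoding a missing 
value as NaN is an assumption made only for this illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

NaN = float("nan")  # assumption: a missing feature value is encoded as NaN


@dataclass
class Stump:
    """One split with three children: left, right, and missing (hypothetical)."""
    feature: int
    threshold: float
    left: float     # prediction when x[feature] < threshold
    right: float    # prediction when x[feature] >= threshold
    missing: float  # prediction when x[feature] is NaN

    def predict(self, x: Sequence[float]) -> float:
        v = x[self.feature]
        if math.isnan(v):
            return self.missing
        return self.left if v < self.threshold else self.right


def _mean(vals: List[float], default: float) -> float:
    return sum(vals) / len(vals) if vals else default


def fit_stump(X: Sequence[Sequence[float]], y: Sequence[float], feature: int) -> Stump:
    """Pick the threshold minimizing squared error over rows where the
    feature is present; rows where it is missing get their own mean."""
    overall = _mean(list(y), 0.0)
    present = [(r[feature], t) for r, t in zip(X, y) if not math.isnan(r[feature])]
    absent = [t for r, t in zip(X, y) if math.isnan(r[feature])]
    missing_pred = _mean(absent, overall)

    best = None
    values = sorted({v for v, _ in present})
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0  # candidate threshold between adjacent values
        left = [t for v, t in present if v < thr]
        right = [t for v, t in present if v >= thr]
        lm, rm = _mean(left, overall), _mean(right, overall)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, Stump(feature, thr, lm, rm, missing_pred))

    if best is None:
        # Fewer than two distinct observed values: no useful split exists.
        m = _mean([t for _, t in present], overall)
        return Stump(feature, 0.0, m, m, missing_pred)
    return best[1]


X = [[1.0], [2.0], [10.0], [11.0], [NaN], [NaN]]
y = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
stump = fit_stump(X, y, feature=0)
print(stump.predict([3.0]), stump.predict([20.0]), stump.predict([NaN]))
```

In this toy data the rows with a missing feature have a very different target, 
so routing them to their own child predicts them far better than forcing them 
left or right. A boosted ensemble would apply the same routing rule at every 
split of every tree.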


