[ https://issues.apache.org/jira/browse/SPARK-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182434#comment-15182434 ]
Joseph K. Bradley commented on SPARK-6162:
------------------------------------------

I agree this will be nice to add someday, but it's less pressing than other tasks for now.

> Handle missing values in GBM
> ----------------------------
>
> Key: SPARK-6162
> URL: https://issues.apache.org/jira/browse/SPARK-6162
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.2.1
> Reporter: Devesh Parekh
>
> We build a lot of predictive models over data combined from multiple sources,
> where some entries may not have all sources of data and so some values are
> missing in each feature vector. Another place this might come up is if you
> have features from slightly heterogeneous items (or items composed of
> heterogeneous subcomponents) that share many features in common but may have
> extra features for different types, and you don't want to manually train
> models for every different type.
>
> R's GBM library, which is what we are currently using, deals with this type
> of data nicely by making "missing" nodes in the decision tree (a surrogate
> split) for features that can have missing values. We'd like to do the same
> with MLLib, but LabeledPoint would need to support missing values, and
> GradientBoostedTrees would need to be modified to deal with them.
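
For readers unfamiliar with the "missing" node idea in the issue description, here is a minimal sketch of a tree split with a dedicated branch for missing values. It is illustrative only and is not MLlib's or R gbm's implementation: the `Node` class, the `predict` function, and the choice of NaN to encode a missing value are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Hypothetical tree node with three children: left, right, and 'missing'."""
    feature: Optional[int] = None      # index of the split feature; None marks a leaf
    threshold: float = 0.0             # go left when the value is below this
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    missing: Optional["Node"] = None   # taken when the feature value is NaN
    prediction: float = 0.0            # used only at leaves


def predict(node: Node, features: Sequence[float]) -> float:
    """Route one feature vector down the tree; NaN is assumed to mean 'missing'."""
    while node.feature is not None:
        value = features[node.feature]
        if math.isnan(value):
            node = node.missing        # dedicated branch for missing values
        elif value < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


# A one-split tree on feature 0 with its own leaf for missing values.
stump = Node(
    feature=0,
    threshold=5.0,
    left=Node(prediction=-1.0),
    right=Node(prediction=1.0),
    missing=Node(prediction=0.25),
)
```

Calling `predict(stump, [float("nan")])` follows the missing branch instead of forcing the row left or right, so rows that lack a feature get their own fitted value at that split.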