[ https://issues.apache.org/jira/browse/SPARK-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415600#comment-16415600 ]
Barry Becker commented on SPARK-6162:
-------------------------------------

If we all agree that this is something that would be very nice to have, why is it closed as Won't Fix instead of just being deferred to a future release? This seems like a big limitation of tree models in Spark.

> Handle missing values in GBM
> ----------------------------
>
>                 Key: SPARK-6162
>                 URL: https://issues.apache.org/jira/browse/SPARK-6162
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.1
>            Reporter: Devesh Parekh
>            Priority: Major
>
> We build a lot of predictive models over data combined from multiple sources,
> where some entries may not have all sources of data and so some values are
> missing in each feature vector. Another place this might come up is if you
> have features from slightly heterogeneous items (or items composed of
> heterogeneous subcomponents) that share many features in common but may have
> extra features for different types, and you don't want to manually train
> models for every different type.
> R's GBM library, which is what we are currently using, deals with this type
> of data nicely by making "missing" nodes in the decision tree (a surrogate
> split) for features that can have missing values. We'd like to do the same
> with MLLib, but LabeledPoint would need to support missing values, and
> GradientBoostedTrees would need to be modified to deal with them.