[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796748#comment-15796748 ]
Joseph K. Bradley commented on SPARK-13677:
-------------------------------------------

[~podongfeng] Apologies for the inaction on this, but I agree with you about redoing this for the DataFrame-based API. Could you please propose an API here before implementing it? Thanks!

> Support Tree-Based Feature Transformation for ML
> ------------------------------------------------
>
>                 Key: SPARK-13677
>                 URL: https://issues.apache.org/jira/browse/SPARK-13677
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: zhengruifeng
>            Priority: Minor
>
> It would be nice to be able to use RF and GBT for feature transformation:
> First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on
> the training set. Then each leaf of each tree in the ensemble is assigned a
> fixed arbitrary feature index in a new feature space. These leaf indices are
> then encoded in a one-hot fashion.
> This method was first introduced by
> Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is
> implemented in two well-known libraries:
> sklearn
> (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
> xgboost
> (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)
> I have implemented it in MLlib:
> val features : RDD[Vector] = ...
> val model1 : RandomForestModel = ...
> val transformed1 : RDD[Vector] = model1.leaf(features)
> val model2 : GradientBoostedTreesModel = ...
> val transformed2 : RDD[Vector] = model2.leaf(features)
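
For readers unfamiliar with the technique, the transformation described in the issue (route each sample to the leaf it reaches in every tree, then one-hot encode those leaf indices into one concatenated vector) can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: the `Node`, `Leaf`, `Split`, and `LeafEncoding` names are hypothetical and are not part of MLlib or of the proposed patch, the trees are hand-built rather than fitted, and leaf ids are assumed to be numbered 0 until (number of leaves) within each tree.

```scala
// Hypothetical toy types for illustration only; these are NOT Spark MLlib classes.
sealed trait Node
// `id` is the leaf's index within its own tree, assumed to run from 0 to numLeaves - 1.
final case class Leaf(id: Int) extends Node
final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

object LeafEncoding {
  // Count the leaves of a tree; this is the width of that tree's one-hot block.
  def numLeaves(node: Node): Int = node match {
    case Leaf(_)           => 1
    case Split(_, _, l, r) => numLeaves(l) + numLeaves(r)
  }

  // Route a feature vector down the tree and return the id of the leaf it reaches.
  @annotation.tailrec
  def leafId(node: Node, x: Array[Double]): Int = node match {
    case Leaf(id)          => id
    case Split(f, t, l, r) => if (x(f) <= t) leafId(l, x) else leafId(r, x)
  }

  // One-hot encode the reached leaf of every tree, concatenating one block per tree.
  def transform(trees: Seq[Node], x: Array[Double]): Array[Double] = {
    // offsets(i) is where tree i's block starts; offsets.last is the total width.
    val offsets = trees.map(numLeaves).scanLeft(0)(_ + _)
    val out = Array.fill(offsets.last)(0.0)
    trees.zip(offsets).foreach { case (tree, offset) =>
      out(offset + leafId(tree, x)) = 1.0
    }
    out
  }

  def main(args: Array[String]): Unit = {
    // Two hand-built trees with 2 and 3 leaves respectively.
    val tree1 = Split(0, 0.5, Leaf(0), Leaf(1))
    val tree2 = Split(1, 2.0, Leaf(0), Split(0, 1.5, Leaf(1), Leaf(2)))
    val encoded = transform(Seq(tree1, tree2), Array(0.3, 5.0))
    println(encoded.mkString("[", ", ", "]")) // prints [1.0, 0.0, 0.0, 1.0, 0.0]
  }
}
```

In the example, the sample reaches leaf 0 of the first tree and leaf 1 of the second, so positions 0 and 3 of the five-dimensional output are set. A real implementation would presumably return a sparse vector, since exactly one entry per tree is non-zero.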