zhengruifeng created SPARK-13677:
------------------------------------

             Summary: Support Tree-Based Feature Transformation for mllib
                 Key: SPARK-13677
                 URL: https://issues.apache.org/jira/browse/SPARK-13677
             Project: Spark
          Issue Type: New Feature
            Reporter: zhengruifeng
            Priority: Minor


It would be nice to be able to use RF and GBT for feature transformation:
First, fit an ensemble of trees (RF, GBT, or another TreeEnsembleModel) on 
the training set. Each leaf of each tree in the ensemble is then assigned a 
fixed, arbitrary feature index in a new feature space. For each sample, the 
indices of the leaves it falls into (one per tree) are then encoded in a 
one-hot fashion.
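
The encoding described above can be sketched in plain Scala without any Spark 
dependency. This is only an illustration, not the proposed mllib 
implementation: the Node, Leaf, Split and LeafEncoding names are hypothetical, 
and the sketch assumes that the leaves of each tree are numbered 
0 until numLeaves.

```scala
// Illustrative sketch only; all types and names here are hypothetical,
// not part of Spark's mllib API.
sealed trait Node
// Assumption: id is the leaf's position within its own tree, 0 until numLeaves.
final case class Leaf(id: Int) extends Node
final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

object LeafEncoding {
  // Count the leaves of a tree.
  def numLeaves(node: Node): Int = node match {
    case Leaf(_)           => 1
    case Split(_, _, l, r) => numLeaves(l) + numLeaves(r)
  }

  // Walk from the root to the leaf that the sample x falls into.
  @annotation.tailrec
  def leafId(node: Node, x: Array[Double]): Int = node match {
    case Leaf(id)           => id
    case Split(f, t, l, r)  => if (x(f) <= t) leafId(l, x) else leafId(r, x)
  }

  // One block of the output per tree; exactly one entry per block is set to 1.0.
  def transform(trees: Seq[Node], x: Array[Double]): Array[Double] = {
    // offsets(i) is the start of tree i's block; offsets.last is the total width.
    val offsets = trees.map(numLeaves).scanLeft(0)(_ + _)
    val out = Array.fill(offsets.last)(0.0)
    trees.zip(offsets).foreach { case (tree, offset) =>
      out(offset + leafId(tree, x)) = 1.0
    }
    out
  }
}

object Demo extends App {
  val t1 = Split(0, 0.5, Leaf(0), Leaf(1))                          // 2 leaves
  val t2 = Split(1, 2.0, Leaf(0), Split(0, 1.5, Leaf(1), Leaf(2)))  // 3 leaves
  // The sample lands in leaf 0 of t1 and leaf 1 of t2.
  println(LeafEncoding.transform(Seq(t1, t2), Array(0.3, 3.0)).mkString(", "))
  // prints: 1.0, 0.0, 0.0, 1.0, 0.0
}
```

The transformed vector has one dimension per leaf in the whole ensemble, so 
in practice a sparse vector would be the natural output type.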

This method was first introduced by 
Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is 
implemented in two well-known libraries:
sklearn 
(http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
xgboost 
(https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)

I have implemented it in mllib:

val features: RDD[Vector] = ...
val model1: RandomForestModel = ...
val transformed1: RDD[Vector] = model1.leaf(features)

val model2: GradientBoostedTreesModel = ...
val transformed2: RDD[Vector] = model2.leaf(features)




