[
https://issues.apache.org/jira/browse/MADLIB-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506613#comment-16506613
]
Rahul Iyer commented on MADLIB-1205:
------------------------------------
The work for this is done, but I'm facing an issue with testing.
We compare results with rpart in R, and the variable importance depends on the
actual tree trained. There is enough stochasticity in the tree training
procedure to produce different trees, which results in differing importance
values.
It's not obvious to me how to compare these importance values. Qualitatively
they look similar (e.g. the ordering of features is the same), but the actual
values differ significantly in some cases while being pretty close in others.
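
One possible way to make the comparison less sensitive to the exact tree is to
check the ordering of features instead of the raw values, e.g. with a Spearman
rank correlation. The sketch below is only an illustration of that idea, not
the test that is in MADlib; the function names and the importance values are
hypothetical, and it assumes both models report the same set of features.

```python
import math


def ranks(importances):
    """Rank features from most important (1) to least; ties share the average rank."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of features tied with ordered[i].
        while j + 1 < len(ordered) and importances[ordered[j + 1]] == importances[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[ordered[k]] = average_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    if set(a) != set(b):
        raise ValueError("both models must report the same features")
    features = sorted(a)
    ranks_a, ranks_b = ranks(a), ranks(b)
    xs = [ranks_a[f] for f in features]
    ys = [ranks_b[f] for f in features]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # all ranks tied; correlation is undefined
    return cov / math.sqrt(var_x * var_y)


# Hypothetical importance values: same ordering, different magnitudes.
madlib_importance = {"x1": 10.0, "x2": 5.0, "x3": 1.0}
rpart_importance = {"x1": 40.0, "x2": 35.0, "x3": 2.0}
correlation = spearman(madlib_importance, rpart_importance)
```

A test could then assert that the correlation is above some threshold rather
than requiring the values themselves to match; picking that threshold is still
a judgment call.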
> Add gini importance to DT and RF
> --------------------------------
>
> Key: MADLIB-1205
> URL: https://issues.apache.org/jira/browse/MADLIB-1205
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Decision Tree, Module: Random Forest
> Reporter: Rahul Iyer
> Assignee: Rahul Iyer
> Priority: Major
> Fix For: v1.15
>
>
> From the Breiman resource that we use for random forest:
> {quote}Gini importance
> {quote}
> {quote}Every time a split of a node is made on variable m the gini impurity
> criterion for the two descendent nodes is less than the parent node. Adding
> up the gini decreases for each individual variable over all trees in the
> forest gives a fast variable importance that is often very consistent with
> the permutation importance measure.
> {quote}
> We can add a similar measure to our DT and RF code and distinguish it from
> our permuted importance metric by calling the current metric
> {{oob_variable_importance}} and the new metric
> {{impurity_variable_importance}}.
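
The gini importance described in the quoted text can be sketched as follows
for a single tree. This is a minimal illustration, not the MADlib
implementation: the function names and the split representation are
hypothetical, and weighting each decrease by the number of rows in the node is
an assumption (the quote only says the decreases are added up).

```python
def gini(class_counts):
    """Gini impurity of a node, given the number of rows in each class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def gini_decrease(parent, left, right):
    """Impurity decrease of one split, weighted by node size (an assumption)."""
    return (sum(parent) * gini(parent)
            - sum(left) * gini(left)
            - sum(right) * gini(right))


def impurity_importance(splits):
    """Sum the gini decreases per split variable.

    `splits` is a hypothetical list of (variable, parent_counts, left_counts,
    right_counts) tuples, one per internal node of the tree.
    """
    importance = {}
    for variable, parent, left, right in splits:
        importance[variable] = importance.get(variable, 0.0) + gini_decrease(parent, left, right)
    return importance


# Hypothetical two-split tree on a two-class problem with 10 rows.
example_splits = [
    ("x1", [5, 5], [4, 1], [1, 4]),
    ("x2", [4, 1], [4, 0], [0, 1]),
]
example_importance = impurity_importance(example_splits)
```

For a forest, the same per-variable sums would be accumulated over all trees.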