[ 
https://issues.apache.org/jira/browse/IGNITE-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandr Shapkin updated IGNITE-20139:
--------------------------------------
    Attachment: random-forest.zip
                TreeSample2_Portfolio_Change.png

> RandomForestClassifierTrainer is checking the same conditions
> -------------------------------------------------------------
>
>                 Key: IGNITE-20139
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20139
>             Project: Ignite
>          Issue Type: Bug
>          Components: ml
>    Affects Versions: 2.15
>            Reporter: Alexandr Shapkin
>            Priority: Major
>         Attachments: TreeSample2_Portfolio_Change.png, random-forest.zip
>
>
> We tried to use GridGain's machine learning capabilities and discovered a 
> bug in its Random Forest implementation. When comparing GridGain's output 
> with a Python prototype (scikit-learn), we noticed that GridGain's 
> predictions have much lower accuracy despite using the same data set and 
> model parameters. 
> Further investigation showed that GridGain generates decision trees that 
> effectively "loop": a branch keeps checking the same condition over and over 
> until it reaches the maximum tree depth.
> I've attached a standalone reproducer that uses a small excerpt of our data 
> set. 
> It loads the data from a CSV file, then trains a model with just 1 tree. 
> The reproducer then finds one of the looping branches and prints it. You 
> will see that every node in the branch uses the same feature and value and 
> has the same calculated impurity. 
> On my machine the code reproduces the issue 100% of the time.
> I've also attached an example of a tree generated by Python's scikit-learn 
> on the same data set with the same parameters. In Python the tree usually 
> doesn't get deeper than 20 nodes.
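
For readers without the attachment, the "looping branch" check described above can be sketched roughly as follows. This is not the attached reproducer and does not use Ignite's or GridGain's classes: `Node`, `find_looping_branch`, and `min_repeats` are hypothetical names chosen for illustration, and the sketch assumes a looping branch shows up as consecutive nodes with exactly equal feature, value, and impurity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical stand-in for a decision tree node (not an Ignite class).
    feature: Optional[int] = None  # None marks a leaf
    threshold: float = 0.0
    impurity: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def split_key(node: Node):
    # The three values the report says are identical along a looping branch.
    return (node.feature, node.threshold, node.impurity)


def find_looping_branch(root: Node, min_repeats: int = 3) -> Optional[List[Node]]:
    """Return the first run of `min_repeats` consecutive nodes on one
    root-to-leaf path that share the same split, or None if there is none."""
    stack = [(root, [])]  # iterative DFS, so very deep trees cannot overflow
    while stack:
        node, path = stack.pop()
        if node is None or node.feature is None:
            continue  # missing child or leaf: nothing to compare
        path = path + [node]
        tail = path[-min_repeats:]
        if len(tail) == min_repeats and all(
            split_key(n) == split_key(node) for n in tail
        ):
            return tail
        stack.append((node.right, path))
        stack.append((node.left, path))
    return None
```

Running a check like this over each tree of a trained forest would flag branches where a split repeats instead of making progress; a tree whose splits differ from node to node returns `None`.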



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
