[ https://issues.apache.org/jira/browse/IGNITE-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750303#comment-17750303 ]
Alexandr Shapkin commented on IGNITE-20139:
-------------------------------------------

[~zaleslaw] could you please take a look? This seems to be a valid issue, and a reproducer is attached.

> RandomForestClassifierTrainer accuracy issue
> --------------------------------------------
>
>                 Key: IGNITE-20139
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20139
>             Project: Ignite
>          Issue Type: Bug
>          Components: ml
>    Affects Versions: 2.15
>            Reporter: Alexandr Shapkin
>            Priority: Major
>         Attachments: TreeSample2_Portfolio_Change.png, random-forest.zip
>
>
> We tried to use GridGain's machine learning capabilities and discovered a
> bug in GG's implementation of Random Forest. When comparing GG's output with
> a Python prototype (scikit-learn library), we noticed that GG's predictions
> have much lower accuracy despite using the same data set and model parameters.
> Further investigation showed that GridGain generates decision trees that
> "loop": the tree keeps checking the same condition over and over until
> it reaches the maximum tree depth.
> I've attached a standalone reproducer which uses a small excerpt of our data
> set.
> It loads data from the CSV file, then trains the model with just 1 tree.
> The reproducer then finds one of the looping branches and prints it. You
> will see that every single node in the branch uses the same feature and
> value and has the same calculated impurity.
> On my machine the code reproduces this issue 100% of the time.
> I've also attached an example of the tree generated by Python's scikit-learn
> on the same data set with the same parameters. In Python the tree usually
> doesn't get deeper than 20 nodes.
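
For readers without the attachment, the check described in the report (finding a branch in which every node repeats the same feature, value, and impurity) might look roughly like the sketch below. This is not the code from random-forest.zip and does not use any Ignite or GridGain API: the `Node` class and the `longest_repeated_run` function are hypothetical names introduced purely for illustration, and the tree is assumed to be a plain binary tree of such nodes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node (an assumption, not Ignite's actual class)."""
    feature: int
    threshold: float
    impurity: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def longest_repeated_run(root: Optional[Node]) -> List[Node]:
    """Return the longest parent-to-child chain in which every node has the
    same (feature, threshold, impurity) triple as its parent."""
    best: List[Node] = []
    # Iterative traversal, so very deep trees do not hit the recursion limit.
    # Each stack entry is (node, chain of identical splits ending at node).
    stack = [(root, [root])] if root is not None else []
    while stack:
        node, run = stack.pop()
        if len(run) > len(best):
            best = run
        for child in (node.left, node.right):
            if child is None:
                continue
            # Exact comparison is intentional: the report describes nodes
            # whose feature, value, and impurity are identical.
            same = (child.feature, child.threshold, child.impurity) == (
                node.feature, node.threshold, node.impurity)
            stack.append((child, run + [child] if same else [child]))
    return best


# Usage: a toy tree in which three consecutive nodes repeat the same split.
leaf = Node(feature=2, threshold=0.5, impurity=0.1)
loop = Node(1, 3.0, 0.4, left=Node(1, 3.0, 0.4, left=Node(1, 3.0, 0.4, left=leaf)))
root = Node(0, 1.0, 0.6, left=loop)
run = longest_repeated_run(root)
print(len(run), run[0].feature, run[0].threshold)
```

In a healthy tree the longest such chain would normally be a single node; a chain whose length approaches the configured maximum depth would match the symptom described above.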