[jira] [Commented] (IGNITE-20139) RandomForestClassifierTrainer accuracy issue

Alexandr Shapkin (Jira) Wed, 02 Aug 2023 06:14:05 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750303#comment-17750303
 ]


Alexandr Shapkin commented on IGNITE-20139:
-------------------------------------------

[~zaleslaw] could you please take a look? This seems to be a valid issue, and 
the attached reproducer demonstrates it.

> RandomForestClassifierTrainer accuracy issue
> --------------------------------------------
>
>                 Key: IGNITE-20139
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20139
>             Project: Ignite
>          Issue Type: Bug
>          Components: ml
>    Affects Versions: 2.15
>            Reporter: Alexandr Shapkin
>            Priority: Major
>         Attachments: TreeSample2_Portfolio_Change.png, random-forest.zip
>
>
> We tried to use GridGain's machine learning capabilities and discovered a 
> bug in GG's implementation of Random Forest. When comparing GG's output with 
> a Python prototype (scikit-learn), we noticed that GG's predictions have 
> much lower accuracy despite using the same data set and model parameters. 
> Further investigation showed that GridGain generates decision trees that 
> "loop": a branch keeps checking the same condition over and over until 
> it reaches the maximum tree depth.
> I've attached a standalone reproducer that uses a small excerpt of our data 
> set. 
> It loads data from the CSV file, then trains the model with 
> just 1 tree. The reproducer then finds one of the looping branches and prints 
> it. You will see that every single node in the branch uses the same feature 
> and value and has the same calculated impurity. 
> On my machine the code reproduces this issue 100% of the time.
> I've also attached an example of the tree generated by Python's scikit-learn 
> on the same data set with the same parameters. In Python the tree usually 
> doesn't get deeper than 20 nodes.
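
For readers without the attachment, the looping pattern described above could be detected with something like the sketch below. It is not the attached reproducer and does not use Ignite's API: the `Node` class, its field names, and the helper `make_chain` are hypothetical stand-ins for whatever tree structure the trained model exposes. The idea is just to walk the tree and measure the longest parent-to-child chain whose nodes all split on the same feature and value.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node (an assumption, not Ignite's node class)."""
    feature_id: int = -1
    threshold: float = 0.0
    impurity: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def longest_repeated_run(root: Optional[Node]) -> int:
    """Length of the longest chain of consecutive internal nodes
    (parent -> child) that share the same (feature_id, threshold)."""
    best = 0
    # Each stack entry: (node, split of its parent, run length so far).
    stack = [(root, None, 0)]
    while stack:
        node, parent_split, run = stack.pop()
        if node is None or node.is_leaf():
            continue
        split: Tuple[int, float] = (node.feature_id, node.threshold)
        # Extend the run if this node repeats its parent's split.
        run = run + 1 if split == parent_split else 1
        best = max(best, run)
        stack.append((node.left, split, run))
        stack.append((node.right, split, run))
    return best


def make_chain(depth: int) -> Node:
    """Build a degenerate branch: `depth` internal nodes, all with the
    same split, mimicking the looping branch described in the report."""
    node = Node()  # bottom leaf
    for _ in range(depth):
        node = Node(feature_id=3, threshold=0.5, impurity=0.42,
                    left=node, right=Node())
    return node


if __name__ == "__main__":
    print(longest_repeated_run(make_chain(5)))
```

In a healthy tree a run longer than 1 would be unusual, since repeating the identical split on a child cannot separate the data further; a run close to the configured maximum depth would match the behavior reported here.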



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
