[jira] [Updated] (IGNITE-20139) RandomForestClassifierTrainer accuracy issue

2023-08-09 Thread Alexandr Shapkin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandr Shapkin updated IGNITE-20139:
--
Description: 
We tried to use Ignite's machine learning capabilities and discovered a bug in 
the implementation of Random Forest. When comparing Ignite's output with a 
Python prototype (scikit-learn), we noticed that Ignite's predictions have much 
lower accuracy despite using the same data set and model parameters. 

Further investigation showed that Ignite generates decision trees that 
effectively "loop": a tree keeps checking the same condition over and over 
until it reaches the maximum tree depth.

I've attached a standalone reproducer which uses a small excerpt of our data 
set. 

It loads data from the CSV file and trains the model with just one tree. The 
reproducer then finds one of the looping branches and prints it. You will see 
that every node in the branch uses the same feature and value and has the same 
calculated impurity. 

On my machine the code reproduces this issue 100% of the time.

I've also attached an example of the tree generated by Python's scikit-learn on 
the same data set with the same parameters. In Python the tree usually doesn't 
get deeper than 20 nodes.
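For reference, a minimal sketch of the scikit-learn side of such a comparison. The real data set is in the attached random-forest.zip, so the synthetic data, tree parameters, and helper name below are placeholders only; the point is to train a one-tree "forest" and check a branch for repeated (feature, threshold) conditions, which is what the reproducer looks for:

```python
# Hypothetical sketch of the Python prototype; synthetic data stands in
# for the real CSV from random-forest.zip.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# A "forest" of a single tree, as in the attached reproducer.
clf = RandomForestClassifier(n_estimators=1, max_depth=30, random_state=42)
clf.fit(X, y)

tree = clf.estimators_[0].tree_

def branch_conditions(node=0, path=()):
    """Walk the left-most branch, collecting (feature, threshold) pairs."""
    if tree.children_left[node] == -1:  # leaf node
        return path
    cond = (int(tree.feature[node]), round(float(tree.threshold[node]), 6))
    return branch_conditions(tree.children_left[node], path + (cond,))

conds = branch_conditions()
print("branch depth:", len(conds))
print("repeated conditions:", len(conds) - len(set(conds)))
```

A healthy tree may split on the same feature more than once, but with different thresholds; a branch where the exact (feature, threshold) pair repeats down to max depth is the "looping" pattern described above.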

  was:
We tried to use GridGain's machine learning capabilities and discovered a bug 
in GG's implementation of Random Forest. When comparing GG's output with a 
Python prototype (scikit-learn), we noticed that GG's predictions have much 
lower accuracy despite using the same data set and model parameters. 

Further investigation showed that GridGain generates decision trees that 
effectively "loop": a tree keeps checking the same condition over and over 
until it reaches the maximum tree depth.

I've attached a standalone reproducer which uses a small excerpt of our data 
set. 

It loads data from the CSV file and trains the model with just one tree. The 
reproducer then finds one of the looping branches and prints it. You will see 
that every node in the branch uses the same feature and value and has the same 
calculated impurity. 

On my machine the code reproduces this issue 100% of the time.

I've also attached an example of the tree generated by Python's scikit-learn on 
the same data set with the same parameters. In Python the tree usually doesn't 
get deeper than 20 nodes.


> RandomForestClassifierTrainer accuracy issue
> 
>
> Key: IGNITE-20139
> URL: https://issues.apache.org/jira/browse/IGNITE-20139
> Project: Ignite
>  Issue Type: Bug
>  Components: ml
>Affects Versions: 2.15
>Reporter: Alexandr Shapkin
>Assignee: Igor Belyakov
>Priority: Major
> Attachments: TreeSample2_Portfolio_Change.png, random-forest.zip
>
>
> We tried to use Ignite's machine learning capabilities and discovered a bug 
> in the implementation of Random Forest. When comparing Ignite's output with a 
> Python prototype (scikit-learn), we noticed that Ignite's predictions have 
> much lower accuracy despite using the same data set and model parameters. 
> Further investigation showed that Ignite generates decision trees that 
> effectively "loop": a tree keeps checking the same condition over and over 
> until it reaches the maximum tree depth.
> I've attached a standalone reproducer which uses a small excerpt of our data 
> set. 
> It loads data from the CSV file and trains the model with just one tree. The 
> reproducer then finds one of the looping branches and prints it. You will see 
> that every node in the branch uses the same feature, value and has the same 
> calculated impurity. 
> On my machine the code reproduces this issue 100% of the time.
> I've also attached an example of the tree generated by Python's scikit-learn 
> on the same data set with the same parameters. In Python the tree usually 
> doesn't get deeper than 20 nodes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-20139) RandomForestClassifierTrainer accuracy issue

2023-08-02 Thread Alexandr Shapkin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandr Shapkin updated IGNITE-20139:
--
Summary: RandomForestClassifierTrainer accuracy issue  (was: 
RandomForestClassifierTrainer is checking the same conditions)

> RandomForestClassifierTrainer accuracy issue
> 
>
> Key: IGNITE-20139
> URL: https://issues.apache.org/jira/browse/IGNITE-20139
> Project: Ignite
>  Issue Type: Bug
>  Components: ml
>Affects Versions: 2.15
>Reporter: Alexandr Shapkin
>Priority: Major
> Attachments: TreeSample2_Portfolio_Change.png, random-forest.zip
>
>
> We tried to use GridGain's machine learning capabilities and discovered a 
> bug in GG's implementation of Random Forest. When comparing GG's output with 
> a Python prototype (scikit-learn), we noticed that GG's predictions have 
> much lower accuracy despite using the same data set and model parameters. 
> Further investigation showed that GridGain generates decision trees that 
> effectively "loop": a tree keeps checking the same condition over and over 
> until it reaches the maximum tree depth.
> I've attached a standalone reproducer which uses a small excerpt of our data 
> set. 
> It loads data from the CSV file and trains the model with just one tree. The 
> reproducer then finds one of the looping branches and prints it. You will see 
> that every node in the branch uses the same feature, value and has the same 
> calculated impurity. 
> On my machine the code reproduces this issue 100% of the time.
> I've also attached an example of the tree generated by Python's scikit-learn 
> on the same data set with the same parameters. In Python the tree usually 
> doesn't get deeper than 20 nodes.


