Hello team,

I am trying to understand why logistic regression returns uncalibrated 
probabilities, with values skewed towards low probabilities for the positive 
(rare) class, when trained on an imbalanced dataset.

I've read a number of articles; they all seem to agree that this is the case, 
and many show empirical evidence, but none give a mathematical demonstration. 
When I test it myself, I can see that this is indeed the case: logistic 
regression on imbalanced datasets returns uncalibrated probabilities.

I understand that it has to do with the cost function, because if we 
re-balance the dataset with, say, class_weight='balanced', then the resulting 
probabilities seem to be calibrated.
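
For context, this is roughly the kind of comparison I mean (just a sketch on a 
synthetic imbalanced dataset; the make_classification setup and parameters are 
illustrative, not my actual data):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.calibration import calibration_curve

    # Synthetic, heavily imbalanced binary problem (~2% positives).
    X, y = make_classification(
        n_samples=50_000, n_features=20, weights=[0.98, 0.02], random_state=0
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0
    )

    for cw in (None, "balanced"):
        clf = LogisticRegression(class_weight=cw, max_iter=1000)
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(X_test)[:, 1]
        # calibration_curve returns the observed fraction of positives and
        # the mean predicted probability per bin; for a calibrated model
        # the two should track each other.
        frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
        print(f"class_weight={cw!r}")
        for mp, fp in zip(mean_pred, frac_pos):
            print(f"  mean predicted={mp:.3f}  observed fraction={fp:.3f}")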

I was wondering whether any of you knows of a mathematical demonstration that 
supports this conclusion? Any mathematical derivation, or a clear explanation 
of why logistic regression would return uncalibrated probabilities when 
trained on an imbalanced dataset, would help.

Any link to a relevant article, video, presentation, etc., would be greatly 
appreciated.

Thanks a lot!

Sole