Hello team,
I am trying to understand why logistic regression returns uncalibrated
probabilities, with values skewed toward low probabilities for the positive
(rare) class, when trained on an imbalanced dataset.
I've read a number of articles; all seem to agree that this is the case, and
many show empirical evidence, but none give a mathematical derivation. When I
test it myself, I can see that this is indeed the case: logistic regression on
imbalanced datasets returns uncalibrated probabilities.
And I understand that it has to do with the cost function, because if we
re-balance the dataset with, say, class_weight='balanced', then the
probabilities seem to be calibrated as a result.
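For context, here is the kind of quick check I mean (a minimal sketch on a synthetic dataset; the dataset parameters and variable names are my own, not from any particular article). It compares the average predicted positive-class probability of a plain fit against a class_weight='balanced' fit, to show how the re-weighting shifts the probability scale:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical toy setup: a synthetic dataset with ~5% positives.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(class_weight='balanced',
                              max_iter=1000).fit(X, y)

# Mean predicted positive-class probability under each model,
# compared against the empirical base rate.
p_plain = plain.predict_proba(X)[:, 1].mean()
p_bal = balanced.predict_proba(X)[:, 1].mean()

print(f"base rate:              {y.mean():.3f}")
print(f"mean P(y=1), plain:     {p_plain:.3f}")
print(f"mean P(y=1), balanced:  {p_bal:.3f}")
```

On my runs, the re-weighted model's mean predicted probability moves well above the base rate, while the plain model's stays close to it, which is the behaviour I'm trying to pin down mathematically.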
I was wondering if any of you knows of a mathematical derivation that supports
this conclusion? Any mathematical proof, or clear explanation, of why logistic
regression would return uncalibrated probabilities when trained on an
imbalanced dataset?
Any link to a relevant article, video, presentation, etc, will be greatly
appreciated.
Thanks a lot!
Sole
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn