He doesn't only talk about black box vs. statistical; he also talks about
model-based vs. prediction-based.
He says that if you validate predictions, you don't need to
(necessarily) worry about model misspecification.
A linear regression model can be misspecified, and it can be overfit.
Just fitting the model will not inform you whether either of these is
the case.
Because the model is simple and well understood, there are several ways
to check for model misspecification and overfitting.
A train-test-split doesn't exactly tell you whether the model is
misspecified (errors could be non-normal and prediction could still be
good),
but it gives you an idea if the model is "useful".
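A rough sketch of that last point, on made-up synthetic data (the sample
sizes, the noise features and all variable names are assumptions for
illustration only): a plain linear regression with many irrelevant
features looks great on the data it was fit on, and the held-out score
is what shows you it is overfit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): only the first
# feature matters, the other 29 are pure noise.
rng = np.random.RandomState(0)
n_samples, n_features = 80, 30
X = rng.normal(size=(n_samples, n_features))
y = 2.0 * X[:, 0] + rng.normal(size=n_samples)

# Hold out half of the data for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# score() returns R^2; compare the fit on seen vs. unseen data.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R^2: {train_r2:.2f}")
print(f"test  R^2: {test_r2:.2f}")
```

The gap between the two numbers is the signal here; neither number by
itself tells you whether the error assumptions of the model hold.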
Basically: you need to validate whatever you did. There are model-based
approaches and there are prediction-based approaches.
Prediction-based approaches are always applicable; model-based
approaches are usually more limited and harder to do (but if you find a
good model, you've got a model of the process, which is great!). But you
need to pick at least one of the two approaches.
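To make the two kinds of validation concrete, here is a minimal sketch
on synthetic data where the true relationship is quadratic (the data,
the variable names and the choice of a residual correlation as the
model-based check are my assumptions, not a recipe). Cross-validation is
the prediction-based check; looking for leftover structure in the
residuals is one simple model-based check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration): y depends on x^2.
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=200)
y = x ** 2 + rng.normal(scale=0.5, size=200)

X_linear = x.reshape(-1, 1)                 # misspecified: linear in x only
X_quadratic = np.column_stack([x, x ** 2])  # includes the true term

# Prediction-based validation: mean cross-validated R^2 for each model.
cv_linear = cross_val_score(LinearRegression(), X_linear, y, cv=5).mean()
cv_quadratic = cross_val_score(LinearRegression(), X_quadratic, y, cv=5).mean()

# Model-based validation: residuals of a well-specified model should not
# be explainable by anything; here they still track x^2 closely.
residuals = y - LinearRegression().fit(X_linear, y).predict(X_linear)
resid_curvature = np.corrcoef(residuals, x ** 2)[0, 1]

print(f"CV R^2, linear in x:       {cv_linear:.2f}")
print(f"CV R^2, with x^2 term:     {cv_quadratic:.2f}")
print(f"corr(residuals, x^2):      {resid_curvature:.2f}")
```

Either check flags the problem in this toy case. The residual check
only works because I knew which structure to look for, which is the
sense in which model-based approaches are more limited.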
On 6/12/19 2:36 PM, C W wrote:
Thank you both for the paper references.
@ Andreas,
What is your take? And what are you implying?
The Breiman (2001) paper points out the black box vs. statistical
approach. I call them black box vs. open box. He advocates black box
in the paper.
Black box:
y <--- nature <--- x
Open box:
y <--- linear regression <---- x
Decision trees and neural nets are black-box models. They require large
amounts of data to train, and skip the part where you try to
understand nature.
Because it is a black box, you can't open it up to see what's inside.
Linear regression is a very simple model that you can use to
approximate nature, but the key thing is that you need to know how the
data are generated.
@ Brown,
I know nothing about molecular modeling. The paper you linked, "Beware
of q2!", raises some interesting points. As far as I can see, in
sklearn linear regression, score is R^2.
On Wed, Jun 5, 2019 at 9:11 AM Andreas Mueller <t3k...@gmail.com> wrote:
On 6/4/19 8:44 PM, C W wrote:
> Thank you all for the replies.
>
> I agree that prediction accuracy is great for evaluating black-box ML
> models. Especially advanced models like neural networks, or
> not-so-black models like LASSO, because they are NP-hard to solve.
>
> Linear regression is not a black-box. I view prediction accuracy as an
> overkill on interpretable models. Especially when you can use
> R-squared, coefficient significance, etc.
>
> Prediction accuracy also does not tell you which feature is important.
>
> What do you guys think? Thank you!
>
Did you read the paper that I sent? ;)
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn