[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237774#comment-16237774 ]
Teng Peng commented on SPARK-22433:
-----------------------------------

What I agree with: be coherent, and prefer the ML-oriented standard.

What I want to add: be coherent, and prefer the ML-oriented standard only when we are talking about ML. When we are talking about traditional statistics, we should stick to the established standard of traditional statistics.

What I want to explain:

1. ML world: there is a training set and a test set. We have them to evaluate whether our models have good prediction performance. Without them, overfitting is unavoidable.
   Traditional statistics world: there is no training set and test set, because the goal is interpretation of models, not prediction performance.
   R^2 belongs to the framework of traditional statistics and has nothing to do with prediction-related goals. If we are using R^2, we are in the domain of traditional statistics. If our goal is interpretation, then we look at R^2.

2. regressionMetric and regressionEvaluator are designed for ML-related goals using a linear regression approach (which might be useful as a benchmark). So these two actually belong to the ML world, not to traditional statistics. However, R^2 is mixed into them. This mixture appears everywhere. Look at test("cross validation with linear regression"): R^2 is evaluated by cross validation, and the larger the better. This is a misunderstanding of what R^2 is.

The bottom line: there is a clear distinction between traditional statistics and ML. If something belongs to traditional statistics, we should not mix it with ML.

> Linear regression R^2 train/test terminology related
> -----------------------------------------------------
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Teng Peng
> Priority: Minor
>
> Traditional statistics is traditional statistics.
> Their goal, framework, and terminologies are not the same as ML. However, in linear regression related
> components, this distinction is not clear, which is reflected in:
> 1. regressionMetric + regressionEvaluator:
>    * R2 shouldn't be there.
>    * A better name: "regressionPredictionMetric".
> 2. LinearRegressionSuite:
>    * Shouldn't test R2 and residuals on test data.
>    * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization".
> Linear regression is linear. Once a penalty term is added, it is no longer
> linear. Just call it "LASSO" or "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the
> basic distinction blurred.
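
A minimal sketch of the R^2 point made in the comment above, in plain Python rather than Spark. The data are made up and the helper names (`fit_ols`, `r_squared`) are hypothetical, chosen only for illustration. For an ordinary least squares fit with an intercept, R^2 computed on the data used for fitting lies between 0 and 1 and can be read as a share of variance explained. The same formula applied to held-out data has no such bound and can go negative, so it no longer carries that interpretation.

```python
def fit_ols(xs, ys):
    """Fit y = a + b*x by ordinary least squares; return (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def r_squared(xs, ys, a, b):
    """Return 1 - SSE/SST, with SST taken around the mean of the supplied ys."""
    my = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1.0 - sse / sst


# Hypothetical toy data, made up for illustration only.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.2, 1.9, 3.2, 3.8, 5.1]
test_x = [6.0, 7.0, 8.0]
test_y = [3.0, 2.5, 2.0]  # deliberately follows a different trend

a, b = fit_ols(train_x, train_y)
r2_train = r_squared(train_x, train_y, a, b)  # in-sample: bounded by [0, 1]
r2_test = r_squared(test_x, test_y, a, b)     # held-out: can be negative

print(f"in-sample R^2: {r2_train:.3f}")
print(f"held-out  R^2: {r2_test:.3f}")
```

On the held-out points the fitted line does worse than simply predicting their mean, which is why the second value drops below zero.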