[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237774#comment-16237774 ]

Teng Peng commented on SPARK-22433:
-----------------------------------

Where I agree with you: we should be coherent, and we prefer the ML-oriented 
standard.

What I want to add: we should be coherent, and we prefer the ML-oriented 
standard only when we are talking about ML. When we are talking about 
traditional statistics, we should stick to the established standard of 
traditional statistics.

What I want to explain:
1.
ML world: there is a training set and a test set. We use them to evaluate 
whether our models have good prediction performance. Without them, overfitting 
is unavoidable.
Traditional statistics world: there is no training set and no test set, because 
the goal is interpretation of the model, not prediction performance. 
R^2 belongs to the framework of traditional statistics, and it has nothing to 
do with prediction-related goals. If we are using R^2, we are in the domain of 
traditional statistics. If our goal is interpretation, then we look at R^2. 

2.
regressionMetric and regressionEvaluator are designed for ML-related goals 
using a linear regression approach (which might be useful as a benchmark). So 
these two are actually in the domain of the ML world, not traditional 
statistics. However, R^2 is mixed into them. This mixture appears everywhere. 
Look at test("cross validation with linear regression"): R^2 is evaluated by 
cross validation, and the larger the better. This is a misunderstanding of what 
R^2 is.
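
To make the in-sample vs. held-out difference concrete, here is a minimal 
sketch in plain Python. It is not Spark code: the helper names fit_simple_ols 
and r_squared and the toy numbers are made up for illustration. It fits a 
one-variable least-squares line with an intercept, then computes 
1 - SS_res / SS_tot once on the data used for fitting and once on separate 
data. On the fitting data the value is guaranteed to lie between 0 and 1 and 
reads as "proportion of variance explained". On other data that guarantee no 
longer holds, and the value can even be negative.

```python
def fit_simple_ols(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(xs, ys, a, b):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of ys."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Toy data, invented for this illustration only.
fit_x = [1.0, 2.0, 3.0, 4.0]
fit_y = [1.0, 3.0, 2.0, 5.0]
other_x = [5.0, 6.0, 7.0]
other_y = [2.0, 1.0, 2.0]

a, b = fit_simple_ols(fit_x, fit_y)
r2_in_sample = r_squared(fit_x, fit_y, a, b)     # on the data used for fitting
r2_held_out = r_squared(other_x, other_y, a, b)  # on data not used for fitting

print("in-sample R^2:", r2_in_sample)
print("held-out  R^2:", r2_held_out)
```

With these toy numbers the in-sample value falls between 0 and 1 while the 
held-out value is negative, so the two numbers cannot be read the same way.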

The bottom line: there is a clear distinction between traditional statistics 
and ML. If something belongs to traditional statistics, then we should not mix 
it with ML.

> Linear regression R^2 train/test terminology related 
> -----------------------------------------------------
>
>                 Key: SPARK-22433
>                 URL: https://issues.apache.org/jira/browse/SPARK-22433
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Teng Peng
>            Priority: Minor
>
> Traditional statistics is traditional statistics. Their goal, framework, and 
> terminologies are not the same as ML. However, in linear regression related 
> components, this distinction is not clear, which is reflected:
> 1. regressionMetric + regressionEvaluator : 
> * R2 shouldn't be there. 
> * A better name "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear. Adding a penalty term, then it is no longer 
> linear. Just call it "LASSO", "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the 
> basic distinction is blurred.


