[jira] [Updated] (SPARK-55969) REGR_R2 returns wrong result

Shaobo Guan (Jira) Wed, 11 Mar 2026 20:02:40 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-55969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shaobo Guan updated SPARK-55969:
--------------------------------
    Description: 
REGR_R2 returns wrong results:

Row 6 (grp=same_x):
Column 'out': actual=1.0, expected=null
Row 7 (grp=same_y):
Column 'out': actual=null, expected=1.0

 

Repro
CREATE OR REPLACE TEMPORARY VIEW t AS
SELECT *
FROM VALUES
  ('all_null', NULL, NULL), ('all_null', NULL, NULL),
  ('all_null', NULL, NULL),
  ('single', 1.0, 2.0),
  ('single_null', NULL, NULL),
  ('same_x', 1.0, 5.0),
  ('same_x', 2.0, 5.0),
  ('same_x', 3.0, 5.0),
  ('same_y', 5.0, 1.0),
  ('same_y', 5.0, 2.0),
  ('same_y', 5.0, 3.0) AS t(grp, y, x);

select grp, regr_r2(y, x) as out from t group by grp order by grp;
 
 

Note that the arguments are passed as (y, x), i.e. the dependent variable 
first and the independent variable second, as per the documentation: 
[https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#:~:text=Returns%20the%20average%20of%20the,regr_sxy(y%2C%20x)]

 

Why do I believe this is a wrong result?

Unfortunately, I gave all my stats 101 knowledge back to my college professor. 
So I googled it:
Summary of Behavioral Edge Cases 
||Scenario||Data Condition||{{REGR_R2(y, x)}} Result||
|*Same x*|VAR_POP(x) = 0|*NULL* (undefined slope)|
|*Same y*|VAR_POP(y) = 0 AND VAR_POP(x) ≠ 0|*1* (perfect "fit" to a horizontal line)|
|*Empty Set*|No rows or only NULL pairs|*NULL*|
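
The edge cases in the table can be cross-checked outside Spark. The following is a minimal sketch in plain Python, assuming REGR_R2(y, x) is defined as in the table above: NULL when VAR_POP(x) = 0, 1 when VAR_POP(y) = 0 and VAR_POP(x) ≠ 0, and CORR(y, x) squared otherwise. The function name `regr_r2` and the `groups` dictionary are illustrative only and are not Spark APIs; `None` stands in for SQL NULL.

```python
from typing import Optional, Sequence, Tuple

Pair = Tuple[Optional[float], Optional[float]]


def regr_r2(pairs: Sequence[Pair]) -> Optional[float]:
    """Reference REGR_R2(y, x) over (y, x) pairs (assumed definition, see lead-in)."""
    # Only rows where both y and x are non-NULL take part in the aggregate.
    rows = [(y, x) for y, x in pairs if y is not None and x is not None]
    n = len(rows)
    if n == 0:
        return None  # empty set, or only NULL pairs
    mean_y = sum(y for y, _ in rows) / n
    mean_x = sum(x for _, x in rows) / n
    # These are n * VAR_POP(x), n * VAR_POP(y) and n * COVAR_POP(y, x).
    sxx = sum((x - mean_x) ** 2 for _, x in rows)
    syy = sum((y - mean_y) ** 2 for y, _ in rows)
    sxy = sum((y - mean_y) * (x - mean_x) for y, x in rows)
    if sxx == 0:
        return None  # VAR_POP(x) = 0: the slope is undefined
    if syy == 0:
        return 1.0   # VAR_POP(y) = 0 and VAR_POP(x) != 0
    return (sxy * sxy) / (sxx * syy)  # equals CORR(y, x) squared


# Same rows as the repro view t(grp, y, x), keyed by grp.
groups = {
    "all_null": [(None, None), (None, None), (None, None)],
    "single": [(1.0, 2.0)],
    "single_null": [(None, None)],
    "same_x": [(1.0, 5.0), (2.0, 5.0), (3.0, 5.0)],
    "same_y": [(5.0, 1.0), (5.0, 2.0), (5.0, 3.0)],
}

for grp in sorted(groups):
    print(grp, regr_r2(groups[grp]))
```

Under this assumed definition the sketch gives None for same_x and 1.0 for same_y, which matches the expected values listed at the top of this description and is the opposite of what Spark currently returns.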


> REGR_R2 returns wrong result
> ----------------------------
>
>                 Key: SPARK-55969
>                 URL: https://issues.apache.org/jira/browse/SPARK-55969
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.6
>            Reporter: Shaobo Guan
>            Priority: Major



