[jira] [Updated] (SPARK-48719) Wrong Result in regr_slope_intercept Aggregate with Tuples has NULL

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48719:
---
Labels: pull-request-available  (was: )

> Wrong Result in regr_slope_intercept Aggregate with Tuples has NULL
> 
>
> Key: SPARK-48719
> URL: https://issues.apache.org/jira/browse/SPARK-48719
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Jonathon Lee
>Priority: Major
>  Labels: pull-request-available
>
> When calculate slope and intercept using regr_slope & regr_intercept 
> aggregate:
> (using Java api)
> {code:java}
> spark.sql("drop table if exists tab");
> spark.sql("CREATE TABLE tab(y int, x int) using parquet");
> spark.sql("INSERT INTO tab VALUES (1, 1)");
> spark.sql("INSERT INTO tab VALUES (2, 3)");
> spark.sql("INSERT INTO tab VALUES (3, 5)");
> spark.sql("INSERT INTO tab VALUES (NULL, 3)");
> spark.sql("INSERT INTO tab VALUES (3, NULL)");
> spark.sql("SELECT " +
> "regr_slope(x, y), " +
> "regr_intercept(x, y)" +
> "FROM tab").show(); {code}
> Spark result:
> {code:java}
> +--++
> |  regr_slope(x, y)|regr_intercept(x, y)|
> +--++
> |1.4545454545454546| 0.09090909090909083|
> +--++ {code}
> The correct answer should be 2.0 and -1.0 obviously.
>  
> Reason:
> In sql/catalyst/expressions/aggregate/linearRegression.scala,
>  
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends 
> DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(right)
> .. {code}
> CovPopulation will filter tuples which right *OR* left is NULL
> But VariancePop will only filter null right expression.
> This will cause wrong result when some of the tuples' left is null (and right 
> is not null).
> {*}Same reason with RegrIntercept{*}.
>  
> A possible fix:
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends 
> DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(If(And(IsNotNull(left), 
> IsNotNull(right)),
> right, Literal.create(null, right.dataType))) 
> .{code}
> *same fix to RegrIntercept*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48719) Wrong Result in regr_slope_intercept Aggregate with Tuples has NULL

2024-06-25 Thread Jonathon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathon Lee updated SPARK-48719:
-
Description: 
When calculate slope and intercept using regr_slope & regr_intercept aggregate:

(using Java api)
{code:java}
spark.sql("drop table if exists tab");
spark.sql("CREATE TABLE tab(y int, x int) using parquet");
spark.sql("INSERT INTO tab VALUES (1, 1)");
spark.sql("INSERT INTO tab VALUES (2, 3)");
spark.sql("INSERT INTO tab VALUES (3, 5)");
spark.sql("INSERT INTO tab VALUES (NULL, 3)");
spark.sql("INSERT INTO tab VALUES (3, NULL)");
spark.sql("SELECT " +
"regr_slope(x, y), " +
"regr_intercept(x, y)" +
"FROM tab").show(); {code}
Spark result:
{code:java}
+--++
|  regr_slope(x, y)|regr_intercept(x, y)|
+--++
|1.4545454545454546| 0.09090909090909083|
+--++ {code}
The correct answer should be 2.0 and -1.0 obviously.

 

Reason:

In sql/catalyst/expressions/aggregate/linearRegression.scala,

 
{code:java}
case class RegrSlope(left: Expression, right: Expression) extends 
DeclarativeAggregate
  with ImplicitCastInputTypes with BinaryLike[Expression] {

  private val covarPop = new CovPopulation(right, left)

  private val varPop = new VariancePop(right)
.. {code}
CovPopulation will filter tuples which right *OR* left is NULL

But VariancePop will only filter null right expression.

This will cause wrong result when some of the tuples' left is null (and right 
is not null).

{*}Same reason with RegrIntercept{*}.

 

A possible fix:
{code:java}
case class RegrSlope(left: Expression, right: Expression) extends 
DeclarativeAggregate
  with ImplicitCastInputTypes with BinaryLike[Expression] {

  private val covarPop = new CovPopulation(right, left)

  private val varPop = new VariancePop(If(And(IsNotNull(left), 
IsNotNull(right)),
right, Literal.create(null, right.dataType))) 
.{code}
*same fix to RegrIntercept*

  was:
When calculate slope and intercept using regr_slope & regr_intercept aggregate:

(using Java api)
{code:java}
spark.sql("drop table if exists tab");
spark.sql("CREATE TABLE tab(y int, x int) using parquet");
spark.sql("INSERT INTO tab VALUES (1, 1)");
spark.sql("INSERT INTO tab VALUES (2, 3)");
spark.sql("INSERT INTO tab VALUES (3, 5)");
spark.sql("INSERT INTO tab VALUES (NULL, 3)");
spark.sql("INSERT INTO tab VALUES (3, NULL)");
spark.sql("SELECT " +
"regr_slope(x, y), " +
"regr_intercept(x, y)" +
"FROM tab").show(); {code}
Spark result:
{code:java}
+--++
|  regr_slope(x, y)|regr_intercept(x, y)|
+--++
|1.4545454545454546| 0.09090909090909083|
+--++ {code}
The correct answer should be 2.0 and -1.0 obviously.

 

Reason:

In sql/catalyst/expressions/aggregate/linearRegression.scala,

 
{code:java}
case class RegrSlope(left: Expression, right: Expression) extends 
DeclarativeAggregate
  with ImplicitCastInputTypes with BinaryLike[Expression] {

  private val covarPop = new CovPopulation(right, left)

  private val varPop = new VariancePop(right)
.. {code}
CovPopulation will filter tuples which right *OR* left is NULL

But VariancePop will only filter null right expression.

This will cause wrong result when some of the tuples' left is null (and right 
is not null).

Same reason with RegrIntercept.

 

A possible fix:
{code:java}
case class RegrSlope(left: Expression, right: Expression) extends 
DeclarativeAggregate
  with ImplicitCastInputTypes with BinaryLike[Expression] {

  private val covarPop = new CovPopulation(right, left)

  private val varPop = new VariancePop(If(And(IsNotNull(left), 
IsNotNull(right)),
right, Literal.create(null, right.dataType))) 
.{code}
 


> Wrong Result in regr_slope_intercept Aggregate with Tuples has NULL
> 
>
> Key: SPARK-48719
> URL: https://issues.apache.org/jira/browse/SPARK-48719
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Jonathon Lee
>Priority: Major
>
> When calculate slope and intercept using regr_slope & regr_intercept 
> aggregate:
> (using Java api)
> {code:java}
> spark.sql("drop table if exists tab");
> spark.sql("CREATE TABLE tab(y int, x int) using parquet");
> spark.sql("INSERT INTO tab VALUES (1, 1)");
> spark.sql("INSERT INTO tab VALUES (2, 3)");
> spark.sql("INSERT INTO tab VALUES (3, 5)");
> spark.sql("INSERT INTO tab VALUES (NULL, 3)");
> spark.sql("INSERT INTO tab VALUES (3, NULL)");
> spark.sql("SELECT " +
> "regr_slope(x, y), " +
> "regr_intercept(x, y)" +
> "FROM tab").show(); {code}
> Spark result:
> {code:java}