[ 
https://issues.apache.org/jira/browse/SPARK-48719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48719:
-----------------------------------

    Assignee: Jonathon Lee

> Wrong Result in regr_slope&regr_intercept Aggregate with Tuples has NULL
> ------------------------------------------------------------------------
>
>                 Key: SPARK-48719
>                 URL: https://issues.apache.org/jira/browse/SPARK-48719
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.4.0
>            Reporter: Jonathon Lee
>            Assignee: Jonathon Lee
>            Priority: Major
>              Labels: pull-request-available
>
> When calculate slope and intercept using regr_slope & regr_intercept 
> aggregate:
> (using Java api)
> {code:java}
> spark.sql("drop table if exists tab");
> spark.sql("CREATE TABLE tab(y int, x int) using parquet");
> spark.sql("INSERT INTO tab VALUES (1, 1)");
> spark.sql("INSERT INTO tab VALUES (2, 3)");
> spark.sql("INSERT INTO tab VALUES (3, 5)");
> spark.sql("INSERT INTO tab VALUES (NULL, 3)");
> spark.sql("INSERT INTO tab VALUES (3, NULL)");
> spark.sql("SELECT " +
>         "regr_slope(x, y), " +
>         "regr_intercept(x, y)" +
>         "FROM tab").show(); {code}
> Spark result:
> {code:java}
> +------------------+--------------------+
> |  regr_slope(x, y)|regr_intercept(x, y)|
> +------------------+--------------------+
> |1.4545454545454546| 0.09090909090909083|
> +------------------+--------------------+ {code}
> The correct answer should be 2.0 and -1.0 obviously.
>  
> Reason:
> In sql/catalyst/expressions/aggregate/linearRegression.scala,
>  
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends 
> DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(right)
> ...... {code}
> CovPopulation will filter tuples which right *OR* left is NULL
> But VariancePop will only filter null right expression.
> This will cause wrong result when some of the tuples' left is null (and right 
> is not null).
> {*}Same reason with RegrIntercept{*}.
>  
> A possible fix:
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends 
> DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(If(And(IsNotNull(left), 
> IsNotNull(right)),
>     right, Literal.create(null, right.dataType))) 
> .....{code}
> *same fix to RegrIntercept*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to