[
https://issues.apache.org/jira/browse/SPARK-48719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun closed SPARK-48719.
---------------------------------
> Wrong Result in regr_slope®r_intercept Aggregate with Tuples has NULL
> ------------------------------------------------------------------------
>
> Key: SPARK-48719
> URL: https://issues.apache.org/jira/browse/SPARK-48719
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 3.4.0
> Reporter: Jonathon Lee
> Assignee: Jonathon Lee
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> When calculate slope and intercept using regr_slope & regr_intercept
> aggregate:
> (using Java api)
> {code:java}
> spark.sql("drop table if exists tab");
> spark.sql("CREATE TABLE tab(y int, x int) using parquet");
> spark.sql("INSERT INTO tab VALUES (1, 1)");
> spark.sql("INSERT INTO tab VALUES (2, 3)");
> spark.sql("INSERT INTO tab VALUES (3, 5)");
> spark.sql("INSERT INTO tab VALUES (NULL, 3)");
> spark.sql("INSERT INTO tab VALUES (3, NULL)");
> spark.sql("SELECT " +
> "regr_slope(x, y), " +
> "regr_intercept(x, y)" +
> "FROM tab").show(); {code}
> Spark result:
> {code:java}
> +------------------+--------------------+
> | regr_slope(x, y)|regr_intercept(x, y)|
> +------------------+--------------------+
> |1.4545454545454546| 0.09090909090909083|
> +------------------+--------------------+ {code}
> The correct answer should be 2.0 and -1.0 obviously.
>
> Reason:
> In sql/catalyst/expressions/aggregate/linearRegression.scala,
>
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends
> DeclarativeAggregate
> with ImplicitCastInputTypes with BinaryLike[Expression] {
> private val covarPop = new CovPopulation(right, left)
> private val varPop = new VariancePop(right)
> ...... {code}
> CovPopulation will filter tuples which right *OR* left is NULL
> But VariancePop will only filter null right expression.
> This will cause wrong result when some of the tuples' left is null (and right
> is not null).
> {*}Same reason with RegrIntercept{*}.
>
> A possible fix:
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends
> DeclarativeAggregate
> with ImplicitCastInputTypes with BinaryLike[Expression] {
> private val covarPop = new CovPopulation(right, left)
> private val varPop = new VariancePop(If(And(IsNotNull(left),
> IsNotNull(right)),
> right, Literal.create(null, right.dataType)))
> .....{code}
> *same fix to RegrIntercept*
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]