[ https://issues.apache.org/jira/browse/SPARK-48719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan reassigned SPARK-48719:
-----------------------------------

    Assignee: Jonathon Lee

> Wrong Result in regr_slope&regr_intercept Aggregate with Tuples has NULL
> ------------------------------------------------------------------------
>
>                 Key: SPARK-48719
>                 URL: https://issues.apache.org/jira/browse/SPARK-48719
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.4.0
>            Reporter: Jonathon Lee
>            Assignee: Jonathon Lee
>            Priority: Major
>              Labels: pull-request-available
>
> When calculating the slope and intercept with the regr_slope and
> regr_intercept aggregates (using the Java API):
> {code:java}
> spark.sql("drop table if exists tab");
> spark.sql("CREATE TABLE tab(y int, x int) using parquet");
> spark.sql("INSERT INTO tab VALUES (1, 1)");
> spark.sql("INSERT INTO tab VALUES (2, 3)");
> spark.sql("INSERT INTO tab VALUES (3, 5)");
> spark.sql("INSERT INTO tab VALUES (NULL, 3)");
> spark.sql("INSERT INTO tab VALUES (3, NULL)");
> spark.sql("SELECT " +
>     "regr_slope(x, y), " +
>     "regr_intercept(x, y) " +
>     "FROM tab").show();
> {code}
> Spark result:
> {code:java}
> +------------------+--------------------+
> |  regr_slope(x, y)|regr_intercept(x, y)|
> +------------------+--------------------+
> |1.4545454545454546| 0.09090909090909083|
> +------------------+--------------------+
> {code}
> The correct results are 2.0 and -1.0, because only the three rows in which
> both x and y are non-NULL should contribute.
>
> Reason:
> In sql/catalyst/expressions/aggregate/linearRegression.scala,
>
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(right)
>   ......
> {code}
> CovPopulation skips tuples in which right *OR* left is NULL, but VariancePop
> only skips tuples whose right expression is NULL.
> This produces a wrong result when, in some tuples, left is NULL and right is
> not NULL.
> {*}The same problem applies to RegrIntercept{*}.
>
> A possible fix:
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(If(And(IsNotNull(left), IsNotNull(right)),
>     right, Literal.create(null, right.dataType)))
>   .....
> {code}
> *The same fix applies to RegrIntercept.*
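
The arithmetic behind the reported numbers can be checked without Spark. The following is a minimal plain-Java sketch, not Spark's implementation: the class `RegrNullDemo` and its methods are hypothetical names, and it assumes the slope is the sum of co-deviations over complete pairs divided by the sum of squared deviations of the second argument. The `pairwiseVariance` flag switches between a variance term that sees every non-NULL value of the second argument (the behavior described in the issue) and one restricted to complete pairs (the behavior the proposed fix aims for).

```java
import java.util.ArrayList;
import java.util.List;

public class RegrNullDemo {
    /** Arithmetic mean of the values. */
    static double avg(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
    }

    /** Sum of squared deviations from the mean (numerator of a population variance). */
    static double m2(List<Double> values) {
        double mean = avg(values);
        return values.stream().mapToDouble(v -> (v - mean) * (v - mean)).sum();
    }

    /**
     * Returns {slope, intercept} for dep regressed on indep.
     * With pairwiseVariance=false, the variance term uses every non-null indep
     * value, even in rows where dep is null (the mismatch described in the issue).
     * With pairwiseVariance=true, it uses only rows where both values are non-null.
     */
    static double[] regress(Double[] dep, Double[] indep, boolean pairwiseVariance) {
        List<Double> depPairs = new ArrayList<>();
        List<Double> indepPairs = new ArrayList<>();
        List<Double> indepForVariance = new ArrayList<>();
        for (int i = 0; i < dep.length; i++) {
            boolean complete = dep[i] != null && indep[i] != null;
            if (complete) {
                depPairs.add(dep[i]);
                indepPairs.add(indep[i]);
            }
            if (indep[i] != null && (complete || !pairwiseVariance)) {
                indepForVariance.add(indep[i]);
            }
        }
        double depAvg = avg(depPairs);
        double indepAvg = avg(indepPairs);
        // Sum of co-deviations over complete pairs only.
        double ck = 0.0;
        for (int i = 0; i < depPairs.size(); i++) {
            ck += (depPairs.get(i) - depAvg) * (indepPairs.get(i) - indepAvg);
        }
        double slope = ck / m2(indepForVariance);
        return new double[] {slope, depAvg - slope * indepAvg};
    }

    public static void main(String[] args) {
        // Rows of tab(y, x): (1,1), (2,3), (3,5), (NULL,3), (3,NULL).
        // In regr_slope(x, y), column x is the first argument and y the second.
        Double[] x = {1.0, 3.0, 5.0, 3.0, null};
        Double[] y = {1.0, 2.0, 3.0, null, 3.0};
        double[] mismatched = regress(x, y, false);
        double[] pairwise = regress(x, y, true);
        System.out.println("variance over all non-null y: " + mismatched[0] + ", " + mismatched[1]);
        System.out.println("variance over complete pairs: " + pairwise[0] + ", " + pairwise[1]);
    }
}
```

With the five rows from the report, the co-deviation sum over the three complete pairs is 4. The squared-deviation sum of y is 2 over those three pairs but 2.75 over all four non-NULL y values, so the first line should come out close to 1.4545 and 0.0909, and the second line at 2.0 and -1.0.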