andygrove opened a new pull request, #4775:
URL: https://github.com/apache/datafusion-comet/pull/4775

   ## Which issue does this PR close?
   
   Closes #4552.
   
   ## Rationale for this change
   
   Comet already accelerates `regr_avgx`, `regr_avgy`, and `regr_count` (Spark 
rewrites those to `Average`/`Count`), but the remaining six SQL-standard linear 
regression aggregates fell back to Spark. This PR implements native support for 
`regr_slope`, `regr_intercept`, `regr_r2`, `regr_sxx`, `regr_syy`, and 
`regr_sxy`, so a query using any of them can run fully on Comet instead of 
falling back.
   
   ## What changes are included in this PR?
   
   - A new native `Regr` aggregate UDF in 
`native/spark-expr/src/agg_funcs/regr.rs`. Rather than re-implementing the 
statistics, each function is composed from Comet's existing Spark-compatible 
`CovarianceAccumulator` and `VarianceAccumulator`. This keeps the partial 
aggregation state byte-compatible with the buffer layout Spark's planner 
declares for the partial-to-final shuffle:
     - `regr_sxx` / `regr_syy` reach Comet as `RegrReplacement` (a 
`CentralMomentAgg`, 3-field buffer) and reuse the variance accumulator, 
evaluating to `m2`.
     - `regr_sxy` (a population `Covariance`, 4-field buffer) reuses the 
covariance accumulator, evaluating to the co-moment `ck`.
     - `regr_r2` (a `PearsonCorrelation`, 6-field buffer) composes covariance + 
two variances.
     - `regr_slope` / `regr_intercept` (a declarative composite, 7-field 
buffer) compose population covariance + variance.
   - `regr_r2` matches Spark's behavior of returning `1.0` when the dependent 
variable is constant but the independent variable varies (a perfect horizontal 
fit). This is the one case where Spark and DataFusion's `regr_r2` diverge 
(DataFusion returns `null`), so a Comet-specific accumulator is warranted.
   - A new `Regr` protobuf message (with a `RegrType` enum) wired through 
`QueryPlanSerde` and the native `PhysicalPlanner`.
   - Scala serde for `RegrSlope`, `RegrIntercept`, `RegrR2`, `RegrSXY`, and 
`RegrReplacement` (the rewrite target for `regr_sxx`/`regr_syy`).
   - Updated the expression support status in the user guide.
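
As a rough illustration of the composition described above (not the code in
`regr.rs`), the sketch below derives the regression results from running
co-moments. The `RegrState` struct, its field names, and its single flat
layout are assumptions made for the example; the PR instead reuses
`CovarianceAccumulator` and `VarianceAccumulator` with per-function buffers.
The NULL rules follow the cases listed in this description: a pair with a
NULL on either side is skipped, a constant independent variable yields NULL
for slope, intercept, and r2, and a constant dependent variable yields an r2
of `1.0`.

```rust
/// Hypothetical running state for (y, x) pairs; names are illustrative only.
#[derive(Debug, Default, Clone, Copy)]
struct RegrState {
    n: f64,
    mean_x: f64,
    mean_y: f64,
    ck: f64,   // co-moment: sum of (x - mean_x) * (y - mean_y), i.e. regr_sxy
    m2_x: f64, // sum of (x - mean_x)^2, i.e. regr_sxx
    m2_y: f64, // sum of (y - mean_y)^2, i.e. regr_syy
}

impl RegrState {
    /// Fold in one row with a Welford-style update; NULL pairs are skipped.
    fn update(&mut self, y: Option<f64>, x: Option<f64>) {
        let (Some(y), Some(x)) = (y, x) else { return };
        self.n += 1.0;
        let dx = x - self.mean_x; // deltas against the old means
        let dy = y - self.mean_y;
        self.mean_x += dx / self.n;
        self.mean_y += dy / self.n;
        self.ck += dx * (y - self.mean_y);
        self.m2_x += dx * (x - self.mean_x);
        self.m2_y += dy * (y - self.mean_y);
    }

    /// Combine two partial states, as a final aggregation stage would.
    fn merge(&mut self, o: &RegrState) {
        if o.n == 0.0 {
            return;
        }
        if self.n == 0.0 {
            *self = *o;
            return;
        }
        let n = self.n + o.n;
        let dx = o.mean_x - self.mean_x;
        let dy = o.mean_y - self.mean_y;
        let w = self.n * o.n / n;
        self.ck += o.ck + dx * dy * w;
        self.m2_x += o.m2_x + dx * dx * w;
        self.m2_y += o.m2_y + dy * dy * w;
        self.mean_x += dx * o.n / n;
        self.mean_y += dy * o.n / n;
        self.n = n;
    }

    /// Population covariance / population variance reduces to ck / m2_x.
    fn slope(&self) -> Option<f64> {
        (self.n > 0.0 && self.m2_x != 0.0).then(|| self.ck / self.m2_x)
    }

    fn intercept(&self) -> Option<f64> {
        self.slope().map(|b| self.mean_y - b * self.mean_x)
    }

    /// NULL when x is constant, 1.0 when only y is constant.
    fn r2(&self) -> Option<f64> {
        if self.n == 0.0 || self.m2_x == 0.0 {
            None
        } else if self.m2_y == 0.0 {
            Some(1.0)
        } else {
            Some(self.ck * self.ck / (self.m2_x * self.m2_y))
        }
    }
}
```

Feeding the pairs `(y, x)` = `(3, 1)`, `(5, 2)`, `(7, 3)` through `update`
gives a slope of 2, an intercept of 1, and an r2 of 1, up to floating-point
rounding. Splitting the same rows across two states and calling `merge`
should give the same results.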
   
   This work was scaffolded using the `implement-comet-expression` project 
skill.
   
   ## How are these changes tested?
   
   - Expanded the existing 
`spark/src/test/resources/sql-tests/expressions/aggregate/regr.sql` so all six 
functions are verified to run natively and match Spark (previously 
`spark_answer_only`). New cases cover NULL pairs, single-pair input, a constant 
independent variable (slope/intercept/r2 evaluate to NULL), a constant dependent 
variable (r2 evaluates to 1.0), grouped aggregation, and literal/column argument 
mixes.
   - Added Rust unit tests in `regr.rs` covering perfect-fit 
slope/intercept/r2, the constant-y and constant-x edge cases, single-pair and 
empty input, NULL-pair skipping, the raw moments, and partial-state merge 
across batches.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

