Re: [I] Deduplicate Spark function code with native/default datafusion function code [datafusion]

via GitHub Mon, 15 Dec 2025 04:43:32 -0800


kumarUjjawal commented on issue #17964:
URL: https://github.com/apache/datafusion/issues/17964#issuecomment-3655426962


   @Jefffrey I took a look at avg and have some questions.
   This is the current behaviour:
   
   - Spark avg only handles numeric→Float64 input, is non-distinct, and uses an 
i64 count with a Float64 sum. Its state schema is [sum: input_type, count: Int64].
   - DF avg supports decimals/durations/ints/floats, distinct, u64 counts, and 
richer accumulators/state.
   
   My thoughts:
   1. Extract a configurable/shared avg in `datafusion_functions` (or a shared 
helper) that supports a “Spark mode” (i64 counts, Spark state schema) but otherwise 
reuses the DF avg implementation (type coercion, distinct, groups accumulator).
   2. Replace the Spark avg implementation with a thin wrapper 
(`make_udaf_function!` style) over that shared avg, carrying only the 
Spark-specific differences (e.g., count type or any ANSI-mode tweaks).
   3. If the count type must stay `i64` (is this what we want?), we can make it a 
small configuration knob in the shared code rather than a forked accumulator; 
otherwise we can align it to DF’s u64 to remove more divergence.
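
To make point 3 concrete, here is a rough, standalone sketch of what a count-type knob could look like. It deliberately does not use any DataFusion traits or types: `CountKind`, `CountValue`, `SharedAvg`, `spark_avg` and `df_avg` are all hypothetical names made up for illustration, and it only covers the Float64 case.

```rust
/// Hypothetical knob: which integer type the count uses in the state.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CountKind {
    /// Spark-style signed 64-bit count.
    Int64,
    /// DF-style unsigned 64-bit count.
    UInt64,
}

/// Count value as it would appear in the intermediate state.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CountValue {
    Int64(i64),
    UInt64(u64),
}

/// Minimal Float64 average accumulator shared by both flavours (illustrative only).
#[derive(Debug)]
struct SharedAvg {
    kind: CountKind,
    sum: f64,
    count: u64,
}

impl SharedAvg {
    fn new(kind: CountKind) -> Self {
        Self { kind, sum: 0.0, count: 0 }
    }

    /// Fold a batch of nullable values into the running sum; nulls are skipped.
    fn update_batch(&mut self, values: &[Option<f64>]) {
        for v in values.iter().flatten() {
            self.sum += v;
            self.count += 1;
        }
    }

    /// Intermediate state as (sum, count); only the count's type depends on the knob.
    fn state(&self) -> (f64, CountValue) {
        let count = match self.kind {
            CountKind::Int64 => CountValue::Int64(self.count as i64),
            CountKind::UInt64 => CountValue::UInt64(self.count),
        };
        (self.sum, count)
    }

    /// Merge a partial state produced by another accumulator.
    fn merge(&mut self, other_sum: f64, other_count: CountValue) {
        self.sum += other_sum;
        self.count += match other_count {
            CountValue::Int64(c) => c as u64,
            CountValue::UInt64(c) => c,
        };
    }

    /// Final average, or None when no non-null rows were seen.
    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}

/// The "thin wrapper" idea: each flavour only picks the configuration.
fn spark_avg() -> SharedAvg {
    SharedAvg::new(CountKind::Int64)
}

fn df_avg() -> SharedAvg {
    SharedAvg::new(CountKind::UInt64)
}
```

In this sketch the accumulation logic exists once and the two flavours differ only in how the count is typed in the state, which is the kind of divergence a knob would have to carry if we keep `i64`.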
   
   I would like to know your thoughts on this.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


