jcsherin commented on code in PR #17003:
URL: https://github.com/apache/datafusion/pull/17003#discussion_r2248726677
##########
datafusion/functions-aggregate/src/approx_percentile_cont.rs:
##########
@@ -360,9 +366,14 @@ impl ApproxPercentileAccumulator {
}
}
- // public for approx_percentile_cont_with_weight
+ // Merge new TDigests into this accumulator. Public for
approx_percentile_cont_with_weight.
+ //
+ // Important: max_size Preservation
+ // TDigest::merge_digests uses the max_size from the first digest in the
iterator.
+ // By putting self.digest first, we ensure the accumulator's configured
max_size
+ // is preserved rather than being overridden by the new digests' max_size.
Review Comment:
After some more thinking, I think we have to guarantee that in the context
of a single `approx_percentile_cont_with_weight` function call all the partial
aggregates are created with the same `max_size` value.
Therefore the `max_size` should be user provided value, or the default
argument if omitted.
The underlying t-digest sketch algorithm is flexible enough to allow merging
digests with different `max_size` values into one, where the final precision
depends on how merge is implemented internally.
Regardless of the algorithm's flexibility, we must define clear semantics at
the function call level for predictability.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]