liamzwbao commented on code in PR #17003: URL: https://github.com/apache/datafusion/pull/17003#discussion_r2248523303
########## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ########## @@ -360,9 +366,14 @@ impl ApproxPercentileAccumulator { } } - // public for approx_percentile_cont_with_weight + // Merge new TDigests into this accumulator. Public for approx_percentile_cont_with_weight. + // + // Important: max_size Preservation + // TDigest::merge_digests uses the max_size from the first digest in the iterator. + // By putting self.digest first, we ensure the accumulator's configured max_size + // is preserved rather than being overridden by the new digests' max_size. Review Comment: I changed the order of the iteration so `DEFAULT_MAX_SIZE` no longer needs to change in `update_batch`. Which approach should we take? 1. Keep the current approach - Use the `max_size` from the first `TDigest`. - Re-order merges so inner configs aren’t overwritten. 2. Caller sets `max_size` (from @jcsherin) - Have `ApproxPercentileWithWeightAccumulator::update_batch` assign a uniform `max_size` to every `TDigest`. - Roll back the change in this function. 3. Take the max `max_size` (from @crepererum) - Update `TDigest::merge_digests` to take the largest `max_size` among inputs. - Safer if callers forget to align configs, but might not be that performant than taking the first config? - A very large `max_size` could override user-defined config? Would appreciate ideas on different approach. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org