jcsherin commented on code in PR #17003:
URL: https://github.com/apache/datafusion/pull/17003#discussion_r2248567975


##########
datafusion/functions-aggregate/src/approx_percentile_cont.rs:
##########
@@ -360,9 +366,14 @@ impl ApproxPercentileAccumulator {
         }
     }
 
-    // public for approx_percentile_cont_with_weight
+    // Merge new TDigests into this accumulator. Public for 
approx_percentile_cont_with_weight.
+    //
+    // Important: max_size Preservation
+    // TDigest::merge_digests uses the max_size from the first digest in the 
iterator.
+    // By putting self.digest first, we ensure the accumulator's configured 
max_size
+    // is preserved rather than being overridden by the new digests' max_size.

Review Comment:
   > 3\. Update `TDigest::merge_digests` to take the largest `max_size` among 
inputs.
   
   It looks like there is an assumption implicit in the merging code. The 
instances which are being merged are created with the same `max_size` value.
   
   
https://github.com/apache/datafusion/blob/6ea01d13362f33aca5434b18c632fe2f43e60ab9/datafusion/functions-aggregate-common/src/tdigest.rs#L393C9-L393C58
   
   If the above is not true, then the following problems come up as rightly 
called out here:
   
   > * Safer if callers forget to align configs, but might not be that 
performant than taking the first config?
   > 
   > * A very large `max_size` could override user-defined config?
   
   But if the `max_size` value will not change between t-digest instances then 
we do not have an ordering problem.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to