Re: [PR] Support `centroids` config for `approx_percentile_cont_with_weight` [datafusion]

via GitHub Fri, 01 Aug 2025 02:45:04 -0700


crepererum commented on code in PR #17003:
URL: https://github.com/apache/datafusion/pull/17003#discussion_r2247473083



##########
datafusion/functions-aggregate/src/approx_percentile_cont.rs:
##########
@@ -360,9 +366,14 @@ impl ApproxPercentileAccumulator {
         }
     }
 
-    // public for approx_percentile_cont_with_weight
+    // Merge new TDigests into this accumulator. Public for 
approx_percentile_cont_with_weight.
+    //
+    // Important: max_size Preservation
+    // TDigest::merge_digests uses the max_size from the first digest in the 
iterator.
+    // By putting self.digest first, we ensure the accumulator's configured 
max_size
+    // is preserved rather than being overridden by the new digests' max_size.

Review Comment:
   ```suggestion
       /// Merge new TDigests into this accumulator. Public for 
`approx_percentile_cont_with_weight`.
       ///
       /// Important: `max_size` Preservation
       /// [`TDigest::merge_digests`] uses the `max_size` from the first digest 
in the iterator.
       /// By putting self.digest first, we ensure the accumulator's configured 
`max_size`
       /// is preserved rather than being overridden by the new digests' 
`max_size`.
   ```
   
   Should we make this a real docstring, so that IDEs and rustdoc can see it?



##########
docs/source/user-guide/sql/aggregate_functions.md:
##########
@@ -1039,7 +1039,7 @@ approx_median(expression)
 Returns the approximate percentile of input values using the t-digest 
algorithm.
 
 ```sql
-approx_percentile_cont(percentile, centroids) WITHIN GROUP (ORDER BY 
expression)
+approx_percentile_cont(percentile [, centroids]) WITHIN GROUP (ORDER BY 
expression)

Review Comment:
   I like that change. I was really confused when I read the docs the first 
time but saw queries with 1 argument. I think this is easier to understand 👍 



##########
datafusion/functions-aggregate/src/approx_percentile_cont.rs:
##########
@@ -360,9 +366,14 @@ impl ApproxPercentileAccumulator {
         }
     }
 
-    // public for approx_percentile_cont_with_weight
+    // Merge new TDigests into this accumulator. Public for 
approx_percentile_cont_with_weight.
+    //
+    // Important: max_size Preservation
+    // TDigest::merge_digests uses the max_size from the first digest in the 
iterator.
+    // By putting self.digest first, we ensure the accumulator's configured 
max_size
+    // is preserved rather than being overridden by the new digests' max_size.

Review Comment:
   Behavior-wise, I wonder if we should change it to "uses the max of 
`max_size` over all inputs" instead of "uses the first input". The reason is 
that orders within the planner and in queries are often not that well-defined, 
and depending on it easily leads to nondeterministic results that are hard to 
debug.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Support `centroids` config for `approx_percentile_cont_with_weight` [datafusion]

Reply via email to