alamb commented on code in PR #19941:
URL: https://github.com/apache/datafusion/pull/19941#discussion_r2743067144


##########
datafusion/functions-aggregate-common/src/tdigest.rs:
##########
@@ -110,7 +99,7 @@ pub struct TDigest {
     centroids: Vec<Centroid>,
     max_size: usize,
     sum: f64,
-    count: u64,
+    count: f64,

Review Comment:
   > If count were a u64, it couldn't accurately represent the sum of 
fractional weights, leading to inaccurate percentile calculations
   
   Thank you @sesteves 
   
   Interestingly in ClickHouse I see the count is updated with the current sum
   
   
https://github.com/ClickHouse/ClickHouse/blob/927af1255adb37ace1b95cc3ec4316553b4cb4b4/src/AggregateFunctions/QuantileTDigest.h#L191
   
   ```c++
               count = sum + l_count; // Update count, it might be different 
due to += inaccuracy
   ```
   
   I see now this is similar to  how `TDigest::new_with_centroid` sets count to 
the existing weight
   
   
https://github.com/alamb/datafusion/blob/4b35f827725ddd47f739b3731b6a3ff2a10a3c2d/datafusion/functions-aggregate-common/src/tdigest.rs#L125-L124
   
   So it seems like `count` is actually some sort of weighted total rather than 
a count which confused me
   
   No changes needed. Thank you for the explanation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to