Re: [PR] Handle empty data in CountAggregator (druid)

via GitHub Tue, 22 Jul 2025 21:20:57 -0700


clintropolis commented on code in PR #18307:
URL: https://github.com/apache/druid/pull/18307#discussion_r2224306600



##########
processing/src/main/java/org/apache/druid/query/aggregation/LongSumAggregatorFactory.java:
##########
@@ -36,25 +36,49 @@ public class LongSumAggregatorFactory extends 
SimpleLongAggregatorFactory
 {
   private final Supplier<byte[]> cacheKey;
 
+  /**
+   * If false, the aggregator will return null when there are no values to 
aggregate. This is the default behavior.
+   * Otherwise, the aggregator must return a default non-null value. This is 
only used internally and not user-facing.
+   */

Review Comment:
   similar to my comment on other javadoc in `NullableNumericAggregator`, this 
isn't strictly about init, since it also makes it so we call aggregate on null 
rows (and hopefully get a 0 in all cases)
   
   Also probably worth calling out that is used by count and other callers use 
at their own risk



##########
processing/src/main/java/org/apache/druid/query/timeseries/TimeseriesQueryEngine.java:
##########
@@ -181,7 +181,8 @@ private Sequence<Result<TimeseriesResultValue>> 
processVectorized(
               .simple(granularizer.getBucketIterable())
               .map(
                   bucketInterval -> {
-                    // Whether or not the current bucket is empty
+                    // If current bucket is empty and skipEmptyBuckets is 
true, return null.
+                    // Otherwise, always call aggregators.init, then 
aggregators.aggregateVector if there's any data.

Review Comment:
   >i'd think this is more like describing what the entire function does, 
considering that:
   
   It feels like its missing some stuff if that is the goal, like how we 
advance the cursor aggregating values until reaching the end of the cursor, 
using the granularizer to chunk up the __time values to do the time grouping if 
applicable. tbh i'm not really sure this block needed a comment at all
   
   >the init is being called in two places, so just want to call out early here
   
   init is only called in the second place to initialize the values if we 
didn't aggregate anything. The `emptyBucket` variable could also have been 
called aggsInitialized, or didAggregate or something, since the only reason we 
call init after the first loop is if we didn't aggregate anything, but 
emptyBucket makes sense in the context of timeseries
   
   >when non-vector, there's no such thing, so thought it might be nice to call 
out early
   
   the reason the non-vectorized engine doesn't use have an init is because it 
uses on heap aggregators, which are constructed initialized... if it used 
`BufferAggregator` instead it would need to have the same logic to initialize 
the value in the memory location it is given to store the aggregates like this 
one does



##########
processing/src/main/java/org/apache/druid/query/aggregation/NullableNumericAggregatorFactory.java:
##########
@@ -48,22 +48,35 @@
 public abstract class NullableNumericAggregatorFactory<T extends 
BaseNullableColumnValueSelector>
     extends AggregatorFactory
 {
+  /**
+   * If true, the aggregator will not return null values, even if there's no 
data to aggregate. In this case, it would
+   * rely on the concrete implementation to return a non-null value.
+   */

Review Comment:
   i think this would more accurately be described as something like "if true, 
this aggregator will not check for null inputs, instead directly delegating to 
the underlying aggregator" since it isn't really just specific to just the init 
value - since we arent wrapping the things with the nullable numeric stuff it 
means we will call `aggregate` on null rows too, which _should_ result in zeros 
being read.
   
   To be honest, I'm not completely sure this is generally safe, iirc some 
column value selectors might explode if we try to get a primitive value from a 
null row, so it might be worth calling out that this is only known to be safely 
used to make the count aggregator behave correctly when combining empty results.



##########
processing/src/test/java/org/apache/druid/query/groupby/GroupByQueryRunnerTest.java:
##########
@@ -3399,26 +3399,70 @@ public void 
testGroupByWithUniquesAndPostAggWithSameName()
   @Test
   public void testGroupByWithCardinality()
   {
-    GroupByQuery query = makeQueryBuilder()
+    // arrange

Review Comment:
   i suppose we can agree to disagree, i think a line break and clear variable 
names can achieve the same organizational stuff without being extra text I need 
to ignore 😅 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Handle empty data in CountAggregator (druid)

Reply via email to