Re: [PR] Count distinct returned incorrect results without useApproximateCountDistinct (druid)

via GitHub Mon, 21 Aug 2023 21:03:44 -0700


clintropolis commented on code in PR #14748:
URL: https://github.com/apache/druid/pull/14748#discussion_r1300877178



##########
processing/src/main/java/org/apache/druid/query/aggregation/NullableNumericAggregatorFactory.java:
##########
@@ -85,6 +85,12 @@ public final int getMaxIntermediateSizeWithNulls()
     return getMaxIntermediateSize() + (NullHandling.replaceWithDefault() ? 0 : 
Byte.BYTES);
   }
 
+  protected boolean canHandleNulls()
+  {
+    return NullHandling.replaceWithDefault();
+  }

Review Comment:
   This reminds me of an old PR of mine which I totally forgot about until 
reviewing this PR, #10666, which made a different but somewhat similar change.
   
   In that PR, I was adding a method called `hasNulls` to the nullable numeric 
aggregator (side note, i probably should resurrect this old PR), to separate 
processing in SQL compatible mode for columns which have nulls and those which 
don't, but which should still output a null value if not aggregated.
   
   The changes in that PR would not actually fix the problems here, and I think 
maybe help me frame what actually bothers me about the changes in this PR, 
since the changes here are somewhat contradictory to the changes there.
   
   Anyway, I think what bothers me about this change is that it is to handle 
logic that seems entirely specific to the special case of how COUNT 
aggregations are handled, and I can't actually see any good use for this 
otherwise, and because of that I'm not entirely sure it makes sense to add in 
this way. If we ignore the default value mode (which is now a legacy mode as of 
#14792), and consider the changes i made in the other PR, then this family of 
aggregators is for implementing math aggregator functions which may or may not 
have null values, but should always be null unless aggregated.
   
   COUNT however, is a special case in that it should be 0, even if not 
aggregated, so I think adding special handling to this family aggregators 
specifically for the 'merging' aggregator for COUNT seems like the wrong thing 
to do.
   
   Instead, I think i would be more in favor of just special case implementing 
the merging/combining aggregator for COUNT so that it always spits out zeros 
even if not aggregated, even if it means not sharing some of the same code that 
we use for SUM, because it is a special case that i can't see having a general 
use for other types of numeric aggregators, only COUNT. It probably still could 
re-use the base aggregator implementations for sum that are used in default 
value mode, just wouldn't impact the aggregator factory implementations.
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Count distinct returned incorrect results without useApproximateCountDistinct (druid)

Reply via email to