clintropolis commented on code in PR #18307:
URL: https://github.com/apache/druid/pull/18307#discussion_r2224306600
##########
processing/src/main/java/org/apache/druid/query/aggregation/LongSumAggregatorFactory.java:
##########
@@ -36,25 +36,49 @@ public class LongSumAggregatorFactory extends
SimpleLongAggregatorFactory
{
private final Supplier<byte[]> cacheKey;
+ /**
+ * If false, the aggregator will return null when there are no values to
aggregate. This is the default behavior.
+ * Otherwise, the aggregator must return a default non-null value. This is
only used internally and not user-facing.
+ */
Review Comment:
similar to my comment on other javadoc in `NullableNumericAggregator`, this
isn't strictly about init, since it also makes it so we call aggregate on null
rows (and hopefully get a 0 in all cases)
Also probably worth calling out that is used by count and other callers use
at their own risk
##########
processing/src/main/java/org/apache/druid/query/timeseries/TimeseriesQueryEngine.java:
##########
@@ -181,7 +181,8 @@ private Sequence<Result<TimeseriesResultValue>>
processVectorized(
.simple(granularizer.getBucketIterable())
.map(
bucketInterval -> {
- // Whether or not the current bucket is empty
+ // If current bucket is empty and skipEmptyBuckets is
true, return null.
+ // Otherwise, always call aggregators.init, then
aggregators.aggregateVector if there's any data.
Review Comment:
>i'd think this is more like describing what the entire function does,
considering that:
It feels like its missing some stuff if that is the goal, like how we
advance the cursor aggregating values until reaching the end of the cursor,
using the granularizer to chunk up the __time values to do the time grouping if
applicable. tbh i'm not really sure this block needed a comment at all
>the init is being called in two places, so just want to call out early here
init is only called in the second place to initialize the values if we
didn't aggregate anything. The `emptyBucket` variable could also have been
called aggsInitialized, or didAggregate or something, since the only reason we
call init after the first loop is if we didn't aggregate anything, but
emptyBucket makes sense in the context of timeseries
>when non-vector, there's no such thing, so thought it might be nice to call
out early
the reason the non-vectorized engine doesn't use have an init is because it
uses on heap aggregators, which are constructed initialized... if it used
`BufferAggregator` instead it would need to have the same logic to initialize
the value in the memory location it is given to store the aggregates like this
one does
##########
processing/src/main/java/org/apache/druid/query/aggregation/NullableNumericAggregatorFactory.java:
##########
@@ -48,22 +48,35 @@
public abstract class NullableNumericAggregatorFactory<T extends
BaseNullableColumnValueSelector>
extends AggregatorFactory
{
+ /**
+ * If true, the aggregator will not return null values, even if there's no
data to aggregate. In this case, it would
+ * rely on the concrete implementation to return a non-null value.
+ */
Review Comment:
i think this would more accurately be described as something like "if true,
this aggregator will not check for null inputs, instead directly delegating to
the underlying aggregator" since it isn't really just specific to just the init
value - since we arent wrapping the things with the nullable numeric stuff it
means we will call `aggregate` on null rows too, which _should_ result in zeros
being read.
To be honest, I'm not completely sure this is generally safe, iirc some
column value selectors might explode if we try to get a primitive value from a
null row, so it might be worth calling out that this is only known to be safely
used to make the count aggregator behave correctly when combining empty results.
##########
processing/src/test/java/org/apache/druid/query/groupby/GroupByQueryRunnerTest.java:
##########
@@ -3399,26 +3399,70 @@ public void
testGroupByWithUniquesAndPostAggWithSameName()
@Test
public void testGroupByWithCardinality()
{
- GroupByQuery query = makeQueryBuilder()
+ // arrange
Review Comment:
i suppose we can agree to disagree, i think a line break and clear variable
names can achieve the same organizational stuff without being extra text I need
to ignore 😅
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]