[GitHub] [druid] clintropolis commented on a change in pull request #11188: SQL timeseries no longer skip empty buckets with all granularity

GitBox Mon, 03 May 2021 11:24:57 -0700


clintropolis commented on a change in pull request #11188:
URL: https://github.com/apache/druid/pull/11188#discussion_r625277167




##########
File path: docs/querying/sql.md
##########
@@ -313,46 +313,48 @@ possible for two aggregators in the same SQL query to 
have different filters.
 
 Only the COUNT aggregation can accept DISTINCT.
 
+When no rows are selected, aggregate functions will return their initialized 
value for the grouping they belong to. What this value is exactly for a given 
aggregator is dependent on the configuration of Druid's SQL compatible null 
handling mode, controlled by `druid.generic.useDefaultValueForNull`. The table 
below defines the initial values for all aggregate functions in both modes.
+
 > The order of aggregation operations across segments is not deterministic. 
 > This means that non-commutative aggregation
 > functions can produce inconsistent results across the same query. 
 >
 > Functions that operate on an input type of "float" or "double" may also see 
 > these differences in aggregation
 > results across multiple query runs because of this. If precisely the same 
 > value is desired across multiple query runs,
 > consider using the `ROUND` function to smooth out the inconsistencies 
 > between queries.  
 
-|Function|Notes|
-|--------|-----|
-|`COUNT(*)`|Counts the number of rows.|
-|`COUNT(DISTINCT expr)`|Counts distinct values of expr, which can be string, 
numeric, or hyperUnique. By default this is approximate, using a variant of 
[HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). To 
get exact counts set "useApproximateCountDistinct" to "false". If you do this, 
expr must be string or numeric, since exact counts are not possible using 
hyperUnique columns. See also `APPROX_COUNT_DISTINCT(expr)`. In exact mode, 
only one distinct count per query is permitted unless 
`useGroupingSetForExactDistinct` is set to true in query contexts or broker 
configurations.|
-|`SUM(expr)`|Sums numbers.|
-|`MIN(expr)`|Takes the minimum of numbers.|
-|`MAX(expr)`|Takes the maximum of numbers.|
-|`AVG(expr)`|Averages numbers.|
-|`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of expr, which can be a 
regular column or a hyperUnique column. This is always approximate, regardless 
of the value of "useApproximateCountDistinct". This uses Druid's built-in 
"cardinality" or "hyperUnique" aggregators. See also `COUNT(DISTINCT expr)`.|
-|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct 
values of expr, which can be a regular column or an [HLL 
sketch](../development/extensions-core/datasketches-hll.md) column. The `lgK` 
and `tgtHllType` parameters are described in the HLL sketch documentation. This 
is always approximate, regardless of the value of 
"useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The 
[DataSketches 
extension](../development/extensions-core/datasketches-extension.md) must be 
loaded to use this function.|
-|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of 
expr, which can be a regular column or a [Theta 
sketch](../development/extensions-core/datasketches-theta.md) column. The 
`size` parameter is described in the Theta sketch documentation. This is always 
approximate, regardless of the value of "useApproximateCountDistinct". See also 
`COUNT(DISTINCT expr)`. The [DataSketches 
extension](../development/extensions-core/datasketches-extension.md) must be 
loaded to use this function.|
-|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL 
sketch](../development/extensions-core/datasketches-hll.md) on the values of 
expr, which can be a regular column or a column containing HLL sketches. The 
`lgK` and `tgtHllType` parameters are described in the HLL sketch 
documentation. The [DataSketches 
extension](../development/extensions-core/datasketches-extension.md) must be 
loaded to use this function.|
-|`DS_THETA(expr, [size])`|Creates a [Theta 
sketch](../development/extensions-core/datasketches-theta.md) on the values of 
expr, which can be a regular column or a column containing Theta sketches. The 
`size` parameter is described in the Theta sketch documentation. The 
[DataSketches 
extension](../development/extensions-core/datasketches-extension.md) must be 
loaded to use this function.|
-|`APPROX_QUANTILE(expr, probability, [resolution])`|Computes approximate 
quantiles on numeric or 
[approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator)
 exprs. The "probability" should be between 0 and 1 (exclusive). The 
"resolution" is the number of centroids to use for the computation. Higher 
resolutions will give more precise results but also have higher overhead. If 
not provided, the default resolution is 50. The [approximate histogram 
extension](../development/extensions-core/approximate-histograms.md) must be 
loaded to use this function.|
-|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles 
on numeric or [Quantiles 
sketch](../development/extensions-core/datasketches-quantiles.md) exprs. The 
"probability" should be between 0 and 1 (exclusive). The `k` parameter is 
described in the Quantiles sketch documentation. The [DataSketches 
extension](../development/extensions-core/datasketches-extension.md) must be 
loaded to use this function.|
-|`APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit, 
upperLimit, [outlierHandlingMode])`|Computes approximate quantiles on numeric 
or [fixed buckets 
histogram](../development/extensions-core/approximate-histograms.md#fixed-buckets-histogram)
 exprs. The "probability" should be between 0 and 1 (exclusive). The 
`numBuckets`, `lowerLimit`, `upperLimit`, and `outlierHandlingMode` parameters 
are described in the fixed buckets histogram documentation. The [approximate 
histogram extension](../development/extensions-core/approximate-histograms.md) 
must be loaded to use this function.|
-|`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles 
sketch](../development/extensions-core/datasketches-quantiles.md) on the values 
of expr, which can be a regular column or a column containing quantiles 
sketches. The `k` parameter is described in the Quantiles sketch documentation. 
The [DataSketches 
extension](../development/extensions-core/datasketches-extension.md) must be 
loaded to use this function.|
-|`BLOOM_FILTER(expr, numEntries)`|Computes a bloom filter from values produced 
by `expr`, with `numEntries` maximum number of distinct values before false 
positive rate increases. See [bloom filter 
extension](../development/extensions-core/bloom-filter.md) documentation for 
additional details.|
-|`TDIGEST_QUANTILE(expr, quantileFraction, [compression])`|Builds a T-Digest 
sketch on values produced by `expr` and returns the value for the quantile. 
Compression parameter (default value 100) determines the accuracy and size of 
the sketch. Higher compression means higher accuracy but more space to store 
sketches. See [t-digest 
extension](../development/extensions-contrib/tdigestsketch-quantiles.md) 
documentation for additional details.|
-|`TDIGEST_GENERATE_SKETCH(expr, [compression])`|Builds a T-Digest sketch on 
values produced by `expr`. Compression parameter (default value 100) determines 
the accuracy and size of the sketch Higher compression means higher accuracy 
but more space to store sketches. See [t-digest 
extension](../development/extensions-contrib/tdigestsketch-quantiles.md) 
documentation for additional details.|
-|`VAR_POP(expr)`|Computes variance population of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|
-|`VAR_SAMP(expr)`|Computes variance sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|
-|`VARIANCE(expr)`|Computes variance sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|
-|`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See 
[stats extension](../development/extensions-core/stats.md) documentation for 
additional details.|
-|`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|
-|`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|
-|`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric. 
If `expr` comes from a relation with a timestamp column (like a Druid 
datasource) then "earliest" is the value first encountered with the minimum 
overall timestamp of all values being aggregated. If `expr` does not come from 
a relation with a timestamp, then it is simply the first value encountered.|
-|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. 
The `maxBytesPerString` parameter determines how much aggregation space to 
allocate per string. Strings longer than this limit will be truncated. This 
parameter should be set as low as possible, since high values will lead to 
wasted memory.|
-|`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If 
`expr` comes from a relation with a timestamp column (like a Druid datasource) 
then "latest" is the value last encountered with the maximum overall timestamp 
of all values being aggregated. If `expr` does not come from a relation with a 
timestamp, then it is simply the last value encountered.|
-|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The 
`maxBytesPerString` parameter determines how much aggregation space to allocate 
per string. Strings longer than this limit will be truncated. This parameter 
should be set as low as possible, since high values will lead to wasted memory.|
-|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be 
numeric. This aggregator can simplify and optimize the performance by returning 
the first encountered value (including null)|
-|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. 
The `maxBytesPerString` parameter determines how much aggregation space to 
allocate per string. Strings longer than this limit will be truncated. This 
parameter should be set as low as possible, since high values will lead to 
wasted memory.|
-|`GROUPING(expr, expr...)`|Returns a number to indicate which groupBy 
dimension is included in a row, when using `GROUPING SETS`. Refer to 
[additional documentation](aggregations.md#grouping-aggregator) on how to infer 
this number.|
+|Function|Notes|Default|
+|--------|-----|-------|
+|`COUNT(*)`|Counts the number of rows.|`0`|
+|`COUNT(DISTINCT expr)`|Counts distinct values of expr, which can be string, 
numeric, or hyperUnique. By default this is approximate, using a variant of 
[HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). To 
get exact counts set "useApproximateCountDistinct" to "false". If you do this, 
expr must be string or numeric, since exact counts are not possible using 
hyperUnique columns. See also `APPROX_COUNT_DISTINCT(expr)`. In exact mode, 
only one distinct count per query is permitted unless 
`useGroupingSetForExactDistinct` is set to true in query contexts or broker 
configurations.|`0`|
+|`SUM(expr)`|Sums numbers.|`0` in 'default' mode, `null` in SQL compatible 
mode|
+|`MIN(expr)`|Takes the minimum of numbers.|`Long.MAX_VALUE` in 'default' mode, 
`null` in SQL compatible mode|
+|`MAX(expr)`|Takes the maximum of numbers.|`Long.MIN_VALUE` in 'default' mode, 
`null` in SQL compatible mode|
+|`AVG(expr)`|Averages numbers.|`0` in 'default' mode, `null` in SQL compatible 
mode|
+|`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of expr, which can be a 
regular column or a hyperUnique column. This is always approximate, regardless 
of the value of "useApproximateCountDistinct". This uses Druid's built-in 
"cardinality" or "hyperUnique" aggregators. See also `COUNT(DISTINCT 
expr)`.|`0`|
+|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct 
values of expr, which can be a regular column or an [HLL 
sketch](../development/extensions-core/datasketches-hll.md) column. The `lgK` 
and `tgtHllType` parameters are described in the HLL sketch documentation. This 
is always approximate, regardless of the value of 
"useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The 
[DataSketches 
extension](../development/extensions-core/datasketches-extension.md) must be 
loaded to use this function.|`0`|
+|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of 
expr, which can be a regular column or a [Theta 
sketch](../development/extensions-core/datasketches-theta.md) column. The 
`size` parameter is described in the Theta sketch documentation. This is always 
approximate, regardless of the value of "useApproximateCountDistinct". See also 
`COUNT(DISTINCT expr)`. The [DataSketches 
extension](../development/extensions-core/datasketches-extension.md) must be 
loaded to use this function.|`0`|
+|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL 
sketch](../development/extensions-core/datasketches-hll.md) on the values of 
expr, which can be a regular column or a column containing HLL sketches. The 
`lgK` and `tgtHllType` parameters are described in the HLL sketch 
documentation. The [DataSketches 
extension](../development/extensions-core/datasketches-extension.md) must be 
loaded to use this function.|`'0'` (STRING)|
+|`DS_THETA(expr, [size])`|Creates a [Theta 
sketch](../development/extensions-core/datasketches-theta.md) on the values of 
expr, which can be a regular column or a column containing Theta sketches. The 
`size` parameter is described in the Theta sketch documentation. The 
[DataSketches 
extension](../development/extensions-core/datasketches-extension.md) must be 
loaded to use this function.|`'0.0'` (STRING)|

Review comment:
   Hmm, it actually returns a double, but we don't examine the finalized
   type, so Calcite thinks it is complex. I guess that is how it ends up as a
   string instead of a double: it just tries to serialize complex values (and
   I assume the same is true of the other sketches that return a string result).
   The Theta sketch aggregator, along with many of the quantiles aggregators,
   finalizes its sketch by default, producing the estimate instead of
   serializing the sketch like some of the other complex types do. There is a
   bit more discussion about it in #9419, but it doesn't really look like it
   went anywhere.
   
   I guess this also brings up the question of whether we need to describe the
   difference between intermediary types and finalized types here. This table
   is currently showing finalized types, which is why I called the column the
   default value: it is what people will see, rather than the initialized value
   of the aggregator, which for all of these complex types is going to be an
   empty sketch.
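
   To make the distinction concrete, here is a rough sketch of the three values involved: the initialized value (an empty sketch), the finalized value (a double estimate), and what the user sees when the planner still treats the column as complex. This is not Druid or Calcite code. `ToyDistinctSketch` and `render` are hypothetical names, and an exact set stands in for a real Theta sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDistinctSketch:
    """Hypothetical stand-in for a Theta sketch; keeps exact values in a set."""
    values: set = field(default_factory=set)

    def update(self, value):
        self.values.add(value)

    def finalize(self) -> float:
        # Finalizing yields the estimate as a double, not the sketch itself.
        return float(len(self.values))


def render(value, declared_type: str):
    """Hypothetical result writer: values declared COMPLEX are written as strings."""
    if declared_type == "COMPLEX":
        return str(value)
    return value


initial = ToyDistinctSketch()               # initialized value: an empty sketch
finalized = initial.finalize()              # finalized value: a double
as_complex = render(finalized, "COMPLEX")   # declared type ignores finalization
as_double = render(finalized, "DOUBLE")     # declared type matches the finalized type
```

   Under these assumptions, `as_complex` is the string `'0.0'`, which lines up with the `'0.0'` (STRING) entry in the table, while `as_double` stays a numeric `0.0`.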




