[druid] branch master updated: Allow EARLIEST/EARLIEST_BY/LATEST/LATEST_BY for STRING columns without specifying maxStringBytes (#14848)

cwylie Tue, 22 Aug 2023 22:50:30 -0700

This is an automated email from the ASF dual-hosted git repository.

cwylie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git



The following commit(s) were added to refs/heads/master by this push:
     new e806d09309 Allow EARLIEST/EARLIEST_BY/LATEST/LATEST_BY for STRING columns without specifying maxStringBytes (#14848)
e806d09309 is described below

commit e806d09309942c776749d4364d5cbf3a62ac34d4
Author: Zoltan Haindrich <[email protected]>
AuthorDate: Wed Aug 23 07:50:19 2023 +0200

    Allow EARLIEST/EARLIEST_BY/LATEST/LATEST_BY for STRING columns without specifying maxStringBytes (#14848)
---
 docs/querying/sql-aggregations.md                  | 15 +++-----
 docs/querying/sql-functions.md                     | 22 +++---------
 .../builtin/EarliestLatestAnySqlAggregator.java    | 13 ++++---
 .../builtin/EarliestLatestBySqlAggregator.java     |  2 +-
 .../apache/druid/sql/calcite/CalciteQueryTest.java | 42 +++++++++++++++++++---
 5 files changed, 54 insertions(+), 40 deletions(-)

diff --git a/docs/querying/sql-aggregations.md b/docs/querying/sql-aggregations.md
index f9233d40f7..b6a6748e62 100644
--- a/docs/querying/sql-aggregations.md
+++ b/docs/querying/sql-aggregations.md
@@ -86,16 +86,11 @@ In the aggregation functions supported by Druid, only `COUNT`, `ARRAY_AGG`, and
 |`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See 
[stats extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` or `0` if 
`druid.generic.useDefaultValueForNull=true` (legacy mode)|
 |`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` or `0` if 
`druid.generic.useDefaultValueForNull=true` (legacy mode)|
 |`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` or `0` if 
`druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric. 
If `expr` comes from a relation with a timestamp column (like `__time` in a 
Druid datasource), the "earliest" is taken from the row with the overall 
earliest non-null value of the timestamp column. If the earliest non-null value 
of the timestamp column appears in multiple rows, the `expr` may be taken from 
any of those rows. If `expr` does not come from a relation with a timestamp, 
then it is simply the first  [...]
-|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. 
The `maxBytesPerString` parameter determines how much aggregation space to 
allocate per string. Strings longer than this limit are truncated. This 
parameter should be set as low as possible, since high values will lead to 
wasted memory.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` 
(legacy mode)|
-|`EARLIEST_BY(expr, timestampExpr)`|Returns the earliest value of `expr`, 
which must be numeric. The earliest value of `expr` is taken from the row with 
the overall earliest non-null value of `timestampExpr`. If the earliest 
non-null value of `timestampExpr` appears in multiple rows, the `expr` may be 
taken from any of those rows.|`null` or `0` if 
`druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`EARLIEST_BY(expr, timestampExpr, maxBytesPerString)`| Like 
`EARLIEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` 
parameter determines how much aggregation space to allocate per string. Strings 
longer than this limit are truncated. This parameter should be set as low as 
possible, since high values will lead to wasted memory.|`null` or `''` if 
`druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. The 
`expr` must come from a relation with a timestamp column (like `__time` in a 
Druid datasource) and the "latest" is taken from the row with the overall 
latest non-null value of the timestamp column. If the latest non-null value of 
the timestamp column appears in multiple rows, the `expr` may be taken from any 
of those rows. |`null` or `0` if `druid.generic.useDefaultValueForNull=true` 
(legacy mode)|
-|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The 
`maxBytesPerString` parameter determines how much aggregation space to allocate 
per string. Strings longer than this limit are truncated. This parameter should 
be set as low as possible, since high values will lead to wasted memory.|`null` 
or `''` if `druid.generic.useDefaultValueForNull=false` (legacy mode)|
-|`LATEST_BY(expr, timestampExpr)`|Returns the latest value of `expr`, which 
must be numeric. The latest value of `expr` is taken from the row with the 
overall latest non-null value of `timestampExpr`. If the overall latest 
non-null value of `timestampExpr` appears in multiple rows, the `expr` may be 
taken from any of those rows.|`null` or `0` if 
`druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`LATEST_BY(expr, timestampExpr, maxBytesPerString)`|Like `LATEST_BY(expr, 
timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines 
how much aggregation space to allocate per string. Strings longer than this 
limit are truncated. This parameter should be set as low as possible, since 
high values will lead to wasted memory.|`null` or `''` if 
`druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be 
numeric. This aggregator can simplify and optimize the performance by returning 
the first encountered value (including null)|`null` or `0` if 
`druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. 
The `maxBytesPerString` parameter determines how much aggregation space to 
allocate per string. Strings longer than this limit are truncated. This 
parameter should be set as low as possible, since high values will lead to 
wasted memory.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` 
(legacy mode)|
+|`EARLIEST(expr, [maxBytesPerValue])`|Returns the earliest value of `expr`.<br 
/>If `expr` comes from a relation with a timestamp column (like `__time` in a 
Druid datasource), the "earliest" is taken from the row with the overall 
earliest non-null value of the timestamp column.<br />If the earliest non-null 
value of the timestamp column appears in multiple rows, the `expr` may be taken 
from any of those rows. If `expr` does not come from a relation with a 
timestamp, then it is simply the [...]
+|`EARLIEST_BY(expr, timestampExpr, [maxBytesPerValue])`|Returns the earliest 
value of `expr`.<br />The earliest value of `expr` is taken from the row with 
the overall earliest non-null value of `timestampExpr`. <br />If the earliest 
non-null value of `timestampExpr` appears in multiple rows, the `expr` may be 
taken from any of those rows.<br /><br />If `expr` is a string or complex type 
`maxBytesPerValue` amount of space is allocated for the aggregation. Strings 
longer than this limit ar [...]
+|`LATEST(expr, [maxBytesPerValue])`|Returns the latest value of `expr`<br 
/>The `expr` must come from a relation with a timestamp column (like `__time` 
in a Druid datasource) and the "latest" is taken from the row with the overall 
latest non-null value of the timestamp column.<br />If the latest non-null 
value of the timestamp column appears in multiple rows, the `expr` may be taken 
from any of those rows.<br /><br />If `expr` is a string or complex type 
`maxBytesPerValue` amount of spac [...]
+|`LATEST_BY(expr, timestampExpr, [maxBytesPerValue])`|Returns the latest value 
of `expr`.<br />The latest value of `expr` is taken from the row with the 
overall latest non-null value of `timestampExpr`.<br />If the overall latest 
non-null value of `timestampExpr` appears in multiple rows, the `expr` may be 
taken from any of those rows.<br /><br />If `expr` is a string or complex type 
`maxBytesPerValue` amount of space is allocated for the aggregation. Strings 
longer than this limit are t [...]
+|`ANY_VALUE(expr, [maxBytesPerValue])`|Returns any value of `expr` including 
null. This aggregator can simplify and optimize the performance by returning 
the first encountered value (including `null`).<br /><br />If `expr` is a 
string or complex type `maxBytesPerValue` amount of space is allocated for the 
aggregation. Strings longer than this limit are truncated. The 
`maxBytesPerValue` parameter should be set as low as possible, since high 
values will lead to wasted memory.<br/>If `maxBy [...]
 |`GROUPING(expr, expr...)`|Returns a number to indicate which groupBy 
dimension is included in a row, when using `GROUPING SETS`. Refer to 
[additional documentation](aggregations.md#grouping-aggregator) on how to infer 
this number.|N/A|
 |`ARRAY_AGG(expr, [size])`|Collects all values of `expr` into an ARRAY, 
including null values, with `size` in bytes limit on aggregation size (default 
of 1024 bytes). If the aggregated array grows larger than the maximum size in 
bytes, the query will fail. Use of `ORDER BY` within the `ARRAY_AGG` expression 
is not currently supported, and the ordering of results within the output array 
may vary depending on processing order.|`null`|
 |`ARRAY_AGG(DISTINCT expr, [size])`|Collects all distinct values of `expr` 
into an ARRAY, including null values, with `size` in bytes limit on aggregation 
size (default of 1024 bytes) per aggregate. If the aggregated array grows 
larger than the maximum size in bytes, the query will fail. Use of `ORDER BY` 
within the `ARRAY_AGG` expression is not currently supported, and the ordering 
of results will be based on the default for the element type.|`null`|
diff --git a/docs/querying/sql-functions.md b/docs/querying/sql-functions.md
index f936610e16..3df24ea607 100644
--- a/docs/querying/sql-functions.md
+++ b/docs/querying/sql-functions.md
@@ -50,11 +50,7 @@ Calculates the arc cosine of a numeric expression.
 
 ## ANY_VALUE
 
-`ANY_VALUE(<NUMERIC>)`
-
-`ANY_VALUE(<BOOLEAN>)`
-
-`ANY_VALUE(<CHARACTER>, <NUMERIC>)`
+`ANY_VALUE(expr, [maxBytesPerValue])`
 
 **Function type:** [Aggregation](sql-aggregations.md)
 
@@ -641,9 +637,7 @@ Returns a union of Tuple sketches which each contain an array of double values a
 
 ## EARLIEST
 
-`EARLIEST(expr)`
-
-`EARLIEST(expr, maxBytesPerString)`
+`EARLIEST(expr, [maxBytesPerValue])`
 
 **Function type:** [Aggregation](sql-aggregations.md)
 
@@ -651,9 +645,7 @@ Returns the value of a numeric or string expression corresponding to the earlies
 
 ## EARLIEST_BY
 
-`EARLIEST_BY(expr, timestampExpr)`
-
-`EARLIEST_BY(expr, timestampExpr, maxBytesPerString)`
+`EARLIEST_BY(expr, timestampExpr, [maxBytesPerValue])`
 
 **Function type:** [Aggregation](sql-aggregations.md)
 
@@ -837,9 +829,7 @@ Extracts a literal value from `expr` at the specified `path`. If you specify `RE
 
 ## LATEST
 
-`LATEST(expr)`
-
-`LATEST(expr, maxBytesPerString)`
+`LATEST(expr, [maxBytesPerValue])`
 
 **Function type:** [Aggregation](sql-aggregations.md)
 
@@ -847,9 +837,7 @@ Returns the value of a numeric or string expression corresponding to the latest
 
 ## LATEST_BY
 
-`LATEST_BY(expr, timestampExpr)`
-
-`LATEST_BY(expr, timestampExpr, maxBytesPerString)`
+`LATEST_BY(expr, timestampExpr, [maxBytesPerValue])`
 
 **Function type:** [Aggregation](sql-aggregations.md)
 
diff --git a/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java b/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java
index 0137689a85..5f1b3c3228 100644
--- a/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java
+++ b/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java
@@ -82,7 +82,7 @@ public class EarliestLatestAnySqlAggregator implements SqlAggregator
           String fieldName,
           String timeColumn,
           ColumnType type,
-          int maxStringBytes
+          Integer maxStringBytes
       )
       {
         switch (type.getType()) {
@@ -108,7 +108,7 @@ public class EarliestLatestAnySqlAggregator implements SqlAggregator
           String fieldName,
           String timeColumn,
           ColumnType type,
-          int maxStringBytes
+          Integer maxStringBytes
       )
       {
         switch (type.getType()) {
@@ -134,7 +134,7 @@ public class EarliestLatestAnySqlAggregator implements SqlAggregator
           String fieldName,
           String timeColumn,
           ColumnType type,
-          int maxStringBytes
+          Integer maxStringBytes
       )
       {
         switch (type.getType()) {
@@ -157,7 +157,7 @@ public class EarliestLatestAnySqlAggregator implements SqlAggregator
         String fieldName,
         String timeColumn,
         ColumnType outputType,
-        int maxStringBytes
+        Integer maxStringBytes
     );
   }
 
@@ -236,7 +236,7 @@ public class EarliestLatestAnySqlAggregator implements SqlAggregator
     final AggregatorFactory theAggFactory;
     switch (args.size()) {
       case 1:
-        theAggFactory = aggregatorType.createAggregatorFactory(aggregatorName, fieldName, null, outputType, -1);
+        theAggFactory = aggregatorType.createAggregatorFactory(aggregatorName, fieldName, null, outputType, null);
         break;
       case 2:
         int maxStringBytes;
@@ -327,8 +327,7 @@ public class EarliestLatestAnySqlAggregator implements SqlAggregator
           EARLIEST_LATEST_ARG0_RETURN_TYPE_INFERENCE,
           InferTypes.RETURN_TYPE,
           OperandTypes.or(
-              OperandTypes.NUMERIC,
-              OperandTypes.BOOLEAN,
+              OperandTypes.ANY,
               OperandTypes.sequence(
                   "'" + aggregatorType.name() + "(expr, maxBytesPerString)'",
                   OperandTypes.ANY,
diff --git a/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestBySqlAggregator.java b/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestBySqlAggregator.java
index 2c2e379d96..95b70e1f1e 100644
--- a/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestBySqlAggregator.java
+++ b/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestBySqlAggregator.java
@@ -128,7 +128,7 @@ public class EarliestLatestBySqlAggregator implements SqlAggregator
                 rexNodes.get(1)
             ),
             outputType,
-            -1
+            null
         );
         break;
       case 3:
diff --git a/sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java b/sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
index 179da414c5..e1aab8de41 100644
--- a/sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
+++ b/sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
@@ -13957,13 +13957,13 @@ public class CalciteQueryTest extends BaseCalciteQueryTest
                         .setVirtualColumns(
                             expressionVirtualColumn(
                                 "v0",
-                                "CAST(greatest(\"dim1\",\"dim2\"), 'DOUBLE')",
-                                ColumnType.DOUBLE
+                                "greatest(\"dim1\",\"dim2\")",
+                                ColumnType.STRING
                             )
                         )
                         .setGranularity(Granularities.ALL)
                         .addDimension(new DefaultDimensionSpec("l1", "_d0", ColumnType.LONG))
-                        .addAggregator(new DoubleLastAggregatorFactory("a0", "v0", null))
+                        .addAggregator(new StringLastAggregatorFactory("a0", "v0", null, 1024))
                         .setPostAggregatorSpecs(ImmutableList.of(
                             expressionPostAgg("p0", "isnull(\"a0\")")
                         ))
@@ -13976,9 +13976,9 @@ public class CalciteQueryTest extends BaseCalciteQueryTest
             new Object[]{325323L, false}
         ) :
         ImmutableList.of(
-            new Object[]{null, true},
+            new Object[]{null, false},
             new Object[]{0L, false},
-            new Object[]{7L, true},
+            new Object[]{7L, false},
             new Object[]{325323L, false}
         )
     );
@@ -14269,4 +14269,36 @@ public class CalciteQueryTest extends BaseCalciteQueryTest
         )
     );
   }
+
+  @Test
+  public void testLatestByOnStringColumnWithoutMaxBytesSpecified()
+  {
+    String defaultString = useDefault ? "" : null;
+    cannotVectorize();
+    testQuery(
+        "SELECT dim2,LATEST(dim3),LATEST_BY(dim1, __time),EARLIEST(dim3),EARLIEST_BY(dim1, __time),ANY_VALUE(dim3) FROM druid.foo where dim2='abc' group by 1",
+        ImmutableList.of(
+            GroupByQuery.builder()
+                .setDataSource(CalciteTests.DATASOURCE1)
+                .setInterval(querySegmentSpec(Filtration.eternity()))
+                .setGranularity(Granularities.ALL)
+                .setVirtualColumns(
+                    expressionVirtualColumn("v0", "'abc'", ColumnType.STRING))
+                .setDimFilter(equality("dim2", "abc", ColumnType.STRING))
+                .setDimensions(
+                    dimensions(new DefaultDimensionSpec("v0", "d0", ColumnType.STRING)))
+                .setAggregatorSpecs(
+                    aggregators(
+                        new StringLastAggregatorFactory("a0", "dim3", "__time", 1024),
+                        new StringLastAggregatorFactory("a1", "dim1", "__time", 1024),
+                        new StringFirstAggregatorFactory("a2", "dim3", "__time", 1024),
+                        new StringFirstAggregatorFactory("a3", "dim1", "__time", 1024),
+                        new StringAnyAggregatorFactory("a4", "dim3", 1024)))
+                .build()
+
+        ),
+        ImmutableList.of(
+            new Object[] {"abc", defaultString, "def", defaultString, "def", defaultString}
+        ));
+  }
 }
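
The diff above changes `maxStringBytes` from `int` to `Integer` and passes `null` instead of `-1` when the SQL call omits the byte limit. The new test then expects string aggregator factories built with 1024. A minimal sketch of that "null means use the default" pattern follows. The class name, helper name, and constant are hypothetical and do not appear in the commit, and the diff does not show where Druid actually applies the default.

```java
public class MaxBytesDefaultSketch
{
  // Assumption: 1024 is taken from the expected aggregator factories in the
  // test above; the real constant and its location are not shown in the diff.
  static final int ASSUMED_DEFAULT_MAX_STRING_BYTES = 1024;

  // Hypothetical helper: a null argument stands for "maxBytesPerValue was
  // not given in the SQL call", so the assumed default is used instead.
  static int resolveMaxStringBytes(Integer maxStringBytes)
  {
    return maxStringBytes == null ? ASSUMED_DEFAULT_MAX_STRING_BYTES : maxStringBytes;
  }

  public static void main(String[] args)
  {
    // LATEST(dim3): no limit supplied, falls back to the assumed default.
    System.out.println(resolveMaxStringBytes(null));
    // LATEST(dim3, 128): an explicit limit is kept as given.
    System.out.println(resolveMaxStringBytes(128));
  }
}
```

Using a nullable `Integer` rather than a sentinel such as `-1` keeps "not specified" distinct from any numeric value the caller might pass.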

