[GitHub] [druid] gonzalezzfelipe commented on issue #12616: `grouping` aggregator should use output name instead of dimension name

GitBox Mon, 27 Jun 2022 13:41:03 -0700


gonzalezzfelipe commented on issue #12616:
URL: https://github.com/apache/druid/issues/12616#issuecomment-1167866693


   TL;DR: For native queries, this can be solved by defining the dimensions 
that are not of type default as virtual columns, and treating them as regular 
dimensions. For SQL I wasn't able to find a direct solution.
   
   I got deeper into the problem and found the following:
   
   The `GROUPING` aggregator creates the grouping key 
[here](https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/aggregation/GroupingAggregatorFactory.java#L238-L274).
 This `keyDimensions` set is created 
[here](https://github.com/abhishekagarwal87/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java#L589-L606)
 upon aggregation to check against the `groupings`. As stated on the comments 
on that code, using `outputName` at the moment of aggregation is not possible, 
because aggregators are not aware of the output name of the output column names:
   
   ```java
         // KeyDimensionNames are the input column names of dimensions. Its 
required since aggregators are not aware of the
         //  output column names.
         //  As we exclude certain dimensions from the result row, the value 
for any grouping_id aggregators have to change
         //  to reflect the new grouping dimensions, that aggregation is being 
done upon. We will mark the indices which have
         //  grouping aggregators and update the value for each row at those 
indices.
   ```
   
   Nevertheless, there is a way out of the rabbit hole. As shown 
[here](https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/GroupingSqlAggregator.java#L65-L76),
 the way that the `GROUPING` SQL function is converted into a Native query, is 
by extracting the dimension *or* virtual column that is being grouped. As shown 
in the code:
   
   ```java
       if (expression.isDirectColumnAccess()) {
         return expression.getDirectColumn();
       }
   
       String virtualColumn = 
virtualColumnRegistry.getOrCreateVirtualColumnForExpression(
           expression,
           node.getType()
       );
       return virtualColumn;
   ```
   
   So it would seem that there would be no problem if one would be using 
virtual columns. Running an `EXPLAIN PLAN FOR` the query in the [previous 
comment](https://github.com/apache/druid/issues/12616#issuecomment-1163442788), 
shows that that SQL query is converted (roughly) to the following:
   
   ```json
   {
     "queryType": "groupBy",
     "dataSource": {
       "type": "table",
       "name": "datasource"
     },
     "intervals": {
       "type": "intervals",
       "intervals": 
["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"]
     },
     "granularity": {
       "type": "all"
     },
     "dimensions": [
       {
         "type": "extraction",
         "dimension": "id",
         "outputName": "foo",
         "outputType": "STRING",
         "extractionFn": {
           "type": "registeredLookup",
           "lookup": "lookup"
         }
       }
     ],
     "aggregations": [
       {
         "type": "grouping",
         "name": "__grouping__",
         "groupings": [
           "v0"
         ]
       },
       {
         "type": "count",
         "name": "count"
       }
     ],
     "postAggregations": [],
     "subtotalsSpec": [
       [
         "foo"
       ],
       []
     ]
   }
   ```
   Note that on the `GROUPING` aggregator, the query is referring to a `v0` 
virtual column that is not added to the query. This is created on the 
`virtualColumnRegistry.getOrCreateVirtualColumnForExpression` on the snippet 
above, but for some reason that virtual column is not effectively added to the 
query. Later, when checking wether the corresponding dimension is being grouped 
on aggregator resolver, the `v0` virtual column is not found and that's why we 
get a `1` instead of a `0` for those cases. Nevertheless, if we tweek that 
query to include the virtual column and address it as such, then the function 
works as intended.
   
   ```json
   {
     "queryType": "groupBy",
     "dataSource": {
       "type": "table",
       "name": "datasource"
     },
     "intervals": {
       "type": "intervals",
       "intervals": 
["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"]
     },
     "granularity": {
       "type": "all"
     },
     "virtualColumns": [
       {
         "type": "expression",
         "name": "foo",
         "expression": "LOOKUP(id, 'lookup')",
         "outputType": "STRING"
       }
     ]
     "dimensions": [
       {
         "type": "default",
         "dimension": "foo"
       }
     ],
     "aggregations": [
       {
         "type": "grouping",
         "name": "__grouping__",
         "groupings": [
           "foo"
         ]
       },
       {
         "type": "count",
         "name": "count"
       }
     ],
     "postAggregations": [],
     "subtotalsSpec": [
       [
         "foo"
       ],
       []
     ]
   }
   ```
   
   So it looks like a possible solution for the SQL is understanding why the 
virtual column `v0` that is being "created" is not on the final native query, 
and making use of it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] gonzalezzfelipe commented on issue #12616: `grouping` aggregator should use output name instead of dimension name

Reply via email to