gonzalezzfelipe commented on issue #12616: URL: https://github.com/apache/druid/issues/12616#issuecomment-1167866693
TL;DR: For native queries, this can be solved by defining the dimensions that are not of type default as virtual columns, and treating them as regular dimensions. For SQL I wasn't able to find a direct solution. I got deeper into the problem and found the following: The `GROUPING` aggregator creates the grouping key [here](https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/aggregation/GroupingAggregatorFactory.java#L238-L274). This `keyDimensions` set is created [here](https://github.com/abhishekagarwal87/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java#L589-L606) upon aggregation to check against the `groupings`. As stated on the comments on that code, using `outputName` at the moment of aggregation is not possible, because aggregators are not aware of the output name of the output column names: ```java // KeyDimensionNames are the input column names of dimensions. Its required since aggregators are not aware of the // output column names. // As we exclude certain dimensions from the result row, the value for any grouping_id aggregators have to change // to reflect the new grouping dimensions, that aggregation is being done upon. We will mark the indices which have // grouping aggregators and update the value for each row at those indices. ``` Nevertheless, there is a way out of the rabbit hole. As shown [here](https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/GroupingSqlAggregator.java#L65-L76), the way that the `GROUPING` SQL function is converted into a Native query, is by extracting the dimension *or* virtual column that is being grouped. As shown in the code: ```java if (expression.isDirectColumnAccess()) { return expression.getDirectColumn(); } String virtualColumn = virtualColumnRegistry.getOrCreateVirtualColumnForExpression( expression, node.getType() ); return virtualColumn; ``` So it would seem that there would be no problem if one would be using virtual columns. Running an `EXPLAIN PLAN FOR` the query in the [previous comment](https://github.com/apache/druid/issues/12616#issuecomment-1163442788), shows that that SQL query is converted (roughly) to the following: ```json { "queryType": "groupBy", "dataSource": { "type": "table", "name": "datasource" }, "intervals": { "type": "intervals", "intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"] }, "granularity": { "type": "all" }, "dimensions": [ { "type": "extraction", "dimension": "id", "outputName": "foo", "outputType": "STRING", "extractionFn": { "type": "registeredLookup", "lookup": "lookup" } } ], "aggregations": [ { "type": "grouping", "name": "__grouping__", "groupings": [ "v0" ] }, { "type": "count", "name": "count" } ], "postAggregations": [], "subtotalsSpec": [ [ "foo" ], [] ] } ``` Note that on the `GROUPING` aggregator, the query is referring to a `v0` virtual column that is not added to the query. This is created on the `virtualColumnRegistry.getOrCreateVirtualColumnForExpression` on the snippet above, but for some reason that virtual column is not effectively added to the query. Later, when checking wether the corresponding dimension is being grouped on aggregator resolver, the `v0` virtual column is not found and that's why we get a `1` instead of a `0` for those cases. Nevertheless, if we tweek that query to include the virtual column and address it as such, then the function works as intended. ```json { "queryType": "groupBy", "dataSource": { "type": "table", "name": "datasource" }, "intervals": { "type": "intervals", "intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"] }, "granularity": { "type": "all" }, "virtualColumns": [ { "type": "expression", "name": "foo", "expression": "LOOKUP(id, 'lookup')", "outputType": "STRING" } ] "dimensions": [ { "type": "default", "dimension": "foo" } ], "aggregations": [ { "type": "grouping", "name": "__grouping__", "groupings": [ "foo" ] }, { "type": "count", "name": "count" } ], "postAggregations": [], "subtotalsSpec": [ [ "foo" ], [] ] } ``` So it looks like a possible solution for the SQL is understanding why the virtual column `v0` that is being "created" is not on the final native query, and making use of it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
