[GitHub] [druid] superhawk610 opened a new issue, #12616: `grouping` aggregator should use output name instead of dimension name

GitBox Tue, 07 Jun 2022 13:40:58 -0700


superhawk610 opened a new issue, #12616:
URL: https://github.com/apache/druid/issues/12616


   ### Affected Version
   
   `2021.11.1-iap`
   
   ### Description
   
   When providing multiple sub-groups to `subtotalsSpec`, the Druid docs
   recommend using the `grouping` aggregator to differentiate between results.
   For each dimension in a given list, the `grouping` aggregator reports whether
   that dimension is included in a given sub-group's totals. `subtotalsSpec`,
   however, accepts any `outputName`, not just dimension names. Take this example:
   
   ```jsonc
   {
     "queryType": "groupBy",
     "granularity": "all",
     // .. (snip) ..
     "subtotalsSpec": [["a"], ["b"]],
     "aggregations": [
       {
         "type": "grouping",
         "name": "__grouping__",
         "groupings": ["a", "b"]
       }
     ],
     "dimensions": [
       {
         "type": "lookup",
         "dimension": "id",
         "outputName": "a",
         "lookup": {
           "type": "map",
           "map": { "1": "foo", "2": "foo", "3": "bar" }
         }
       },
       {
         "type": "lookup",
         "dimension": "id",
         "outputName": "b",
         "lookup": {
           "type": "map",
           // importantly, `id=2` is in a different sub-group depending on whether
           // we're grouping by `a` or `b` (even though the base dimension, `id`,
           // is the same in each case)
           "map": { "1": "X", "2": "Y", "3": "Z" }
         }
       }
     ]
   }
   ```
   
   I would expect this query to return results that look like this:
   
   ```jsonc
   [
     // omitting the timestamp/version/event wrapper, but you get the idea
     {
       "__grouping__": 0b01,
       "a": "foo",
       "b": null,
       "views": 2 // some metric, doesn't really matter
     },
     {
       "__grouping__": 0b01,
       "a": "bar",
       "b": null,
       "views": 1
     },
     {
       "__grouping__": 0b10,
       "a": null,
       "b": "X",
       "views": 1
     },
     {
       "__grouping__": 0b10,
       "a": null,
       "b": "Y",
       "views": 2
     }
   ]
   ```
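
To make the bit patterns above concrete, here is a minimal Python sketch of how a client could decode a `__grouping__` value. It assumes the convention used in the expected output: the first entry of `groupings` maps to the most significant bit, and a `1` bit means the row is *not* grouped by that name. The helper `grouped_names` is hypothetical and not part of Druid.

```python
def grouped_names(value: int, groupings: list[str]) -> list[str]:
    """Return the names whose bit is 0, i.e. the names the row is grouped by."""
    n = len(groupings)
    return [
        name
        for i, name in enumerate(groupings)
        # Assumption: first entry is the most significant bit; 1 = "not grouped by".
        if (value >> (n - 1 - i)) & 1 == 0
    ]


groupings = ["a", "b"]
for value in (0b01, 0b10, 0b11):
    print(f"{value:#04b} -> {grouped_names(value, groupings)}")
```

With this convention, `0b01` decodes to `["a"]`, `0b10` to `["b"]`, and `0b11` to an empty list, meaning the row cannot be attributed to either sub-group.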
   
   However, `__grouping__` is `0b11` for all 4 results, since the "dimensions"
   `a` and `b` aren't used in any result (they're not dimensions, they're the
   output names of lookups). If I provide `id` to the `grouping` aggregator
   instead, its corresponding bit in the output is `0` for all rows, since `id`
   is used to generate both the `a` and `b` values. That is correct, but it isn't
   helpful, as I cannot tell which results are grouped by `a` and which are
   grouped by `b`.
   
   I propose that the `grouping` aggregator allow specifying `outputName` 
instead of dimension name, to align 1:1 with how `subtotalsSpec` works and 
allow for differentiating between output sub-groups in cases like that 
illustrated above.
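
The difference between the two matching strategies can be sketched in Python. This is only a model of the behavior described in this issue, not Druid's actual implementation: `dimensions` is a simplified stand-in for the query's dimension specs, and `grouping_value`, `by_dimension_name`, and `by_output_name` are hypothetical helpers.

```python
def grouping_value(groupings, included):
    """Bitmask with the first entry as the most significant bit; 1 = not included."""
    value = 0
    for name in groupings:
        value = (value << 1) | (0 if name in included else 1)
    return value


# Hypothetical, simplified view of the dimension specs from the example query.
dimensions = [
    {"dimension": "id", "outputName": "a"},
    {"dimension": "id", "outputName": "b"},
]


def by_dimension_name(groupings, subtotal):
    # Models the reported behavior: names are matched against base dimensions.
    included = {d["dimension"] for d in dimensions if d["outputName"] in subtotal}
    return grouping_value(groupings, included)


def by_output_name(groupings, subtotal):
    # Models the proposal: names are matched against the same output names
    # that `subtotalsSpec` uses.
    return grouping_value(groupings, set(subtotal))


for subtotal in (["a"], ["b"]):
    reported = by_dimension_name(["a", "b"], subtotal)
    proposed = by_output_name(["a", "b"], subtotal)
    print(subtotal, f"reported={reported:#04b}", f"proposed={proposed:#04b}")
```

Under this model, matching by dimension name yields `0b11` for both sub-groups, while matching by output name yields `0b01` for `["a"]` and `0b10` for `["b"]`, which is what makes the rows distinguishable.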

