njsmith opened a new pull request, #20019:
URL: https://github.com/apache/datafusion/pull/20019
DataFusion's current `AggregateMode` enum has four variants covering three
of the four cells in the input/output matrix:
| Input: raw data | Input:
partial state
Output: final values | `Single` / `SinglePartitioned` | `Final` /
`FinalPartitioned`
Output: partial state | `Partial` | ???
This PR adds `AggregateMode::PartialReduce` to fill in the missing cell: it
takes partially-reduced values as input, and reduces them further, but without
finalizing.
This is useful because it's the key component needed to implement
distributed tree-reduction (as seen in e.g. the Scuba or Honeycomb papers): a
set of worker nodes each perform multithreaded `Partial` aggregations, feed
those into a `PartialReduce` to reduce all of this node's values into a single
row, and then a head node collects the outputs from all nodes' `PartialReduce`
to feed into a `Final` reduction.
PR can be reviewed commit by commit: first commit is pure
refactor/simplification; most places we were matching on `AggregateMode` we
were actually just trying to either check which row of the above table we were
in, or else which column. So now we have `is_first_stage` (tells you which
column) and `is_last_stage` (tells you which row) and we use them everywhere.
Second commit adds `PartialReduce`, and is pretty small because
`is_first_stage`/`is_last_stage` do most of the heavy lifting. It also adds a
test demonstrating a minimal Partial -> PartialReduce -> Final tree-reduction.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]