Alexander Behm created IMPALA-6179:
--------------------------------------
Summary: Constant argument to UDAF not accessible in merge phase
of distributed execution
Key: IMPALA-6179
URL: https://issues.apache.org/jira/browse/IMPALA-6179
Project: IMPALA
Issue Type: Bug
Components: Backend, Frontend
Affects Versions: Impala 2.10.0
Reporter: Alexander Behm
Attachments: 0001-Add-ConstTest-agg-fn.patch
While adding a new built-in aggregation function I noticed that constant
arguments are not accessible in the merge phase of the aggregation, i.e., in
Init(), Merge() or Finalize() of the merge phase.
In this example the 2nd constant argument is only accessible in the per-agg
phase, but not in the merge agg phase.
{code}
[localhost:21000] > explain select const_test(bigint_col, 0.1) from
functional.alltypes;
Query: explain select const_test(bigint_col, 0.1) from functional.alltypes
+----------------------------------------------+
| Explain String |
+----------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B |
| Per-Host Resource Estimates: Memory=148.00MB |
| Codegen disabled by planner |
| |
| PLAN-ROOT SINK |
| | |
| 03:AGGREGATE [FINALIZE] |
| | output: const_test:merge(bigint_col, 0.1) |
| | |
| 02:EXCHANGE [UNPARTITIONED] |
| | |
| 01:AGGREGATE |
| | output: const_test(bigint_col, 0.1) |
| | |
| 00:SCAN HDFS [functional.alltypes] |
| partitions=24/24 files=24 size=478.45KB |
+----------------------------------------------+
{code}
With num_nodes=1 the constant argument is accessible in the Finalize() of the
single aggregation phase:
[code}
[localhost:21000] > explain select const_test(bigint_col, 0.1) from
functional.alltypes;
Query: explain select const_test(bigint_col, 0.1) from functional.alltypes
+----------------------------------------------+
| Explain String |
+----------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B |
| Per-Host Resource Estimates: Memory=138.00MB |
| Codegen disabled by planner |
| |
| PLAN-ROOT SINK |
| | |
| 01:AGGREGATE [FINALIZE] |
| | output: const_test(bigint_col, 0.1) |
| | |
| 00:SCAN HDFS [functional.alltypes] |
| partitions=24/24 files=24 size=478.45KB |
+----------------------------------------------+
{code}
I've attached a patch produced with git format-patch for to repro the above.
It's not clear whether this behavior is a bug or intended. I can see arguments
both ways:
* An aggregation function can take any number of arguments.
* The Update() phase produces a single output slot.
* The Merge() phase consumes the single slot produced by the Update() phase and
produces another output slot.
* It would be quite convenient to have access to constant arguments of the
original SQL invocation in all phases of the aggregation.
* However, this seems semantically at odds with non-constant arguments. For
non-constant arguments one would expect the Update() to aggregate/store
whatever state is needed for Merge() in the single output slot. So why should
that be different for constant arguments?
* How would the planner decide which arguments to forward to the Merge() phase?
What would the BE aggregation function signatures look like? Today, the Merge()
phase always takes a single SlotRef as input.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)