Alexander Behm created IMPALA-6179:
--------------------------------------

             Summary: Constant argument to UDAF not accessible in merge phase 
of distributed execution
                 Key: IMPALA-6179
                 URL: https://issues.apache.org/jira/browse/IMPALA-6179
             Project: IMPALA
          Issue Type: Bug
          Components: Backend, Frontend
    Affects Versions: Impala 2.10.0
            Reporter: Alexander Behm
         Attachments: 0001-Add-ConstTest-agg-fn.patch

While adding a new built-in aggregation function I noticed that constant 
arguments are not accessible in the merge phase of the aggregation, i.e.,  in 
Init(), Merge() or Finalize() of the merge phase.

In this example the 2nd constant argument is only accessible in the per-agg 
phase, but not in the merge agg phase.
{code}
[localhost:21000] > explain select const_test(bigint_col, 0.1) from 
functional.alltypes;
Query: explain select const_test(bigint_col, 0.1) from functional.alltypes
+----------------------------------------------+
| Explain String                               |
+----------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B |
| Per-Host Resource Estimates: Memory=148.00MB |
| Codegen disabled by planner                  |
|                                              |
| PLAN-ROOT SINK                               |
| |                                            |
| 03:AGGREGATE [FINALIZE]                      |
| |  output: const_test:merge(bigint_col, 0.1) |
| |                                            |
| 02:EXCHANGE [UNPARTITIONED]                  |
| |                                            |
| 01:AGGREGATE                                 |
| |  output: const_test(bigint_col, 0.1)       |
| |                                            |
| 00:SCAN HDFS [functional.alltypes]           |
|    partitions=24/24 files=24 size=478.45KB   |
+----------------------------------------------+
{code}

With num_nodes=1 the constant argument is accessible in the Finalize() of the 
single aggregation phase:
[code}
[localhost:21000] > explain select const_test(bigint_col, 0.1) from 
functional.alltypes;
Query: explain select const_test(bigint_col, 0.1) from functional.alltypes
+----------------------------------------------+
| Explain String                               |
+----------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B |
| Per-Host Resource Estimates: Memory=138.00MB |
| Codegen disabled by planner                  |
|                                              |
| PLAN-ROOT SINK                               |
| |                                            |
| 01:AGGREGATE [FINALIZE]                      |
| |  output: const_test(bigint_col, 0.1)       |
| |                                            |
| 00:SCAN HDFS [functional.alltypes]           |
|    partitions=24/24 files=24 size=478.45KB   |
+----------------------------------------------+
{code}

I've attached a patch produced with git format-patch for to repro the above.

It's not clear whether this behavior is a bug or intended. I can see arguments 
both ways:
* An aggregation function can take any number of arguments.
* The Update() phase produces a single output slot.
* The Merge() phase consumes the single slot produced by the Update() phase and 
produces another output slot.
* It would be quite convenient to have access to constant arguments of the 
original SQL invocation in all phases of the aggregation.
* However, this seems semantically at odds with non-constant arguments. For 
non-constant arguments one would expect the Update() to aggregate/store 
whatever state is needed for Merge() in the single output slot. So why should 
that be different for constant arguments?
* How would the planner decide which arguments to forward to the Merge() phase? 
What would the BE aggregation function signatures look like? Today, the Merge() 
phase always takes a single SlotRef as input.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to