Aman Sinha created IMPALA-10755:
-----------------------------------

             Summary: Wrong results for a query with predicate on an analytic 
function
                 Key: IMPALA-10755
                 URL: https://issues.apache.org/jira/browse/IMPALA-10755
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 4.0
            Reporter: Aman Sinha


In the master branch, the following query with equality predicate on analytic 
function produces 13 rows whereas the correct result is 3 rows (as verified 
with Hive):

{noformat}

use tpch;

select *
  FROM (
    SELECT
      l_quantity, l_linenumber,
      nr_pvp,
      max(nr_pvp) over (partition BY l_quantity) max_nr_pvp
    FROM (
      SELECT
        l_quantity, l_linenumber,
        count(1) AS nr_pvp
      FROM
        lineitem, partsupp
        where l_partkey = ps_partkey
        and ps_suppkey = 10 and  l_quantity in (1, 2)
        GROUP BY l_quantity, l_linenumber
    ) a
  ) b
  WHERE nr_pvp = max_nr_pvp
  order by 1, 2, 3, 4;
{noformat}

Wrong results:
{noformat}
+------------+--------------+--------+------------+                             
                         
| l_quantity | l_linenumber | nr_pvp | max_nr_pvp |
+------------+--------------+--------+------------+
| 1.00       | 1            | 11     | 12         |
| 1.00       | 2            | 12     | 12         |
| 1.00       | 3            | 10     | 12         |
| 1.00       | 4            | 5      | 12         |
| 1.00       | 5            | 4      | 12         |
| 1.00       | 6            | 5      | 12         |
| 1.00       | 7            | 2      | 12         |
| 2.00       | 1            | 12     | 12         |
| 2.00       | 2            | 12     | 12         |
| 2.00       | 3            | 11     | 12         |
| 2.00       | 4            | 8      | 12         |
| 2.00       | 5            | 3      | 12         |
| 2.00       | 6            | 2      | 12         |
+------------+--------------+--------+------------+
Fetched 13 row(s) in 1.36s
{noformat}

Expected results (verified with Hive):
{noformat}
+---------------+-----------------+-----------+---------------+
| b.l_quantity  | b.l_linenumber  | b.nr_pvp  | b.max_nr_pvp  |
+---------------+-----------------+-----------+---------------+
| 1.00          | 2               | 12        | 12            |
| 2.00          | 1               | 12        | 12            |
| 2.00          | 2               | 12        | 12            |
+---------------+-----------------+-----------+---------------+
{noformat}

Explain plan for Impala shows that the equality predicate 'nr_pvp = max_nr_pvp' 
which is supposed to be applied on top of the analytic operator is missing:
{noformat}
+---------------------------------------------------------------------------+
| Explain String                                                            |
+---------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=42.88MB Threads=7               |
| Per-Host Resource Estimates: Memory=483MB                                 |
|                                                                           |
| PLAN-ROOT SINK                                                            |
| |                                                                         |
| 11:MERGING-EXCHANGE [UNPARTITIONED]                                       |
| |  order by: l_quantity ASC, l_linenumber ASC, nr_pvp ASC, max_nr_pvp ASC |
| |                                                                         |
| 06:SORT                                                                   |
| |  order by: l_quantity ASC, l_linenumber ASC, nr_pvp ASC, max_nr_pvp ASC |
| |  row-size=28B cardinality=357                                           |
| |                                                                         |
| 05:ANALYTIC                                                               |
| |  functions: max(count(*))                                               |
| |  partition by: l_quantity                                               |
| |  row-size=28B cardinality=357                                           |
| |                                                                         |
| 04:SORT                                                                   |
| |  order by: l_quantity ASC NULLS LAST                                    |
| |  row-size=20B cardinality=357                                           |
| |                                                                         |
| 10:EXCHANGE [HASH(l_quantity)]                                            |
| |                                                                         |
| 09:AGGREGATE [FINALIZE]                                                   |
| |  output: count:merge(*)                                                 |
| |  group by: l_quantity, l_linenumber                                     |
| |  row-size=20B cardinality=357                                           |
| |                                                                         |
| 08:EXCHANGE [HASH(l_quantity,l_linenumber)]                               |
| |                                                                         |
| 03:AGGREGATE [STREAMING]                                                  |
| |  output: count(*)                                                       |
| |  group by: l_quantity, l_linenumber                                     |
| |  row-size=20B cardinality=357                                           |
| |                                                                         |
| 02:HASH JOIN [INNER JOIN, BROADCAST]                                      |
| |  hash predicates: l_partkey = ps_partkey                                |
| |  runtime filters: RF000 <- ps_partkey                                   |
| |  row-size=36B cardinality=2.45K                                         |
| |                                                                         |
| |--07:EXCHANGE [BROADCAST]                                                |
| |  |                                                                      |
| |  01:SCAN HDFS [tpch.partsupp]                                           |
| |     HDFS partitions=1/1 files=1 size=112.71MB                           |
| |     predicates: ps_suppkey = 10                                         |
| |     row-size=16B cardinality=82                                         |
| |                                                                         |
| 00:SCAN HDFS [tpch.lineitem]                                              |
|    HDFS partitions=1/1 files=1 size=718.94MB                              |
|    predicates: l_quantity IN (1, 2)                                       |
|    runtime filters: RF000 -> l_partkey                                    |
|    row-size=20B cardinality=235.34K                                       |
+---------------------------------------------------------------------------+
{noformat}

This is likely a regression since based on some internal testing this was 
working correctly on an earlier version.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to