[jira] [Commented] (IMPALA-10755) Wrong results for a query with predicate on an analytic function

Aman Sinha (Jira) Mon, 21 Jun 2021 10:55:04 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-10755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366761#comment-17366761
 ]


Aman Sinha commented on IMPALA-10755:
-------------------------------------

As part of the changes in IMPALA-9979 (part 2 partitioned top-n) we made 
changes to push down predicates that reference analytic tuple into the inline 
view [1]. The regression got introduced because in the above example, the 
BinaryPredicate nr_pvp = max_nr_pvp is actually referencing 2 tuple ids, one of 
which (the right side) is the analytic tuple. In this case, the pushdown logic 
does not detect it and pushes into the inline view and removes the predicate 
from the unassigned conjuncts list thinking the analytic planner will take care 
of it. However, the analytic planner does not find the unassigned conjunct 
likely because it involves a non analytic tuple id (I need to debug this part a 
little more.  This is based on my initial investigation).  I have a potential 
fix that I am testing. 

[1] 
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1333

> Wrong results for a query with predicate on an analytic function
> ----------------------------------------------------------------
>
>                 Key: IMPALA-10755
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10755
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 4.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Critical
>              Labels: correctness, wrongresults
>
> In the master branch, the following query with equality predicate on analytic 
> function produces 13 rows whereas the correct result is 3 rows (as verified 
> with Hive):
> {noformat}
> use tpch;
> select *
>   FROM (
>     SELECT
>       l_quantity, l_linenumber,
>       nr_pvp,
>       max(nr_pvp) over (partition BY l_quantity) max_nr_pvp
>     FROM (
>       SELECT
>         l_quantity, l_linenumber,
>         count(1) AS nr_pvp
>       FROM
>         lineitem, partsupp
>         where l_partkey       = ps_partkey
>         and ps_suppkey = 10 and  l_quantity in (1, 2)
>         GROUP BY l_quantity, l_linenumber
>     ) a
>   ) b
>   WHERE nr_pvp = max_nr_pvp
>   order by 1, 2, 3, 4;
> {noformat}
> Wrong results:
> {noformat}
> +------------+--------------+--------+------------+                           
>                            
> | l_quantity | l_linenumber | nr_pvp | max_nr_pvp |
> +------------+--------------+--------+------------+
> | 1.00       | 1            | 11     | 12         |
> | 1.00       | 2            | 12     | 12         |
> | 1.00       | 3            | 10     | 12         |
> | 1.00       | 4            | 5      | 12         |
> | 1.00       | 5            | 4      | 12         |
> | 1.00       | 6            | 5      | 12         |
> | 1.00       | 7            | 2      | 12         |
> | 2.00       | 1            | 12     | 12         |
> | 2.00       | 2            | 12     | 12         |
> | 2.00       | 3            | 11     | 12         |
> | 2.00       | 4            | 8      | 12         |
> | 2.00       | 5            | 3      | 12         |
> | 2.00       | 6            | 2      | 12         |
> +------------+--------------+--------+------------+
> Fetched 13 row(s) in 1.36s
> {noformat}
> Expected results (verified with Hive):
> {noformat}
> +---------------+-----------------+-----------+---------------+
> | b.l_quantity  | b.l_linenumber  | b.nr_pvp  | b.max_nr_pvp  |
> +---------------+-----------------+-----------+---------------+
> | 1.00          | 2               | 12        | 12            |
> | 2.00          | 1               | 12        | 12            |
> | 2.00          | 2               | 12        | 12            |
> +---------------+-----------------+-----------+---------------+
> {noformat}
> Explain plan for Impala shows that the equality predicate 'nr_pvp = 
> max_nr_pvp' which is supposed to be applied on top of the analytic operator 
> is missing:
> {noformat}
> +---------------------------------------------------------------------------+
> | Explain String                                                            |
> +---------------------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=42.88MB Threads=7               |
> | Per-Host Resource Estimates: Memory=483MB                                 |
> |                                                                           |
> | PLAN-ROOT SINK                                                            |
> | |                                                                         |
> | 11:MERGING-EXCHANGE [UNPARTITIONED]                                       |
> | |  order by: l_quantity ASC, l_linenumber ASC, nr_pvp ASC, max_nr_pvp ASC |
> | |                                                                         |
> | 06:SORT                                                                   |
> | |  order by: l_quantity ASC, l_linenumber ASC, nr_pvp ASC, max_nr_pvp ASC |
> | |  row-size=28B cardinality=357                                           |
> | |                                                                         |
> | 05:ANALYTIC                                                               |
> | |  functions: max(count(*))                                               |
> | |  partition by: l_quantity                                               |
> | |  row-size=28B cardinality=357                                           |
> | |                                                                         |
> | 04:SORT                                                                   |
> | |  order by: l_quantity ASC NULLS LAST                                    |
> | |  row-size=20B cardinality=357                                           |
> | |                                                                         |
> | 10:EXCHANGE [HASH(l_quantity)]                                            |
> | |                                                                         |
> | 09:AGGREGATE [FINALIZE]                                                   |
> | |  output: count:merge(*)                                                 |
> | |  group by: l_quantity, l_linenumber                                     |
> | |  row-size=20B cardinality=357                                           |
> | |                                                                         |
> | 08:EXCHANGE [HASH(l_quantity,l_linenumber)]                               |
> | |                                                                         |
> | 03:AGGREGATE [STREAMING]                                                  |
> | |  output: count(*)                                                       |
> | |  group by: l_quantity, l_linenumber                                     |
> | |  row-size=20B cardinality=357                                           |
> | |                                                                         |
> | 02:HASH JOIN [INNER JOIN, BROADCAST]                                      |
> | |  hash predicates: l_partkey = ps_partkey                                |
> | |  runtime filters: RF000 <- ps_partkey                                   |
> | |  row-size=36B cardinality=2.45K                                         |
> | |                                                                         |
> | |--07:EXCHANGE [BROADCAST]                                                |
> | |  |                                                                      |
> | |  01:SCAN HDFS [tpch.partsupp]                                           |
> | |     HDFS partitions=1/1 files=1 size=112.71MB                           |
> | |     predicates: ps_suppkey = 10                                         |
> | |     row-size=16B cardinality=82                                         |
> | |                                                                         |
> | 00:SCAN HDFS [tpch.lineitem]                                              |
> |    HDFS partitions=1/1 files=1 size=718.94MB                              |
> |    predicates: l_quantity IN (1, 2)                                       |
> |    runtime filters: RF000 -> l_partkey                                    |
> |    row-size=20B cardinality=235.34K                                       |
> +---------------------------------------------------------------------------+
> {noformat}
> This is likely a regression since based on some internal testing this was 
> working correctly on an earlier version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

[jira] [Commented] (IMPALA-10755) Wrong results for a query with predicate on an analytic function

Reply via email to