Aman Sinha created IMPALA-10755: ----------------------------------- Summary: Wrong results for a query with predicate on an analytic function Key: IMPALA-10755 URL: https://issues.apache.org/jira/browse/IMPALA-10755 Project: IMPALA Issue Type: Bug Components: Frontend Affects Versions: Impala 4.0 Reporter: Aman Sinha
In the master branch, the following query with equality predicate on analytic function produces 13 rows whereas the correct result is 3 rows (as verified with Hive): {noformat} use tpch; select * FROM ( SELECT l_quantity, l_linenumber, nr_pvp, max(nr_pvp) over (partition BY l_quantity) max_nr_pvp FROM ( SELECT l_quantity, l_linenumber, count(1) AS nr_pvp FROM lineitem, partsupp where l_partkey = ps_partkey and ps_suppkey = 10 and l_quantity in (1, 2) GROUP BY l_quantity, l_linenumber ) a ) b WHERE nr_pvp = max_nr_pvp order by 1, 2, 3, 4; {noformat} Wrong results: {noformat} +------------+--------------+--------+------------+ | l_quantity | l_linenumber | nr_pvp | max_nr_pvp | +------------+--------------+--------+------------+ | 1.00 | 1 | 11 | 12 | | 1.00 | 2 | 12 | 12 | | 1.00 | 3 | 10 | 12 | | 1.00 | 4 | 5 | 12 | | 1.00 | 5 | 4 | 12 | | 1.00 | 6 | 5 | 12 | | 1.00 | 7 | 2 | 12 | | 2.00 | 1 | 12 | 12 | | 2.00 | 2 | 12 | 12 | | 2.00 | 3 | 11 | 12 | | 2.00 | 4 | 8 | 12 | | 2.00 | 5 | 3 | 12 | | 2.00 | 6 | 2 | 12 | +------------+--------------+--------+------------+ Fetched 13 row(s) in 1.36s {noformat} Expected results (verified with Hive): {noformat} +---------------+-----------------+-----------+---------------+ | b.l_quantity | b.l_linenumber | b.nr_pvp | b.max_nr_pvp | +---------------+-----------------+-----------+---------------+ | 1.00 | 2 | 12 | 12 | | 2.00 | 1 | 12 | 12 | | 2.00 | 2 | 12 | 12 | +---------------+-----------------+-----------+---------------+ {noformat} Explain plan for Impala shows that the equality predicate 'nr_pvp = max_nr_pvp' which is supposed to be applied on top of the analytic operator is missing: {noformat} +---------------------------------------------------------------------------+ | Explain String | +---------------------------------------------------------------------------+ | Max Per-Host Resource Reservation: Memory=42.88MB Threads=7 | | Per-Host Resource Estimates: Memory=483MB | | | | PLAN-ROOT SINK | | | | | 11:MERGING-EXCHANGE [UNPARTITIONED] | | | order by: l_quantity ASC, l_linenumber ASC, nr_pvp ASC, max_nr_pvp ASC | | | | | 06:SORT | | | order by: l_quantity ASC, l_linenumber ASC, nr_pvp ASC, max_nr_pvp ASC | | | row-size=28B cardinality=357 | | | | | 05:ANALYTIC | | | functions: max(count(*)) | | | partition by: l_quantity | | | row-size=28B cardinality=357 | | | | | 04:SORT | | | order by: l_quantity ASC NULLS LAST | | | row-size=20B cardinality=357 | | | | | 10:EXCHANGE [HASH(l_quantity)] | | | | | 09:AGGREGATE [FINALIZE] | | | output: count:merge(*) | | | group by: l_quantity, l_linenumber | | | row-size=20B cardinality=357 | | | | | 08:EXCHANGE [HASH(l_quantity,l_linenumber)] | | | | | 03:AGGREGATE [STREAMING] | | | output: count(*) | | | group by: l_quantity, l_linenumber | | | row-size=20B cardinality=357 | | | | | 02:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: l_partkey = ps_partkey | | | runtime filters: RF000 <- ps_partkey | | | row-size=36B cardinality=2.45K | | | | | |--07:EXCHANGE [BROADCAST] | | | | | | | 01:SCAN HDFS [tpch.partsupp] | | | HDFS partitions=1/1 files=1 size=112.71MB | | | predicates: ps_suppkey = 10 | | | row-size=16B cardinality=82 | | | | | 00:SCAN HDFS [tpch.lineitem] | | HDFS partitions=1/1 files=1 size=718.94MB | | predicates: l_quantity IN (1, 2) | | runtime filters: RF000 -> l_partkey | | row-size=20B cardinality=235.34K | +---------------------------------------------------------------------------+ {noformat} This is likely a regression since based on some internal testing this was working correctly on an earlier version. -- This message was sent by Atlassian Jira (v8.3.4#803005)