[ 
https://issues.apache.org/jira/browse/DRILL-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554664#comment-16554664
 ] 

Parth Chandra commented on DRILL-6631:
--------------------------------------

The issue is caused by incorrect handling of empty batches in the streaming 
aggregator. In case of empty input and no group by, streaming agg sends out a 
'special' batch with no ( or null) records and a row count of 1. Once a special 
batch has been sent, streaming agg always returned a NONE outcome on subsequent 
calls to next().

In a lateral/unnest subquery, this behaviour needs to be emulated for every 
empty batch produced by unnest.  However, we cannot return NONE after sending 
out such a batch, but must reset the state. Streaming agg is handling this 
incorrectly and returning NONE causing the query to terminate early.

There are other issues with the handling of state in such a case. However none 
of the issues is caught by the unit tests because they all have a group-by.

 

> Wrong result from LateralUnnest query with aggregation and order by
> -------------------------------------------------------------------
>
>                 Key: DRILL-6631
>                 URL: https://issues.apache.org/jira/browse/DRILL-6631
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.14.0
>            Reporter: Parth Chandra
>            Priority: Major
>
> Reported by Chun:
> The following query gives correct result:
> {noformat}
> 0: jdbc:drill:zk=10.10.30.166:5181> select customer.c_custkey, 
> customer.c_name, orders.totalprice from customer, lateral (select 
> sum(t.o.o_totalprice) as totalprice from unnest(customer.c_orders) t(o) WHERE 
> t.o.o_totalprice in 
> (89230.03,270087.44,246408.53,82657.72,153941.38,65277.06,180309.76)) orders 
> where customer.c_custkey = 101276;
> +------------+---------------------+-------------+
> | c_custkey  |       c_name        | totalprice  |
> +------------+---------------------+-------------+
> | 101276     | Customer#000101276  | 82657.72    |
> +------------+---------------------+-------------+
> 1 row selected (6.184 seconds)
> {noformat}
> But if I remove the where clause and replace it with order by and limit, I 
> got the following empty result set. This is wrong.
> {noformat}
> 0: jdbc:drill:zk=10.10.30.166:5181> select customer.c_custkey, 
> customer.c_name, orders.totalprice from customer, lateral (select 
> sum(t.o.o_totalprice) as totalprice from unnest(customer.c_orders) t(o) WHERE 
> t.o.o_totalprice in 
> (89230.03,270087.44,246408.53,82657.72,153941.38,65277.06,180309.76)) orders 
> order by customer.c_custkey limit 50;
> +------------+---------+-------------+
> | c_custkey  | c_name  | totalprice  |
> +------------+---------+-------------+
> +------------+---------+-------------+
> No rows selected (2.753 seconds)
> {noformat}
> Here is the plan for the query giving the correct result:
> {noformat}
> 00-00    Screen : rowType = RecordType(ANY c_custkey, ANY c_name, ANY 
> totalprice): rowcount = 472783.35, cumulative cost = {8242193.7349999985 
> rows, 4.102185433499999E7 cpu, 0.0 io, 5.809561804799999E9 network, 0.0 
> memory}, id = 14410
> 00-01      Project(c_custkey=[$0], c_name=[$1], totalprice=[$2]) : rowType = 
> RecordType(ANY c_custkey, ANY c_name, ANY totalprice): rowcount = 472783.35, 
> cumulative cost = {8194915.3999999985 rows, 4.097457599999999E7 cpu, 0.0 io, 
> 5.809561804799999E9 network, 0.0 memory}, id = 14409
> 00-02        UnionExchange : rowType = RecordType(ANY c_custkey, ANY c_name, 
> ANY totalprice): rowcount = 472783.35, cumulative cost = {7722132.049999999 
> rows, 3.9556225949999996E7 cpu, 0.0 io, 5.809561804799999E9 network, 0.0 
> memory}, id = 14408
> 01-01          LateralJoin(correlation=[$cor1], joinType=[inner], 
> requiredColumns=[{0}], column excluded from output: =[`c_orders`]) : rowType 
> = RecordType(ANY c_custkey, ANY c_name, ANY totalprice): rowcount = 
> 472783.35, cumulative cost = {7249348.699999999 rows, 3.577395915E7 cpu, 0.0 
> io, 0.0 network, 0.0 memory}, id = 14407
> 01-03            SelectionVectorRemover : rowType = RecordType(ANY c_orders, 
> ANY c_custkey, ANY c_name): rowcount = 472783.35, cumulative cost = 
> {6776561.35 rows, 2.442713975E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 
> 14403
> 01-05              Filter(condition=[=($1, 101276)]) : rowType = 
> RecordType(ANY c_orders, ANY c_custkey, ANY c_name): rowcount = 472783.35, 
> cumulative cost = {6303778.0 rows, 2.39543564E7 cpu, 0.0 io, 0.0 network, 0.0 
> memory}, id = 14402
> 01-07                Scan(groupscan=[EasyGroupScan 
> [selectionRoot=maprfs:/drill/testdata/lateral/tpchsf1/json/customer, 
> numFiles=10, columns=[`c_orders`, `c_custkey`, `c_name`], 
> files=[maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_6.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_4.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_3.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_7.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_5.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_2.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_0.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_8.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_1.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_9.json]]]) : 
> rowType = RecordType(ANY c_orders, ANY c_custkey, ANY c_name): rowcount = 
> 3151889.0, cumulative cost = {3151889.0 rows, 9455667.0 cpu, 0.0 io, 0.0 
> network, 0.0 memory}, id = 14401
> 01-02            StreamAgg(group=[{}], totalprice=[SUM($0)]) : rowType = 
> RecordType(ANY totalprice): rowcount = 1.0, cumulative cost = {4.0 rows, 19.0 
> cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 14406
> 01-04              Filter(condition=[OR(=($0, 89230.03), =($0, 270087.44), 
> =($0, 246408.53), =($0, 82657.72), =($0, 153941.38), =($0, 65277.06), =($0, 
> 180309.76))]) : rowType = RecordType(ANY ITEM): rowcount = 1.0, cumulative 
> cost = {3.0 rows, 7.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 14405
> 01-06                Project(ITEM=[ITEM($0, 'o_totalprice')]) : rowType = 
> RecordType(ANY ITEM): rowcount = 1.0, cumulative cost = {2.0 rows, 2.0 cpu, 
> 0.0 io, 0.0 network, 0.0 memory}, id = 14404
> 01-08                  Unnest [srcOp=01-01]  : rowType = RecordType(ANY 
> c_orders): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io, 0.0 
> network, 0.0 memory}, id = 14240
> {noformat}
> And here is the plan for the query giving wrong result:
> {noformat}
> 00-00    Screen : rowType = RecordType(ANY c_custkey, ANY c_name, ANY 
> totalprice): rowcount = 50.0, cumulative cost = {2.2063382E7 rows, 
> 2.6342088596853146E8 cpu, 0.0 io, 7.7460824064E10 network, 0.0 memory}, id = 
> 15076
> 00-01      Project(c_custkey=[$0], c_name=[$1], totalprice=[$2]) : rowType = 
> RecordType(ANY c_custkey, ANY c_name, ANY totalprice): rowcount = 50.0, 
> cumulative cost = {2.2063377E7 rows, 2.6342088096853146E8 cpu, 0.0 io, 
> 7.7460824064E10 network, 0.0 memory}, id = 15075
> 00-02        SelectionVectorRemover : rowType = RecordType(ANY c_custkey, ANY 
> c_name, ANY totalprice): rowcount = 50.0, cumulative cost = {2.2063327E7 
> rows, 2.6342073096853146E8 cpu, 0.0 io, 7.7460824064E10 network, 0.0 memory}, 
> id = 15074
> 00-03          Limit(fetch=[50]) : rowType = RecordType(ANY c_custkey, ANY 
> c_name, ANY totalprice): rowcount = 50.0, cumulative cost = {2.2063277E7 
> rows, 2.6342068096853146E8 cpu, 0.0 io, 7.7460824064E10 network, 0.0 memory}, 
> id = 15073
> 00-04            SingleMergeExchange(sort0=[0]) : rowType = RecordType(ANY 
> c_custkey, ANY c_name, ANY totalprice): rowcount = 3151889.0, cumulative cost 
> = {2.2063227E7 rows, 2.6342048096853146E8 cpu, 0.0 io, 7.7460824064E10 
> network, 0.0 memory}, id = 15072
> 01-01              OrderedMuxExchange(sort0=[0]) : rowType = RecordType(ANY 
> c_custkey, ANY c_name, ANY totalprice): rowcount = 3151889.0, cumulative cost 
> = {1.8911338E7 rows, 2.1299025696853146E8 cpu, 0.0 io, 3.8730412032E10 
> network, 0.0 memory}, id = 15071
> 02-01                SelectionVectorRemover : rowType = RecordType(ANY 
> c_custkey, ANY c_name, ANY totalprice): rowcount = 3151889.0, cumulative cost 
> = {1.5759449E7 rows, 2.0983836796853146E8 cpu, 0.0 io, 3.8730412032E10 
> network, 0.0 memory}, id = 15070
> 02-02                  TopN(limit=[50]) : rowType = RecordType(ANY c_custkey, 
> ANY c_name, ANY totalprice): rowcount = 3151889.0, cumulative cost = 
> {1.260756E7 rows, 2.0668647896853146E8 cpu, 0.0 io, 3.8730412032E10 network, 
> 0.0 memory}, id = 15069
> 02-03                    HashToRandomExchange(dist0=[[$0]]) : rowType = 
> RecordType(ANY c_custkey, ANY c_name, ANY totalprice): rowcount = 3151889.0, 
> cumulative cost = {9455671.0 rows, 1.35531246E8 cpu, 0.0 io, 3.8730412032E10 
> network, 0.0 memory}, id = 15068
> 03-01                      LateralJoin(correlation=[$cor1], joinType=[inner], 
> requiredColumns=[{0}], column excluded from output: =[`c_orders`]) : rowType 
> = RecordType(ANY c_custkey, ANY c_name, ANY totalprice): rowcount = 
> 3151889.0, cumulative cost = {6303782.0 rows, 8.5101022E7 cpu, 0.0 io, 0.0 
> network, 0.0 memory}, id = 15067
> 03-03                        Scan(groupscan=[EasyGroupScan 
> [selectionRoot=maprfs:/drill/testdata/lateral/tpchsf1/json/customer, 
> numFiles=10, columns=[`c_orders`, `c_custkey`, `c_name`], 
> files=[maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_6.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_4.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_3.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_7.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_5.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_2.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_0.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_8.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_1.json, 
> maprfs:///drill/testdata/lateral/tpchsf1/json/customer/0_0_9.json]]]) : 
> rowType = RecordType(ANY c_orders, ANY c_custkey, ANY c_name): rowcount = 
> 3151889.0, cumulative cost = {3151889.0 rows, 9455667.0 cpu, 0.0 io, 0.0 
> network, 0.0 memory}, id = 15063
> 03-02                        StreamAgg(group=[{}], totalprice=[SUM($0)]) : 
> rowType = RecordType(ANY totalprice): rowcount = 1.0, cumulative cost = {4.0 
> rows, 19.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 15066
> 03-04                          Filter(condition=[OR(=($0, 89230.03), =($0, 
> 270087.44), =($0, 246408.53), =($0, 82657.72), =($0, 153941.38), =($0, 
> 65277.06), =($0, 180309.76))]) : rowType = RecordType(ANY ITEM): rowcount = 
> 1.0, cumulative cost = {3.0 rows, 7.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, 
> id = 15065
> 03-05                            Project(ITEM=[ITEM($0, 'o_totalprice')]) : 
> rowType = RecordType(ANY ITEM): rowcount = 1.0, cumulative cost = {2.0 rows, 
> 2.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 15064
> 03-06                              Unnest [srcOp=03-01]  : rowType = 
> RecordType(ANY c_orders): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 
> cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 14860{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to