Laurent Goujon created DRILL-4468:
-------------------------------------

             Summary: Aggregates over empty input might fail
                 Key: DRILL-4468
                 URL: https://issues.apache.org/jira/browse/DRILL-4468
             Project: Apache Drill
          Issue Type: Bug
         Environment: Linux/OpenJDK 7
            Reporter: Laurent Goujon


Some aggregation queries over an empty input might fail, depending on the column 
ordering.

For example, this query fails:
{noformat}
select sum(int_col) col1, sum(bigint_col) col2 from cp.`employee.json` where 1 
= 0
org.apache.drill.common.exceptions.UserRemoteException: UNSUPPORTED_OPERATION 
ERROR: Only COUNT, MIN and MAX aggregate functions supported for VarChar type 
Fragment 0:0 [Error Id: dcef042c-1c53-40df-88b0-816d3cb109a7 on xxx:31010] 
{noformat}

But this one succeeds:
{noformat}
select sum(bigint_col) col2, sum(int_col) col1 from cp.`employee.json` where 1 
= 0
null    null
{noformat}

Only one of the two queries fails because of DRILL-4467: as a consequence, the 
two physical plans are significantly different, and don't behave quite the same 
way.
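
For reference, the plans below can be reproduced with Drill's {{EXPLAIN PLAN 
FOR}} statement, e.g.:
{noformat}
explain plan for
select sum(int_col) col1, sum(bigint_col) col2 from cp.`employee.json` where 1 = 0
{noformat}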

Here's the Physical plan for the first query:
{noformat}
00-00    Screen : rowType = RecordType(ANY col1, ANY col2): rowcount = 1.0, 
cumulative cost = {464.1 rows, 950.1 cpu, 0.0 io, 0.0 network, 0.0 memory}, id 
= 339
00-01      Project(col1=[$0], col2=[$1]) : rowType = RecordType(ANY col1, ANY 
col2): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0 
network, 0.0 memory}, id = 338
00-02        StreamAgg(group=[{}], col1=[SUM($0)], col2=[SUM($1)]) : rowType = 
RecordType(ANY col1, ANY col2): rowcount = 1.0, cumulative cost = {464.0 rows, 
950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 337
00-03          Limit(offset=[0], fetch=[0]) : rowType = RecordType(ANY int_col, 
ANY bigint_col): rowcount = 1.0, cumulative cost = {463.0 rows, 926.0 cpu, 0.0 
io, 0.0 network, 0.0 memory}, id = 336
00-04            Scan(groupscan=[EasyGroupScan 
[selectionRoot=classpath:/employee.json, numFiles=1, columns=[`int_col`, 
`bigint_col`], files=[classpath:/employee.json]]]) : rowType = RecordType(ANY 
int_col, ANY bigint_col): rowcount = 463.0, cumulative cost = {463.0 rows, 
926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 335
{noformat} 

and the physical plan for the second query:
{noformat}
00-00    Screen : rowType = RecordType(ANY col2, ANY col1): rowcount = 1.0, 
cumulative cost = {464.1 rows, 950.1 cpu, 0.0 io, 0.0 network, 0.0 memory}, id 
= 775
00-01      Project(col2=[$0], col1=[$1]) : rowType = RecordType(ANY col2, ANY 
col1): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0 
network, 0.0 memory}, id = 774
00-02        StreamAgg(group=[{}], col2=[SUM($0)], col1=[SUM($1)]) : rowType = 
RecordType(ANY col2, ANY col1): rowcount = 1.0, cumulative cost = {464.0 rows, 
950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 773
00-03          Limit(offset=[0], fetch=[0]) : rowType = RecordType(ANY 
bigint_col, ANY int_col): rowcount = 1.0, cumulative cost = {463.0 rows, 926.0 
cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 772
00-04            Project(bigint_col=[$1], int_col=[$0]) : rowType = 
RecordType(ANY bigint_col, ANY int_col): rowcount = 463.0, cumulative cost = 
{463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 771
00-05              Scan(groupscan=[EasyGroupScan 
[selectionRoot=classpath:/employee.json, numFiles=1, columns=[`bigint_col`, 
`int_col`], files=[classpath:/employee.json]]]) : rowType = RecordType(ANY 
int_col, ANY bigint_col): rowcount = 463.0, cumulative cost = {463.0 rows, 
926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 770
{noformat}

The extra projection just before the scan seems to hide the VARCHAR type of the 
columns, and allows the aggregation to succeed. On the other hand, since the 
storage plugin supports column push-down, the projection is theoretically 
unnecessary.
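
Until this is fixed, a possible workaround (untested; it assumes the failure 
comes from the columns being typed as VARCHAR because no rows are read) is to 
pin the column types with an explicit CAST before aggregating:
{noformat}
select sum(cast(int_col as int)) col1, sum(cast(bigint_col as bigint)) col2
from cp.`employee.json` where 1 = 0
{noformat}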



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)