Laurent Goujon created DRILL-4468: ------------------------------------- Summary: Aggregates over empty input might fail Key: DRILL-4468 URL: https://issues.apache.org/jira/browse/DRILL-4468 Project: Apache Drill Issue Type: Bug Environment: Linux/OpenJDK 7 Reporter: Laurent Goujon
Some aggregation queries over empty input might fail, depending of the column ordering. This query for example would fail: {noformat} select sum(int_col) col1, sum(bigint_col) col2 from cp.`employee.json` where 1 = 0 org.apache.drill.common.exceptions.UserRemoteException: UNSUPPORTED_OPERATION ERROR: Only COUNT, MIN and MAX aggregate functions supported for VarChar type Fragment 0:0 [Error Id: dcef042c-1c53-40df-88b0-816d3cb109a7 on xxx:31010] {noformat} But this one would succeed: {noformat} select sum(bigint_col) col2, sum(int_col) col1 from cp.`employee.json` where 1 = 0 null null {noformat} The reason for why only one query fails is because of DRILL-4467. The consequence is that the plans are significantly different, and don't behave quite the same way. Here's the Physical plan for the first query: {noformat} 00-00 Screen : rowType = RecordType(ANY col1, ANY col2): rowcount = 1.0, cumulative cost = {464.1 rows, 950.1 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 339 00-01 Project(col1=[$0], col2=[$1]) : rowType = RecordType(ANY col1, ANY col2): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 338 00-02 StreamAgg(group=[{}], col1=[SUM($0)], col2=[SUM($1)]) : rowType = RecordType(ANY col1, ANY col2): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 337 00-03 Limit(offset=[0], fetch=[0]) : rowType = RecordType(ANY int_col, ANY bigint_col): rowcount = 1.0, cumulative cost = {463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 336 00-04 Scan(groupscan=[EasyGroupScan [selectionRoot=classpath:/employee.json, numFiles=1, columns=[`int_col`, `bigint_col`], files=[classpath:/employee.json]]]) : rowType = RecordType(ANY int_col, ANY bigint_col): rowcount = 463.0, cumulative cost = {463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 335 {noformat} and the physical plan for the second query: {noformat} 00-00 Screen : rowType = RecordType(ANY col2, ANY col1): rowcount = 1.0, cumulative cost = {464.1 rows, 950.1 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 775 00-01 Project(col2=[$0], col1=[$1]) : rowType = RecordType(ANY col2, ANY col1): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 774 00-02 StreamAgg(group=[{}], col2=[SUM($0)], col1=[SUM($1)]) : rowType = RecordType(ANY col2, ANY col1): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 773 00-03 Limit(offset=[0], fetch=[0]) : rowType = RecordType(ANY bigint_col, ANY int_col): rowcount = 1.0, cumulative cost = {463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 772 00-04 Project(bigint_col=[$1], int_col=[$0]) : rowType = RecordType(ANY bigint_col, ANY int_col): rowcount = 463.0, cumulative cost = {463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 771 00-05 Scan(groupscan=[EasyGroupScan [selectionRoot=classpath:/employee.json, numFiles=1, columns=[`bigint_col`, `int_col`], files=[classpath:/employee.json]]]) : rowType = RecordType(ANY int_col, ANY bigint_col): rowcount = 463.0, cumulative cost = {463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 770 {noformat} The extra projection just before the scan seems to hide the VARCHAR type of the columns, and allow for aggregation to succeed. On the other hand, the storage plugin allows for column push down, so the projection is theoretically unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)