Victoria Markman created DRILL-2794:
---------------------------------------

             Summary: Partition pruning is not happening correctly when maxdir/mindir is used in the filter condition
                 Key: DRILL-2794
                 URL: https://issues.apache.org/jira/browse/DRILL-2794
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 0.9.0
            Reporter: Victoria Markman
            Assignee: Jinfeng Ni

Directory structure:

{code}
[Tue Apr 14 13:43:54 root@/mapr/vmarkman.cluster.com/test/smalltable ] # ls -R
.:
2014  2015  2016

./2014:

./2015:
01  02

./2015/01:
t1.csv

./2015/02:
t2.csv

./2016:
t1.csv

[Tue Apr 14 13:44:26 root@/mapr/vmarkman.cluster.com/test/bigtable ] # ls -R
.:
2015  2016

./2015:
01  02  03  04

./2015/01:
0_0_0.parquet  1_0_0.parquet  2_0_0.parquet  3_0_0.parquet  4_0_0.parquet  5_0_0.parquet

./2015/02:
0_0_0.parquet

./2015/03:
0_0_0.parquet

./2015/04:
0_0_0.parquet

./2016:
01  parquet.file

./2016/01:
0_0_0.parquet
{code}

Simple case, where partition pruning happens correctly: only the 2016 directory is scanned from 'smalltable'.

{code}
0: jdbc:drill:schema=dfs> explain plan for select * from smalltable where dir0 = maxdir('dfs.test', 'bigtable');
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Project(*=[$0])
00-03          Scan(groupscan=[EasyGroupScan [selectionRoot=/test/smalltable, numFiles=1, columns=[`*`], files=[maprfs:/test/smalltable/2016/t1.csv]]])
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "fs-scan",
    "@id" : 3,
    "files" : [ "maprfs:/test/smalltable/2016/t1.csv" ],
    "storage" : {
      "type" : "file",
      "enabled" : true,
      "connection" : "maprfs:///",
      "workspaces" : {
        "root" : {
          "location" : "/",
          "writable" : false,
          "defaultInputFormat" : null
        },
...
...
{code}

With a second predicate added, dir1 = mindir('dfs.test', 'bigtable/2016'), which evaluates to false (there is no directory '01' in smalltable), we end up scanning everything in smalltable. This does not look right to me, and I think it is a bug.

{code}
0: jdbc:drill:schema=dfs> explain plan for select * from smalltable where dir0 = maxdir('dfs.test', 'bigtable') and dir1 = mindir('dfs.test', 'bigtable/2016');
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Project(T15¦¦*=[$0])
00-03          SelectionVectorRemover
00-04            Filter(condition=[AND(=($1, '2016'), =($2, '01'))])
00-05              Project(T15¦¦*=[$0], dir0=[$1], dir1=[$2])
00-06                Scan(groupscan=[EasyGroupScan [selectionRoot=/test/smalltable, numFiles=3, columns=[`*`], files=[maprfs:/test/smalltable/2015/01/t1.csv, maprfs:/test/smalltable/2015/02/t2.csv, maprfs:/test/smalltable/2016/t1.csv]]])
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "fs-scan",
    "@id" : 6,
    "files" : [ "maprfs:/test/smalltable/2015/01/t1.csv", "maprfs:/test/smalltable/2015/02/t2.csv", "maprfs:/test/smalltable/2016/t1.csv" ],
    "storage" : {
      "type" : "file",
      "enabled" : true,
      "connection" : "maprfs:///",
      "workspaces" : {
        "root" : {
          "location" : "/",
          "writable" : false,
          "defaultInputFormat" : null
        },
...
...
{code}
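For comparison, the Filter condition in the plan above shows that maxdir/mindir were already folded to the constants '2016' and '01' at planning time. Below is a minimal, purely illustrative sketch of the same filter written with those literals directly (the literals are copied from the Filter condition above); one would expect both forms to prune to the same set of files.

{code}
-- Illustrative comparison only: the directory functions replaced by the
-- constants shown in the Filter condition above (maxdir -> '2016',
-- mindir -> '01').
explain plan for
select *
from smalltable
where dir0 = '2016'
  and dir1 = '01';
{code}

Comparing the plan for this form with the one above would show whether the fallback to a full scan is specific to how the directory functions are constant-folded.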
Here is a similar example with a parquet file, where the predicate "a1 = 11" evaluates to false.

{code}
0: jdbc:drill:schema=dfs> explain plan for select * from bigtable where dir0=maxdir('dfs.test','bigtable') and a1 = 11;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Project(T25¦¦*=[$0])
00-03          SelectionVectorRemover
00-04            Filter(condition=[AND(=($1, '2016'), =($2, 11))])
00-05              Project(T25¦¦*=[$0], dir0=[$1], a1=[$2])
00-06                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/bigtable/2016/01/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/test/bigtable/2016/parquet.file]], selectionRoot=/test/bigtable, numFiles=2, columns=[`*`]]])
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "parquet-scan",
    "@id" : 6,
    "entries" : [ {
      "path" : "maprfs:/test/bigtable/2016/01/0_0_0.parquet"
    }, {
      "path" : "maprfs:/test/bigtable/2016/parquet.file"
    } ],
{code}

And finally, when we use the same table in the FROM clause and in maxdir/mindir, we scan only one file (to return the schema). I would think that the same should happen in the bug case above.

{code}
0: jdbc:drill:schema=dfs> explain plan for select * from bigtable where dir0 = maxdir('dfs.test', 'bigtable') and dir1 = mindir('dfs.test', 'bigtable/2016');
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Project(T29¦¦*=[$0])
00-03          SelectionVectorRemover
00-04            Filter(condition=[AND(=($1, '2016'), =($2, 'parquet.file'))])
00-05              Project(T29¦¦*=[$0], dir0=[$1], dir1=[$2])
00-06                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/bigtable/2015/01/4_0_0.parquet]], selectionRoot=/test/bigtable, numFiles=1, columns=[`*`]]])
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "parquet-scan",
    "@id" : 6,
    "entries" : [ {
      "path" : "maprfs:/test/bigtable/2015/01/4_0_0.parquet"
    } ],
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)