Victoria Markman created DRILL-2794:
---------------------------------------

             Summary: Partition pruning is not happening correctly when maxdir/mindir is used in the filter condition
                 Key: DRILL-2794
                 URL: https://issues.apache.org/jira/browse/DRILL-2794
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 0.9.0
            Reporter: Victoria Markman
            Assignee: Jinfeng Ni


Directory structure:
{code}
[Tue Apr 14 13:43:54 root@/mapr/vmarkman.cluster.com/test/smalltable ] # ls -R
.:
2014  2015  2016

./2014:

./2015:
01  02

./2015/01:
t1.csv

./2015/02:
t2.csv

./2016:
t1.csv

[Tue Apr 14 13:44:26 root@/mapr/vmarkman.cluster.com/test/bigtable ] # ls -R
.:
2015  2016

./2015:
01  02  03  04

./2015/01:
0_0_0.parquet  1_0_0.parquet  2_0_0.parquet  3_0_0.parquet  4_0_0.parquet  
5_0_0.parquet

./2015/02:
0_0_0.parquet

./2015/03:
0_0_0.parquet

./2015/04:
0_0_0.parquet

./2016:
01  parquet.file

./2016/01:
0_0_0.parquet
{code}
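
For reference, Drill exposes each directory level under the table root as an implicit dirN column, so the layouts above imply the following values (a sketch of the usual dir0/dir1 mapping, not output captured from the cluster):
{code}
-- dir0/dir1 values implied by the smalltable layout
-- (dir0 = first directory level under the table root, dir1 = second level)
--   /test/smalltable/2015/01/t1.csv  ->  dir0 = '2015', dir1 = '01'
--   /test/smalltable/2015/02/t2.csv  ->  dir0 = '2015', dir1 = '02'
--   /test/smalltable/2016/t1.csv     ->  dir0 = '2016', dir1 = null
select dir0, dir1 from smalltable;
{code}
This is why the combined predicate in the bug case below (dir0 = '2016' and dir1 = '01' after constant folding) matches no rows in smalltable.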

Simple case: partition pruning happens correctly, and only the 2016 directory is scanned from 'smalltable'.
{code}
0: jdbc:drill:schema=dfs> explain plan for select * from smalltable where dir0 
= maxdir('dfs.test', 'bigtable');
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Project(*=[$0])
00-03          Scan(groupscan=[EasyGroupScan [selectionRoot=/test/smalltable, 
numFiles=1, columns=[`*`], files=[maprfs:/test/smalltable/2016/t1.csv]]])
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "fs-scan",
    "@id" : 3,
    "files" : [ "maprfs:/test/smalltable/2016/t1.csv" ],
    "storage" : {
      "type" : "file",
      "enabled" : true,
      "connection" : "maprfs:///",
      "workspaces" : {
        "root" : {
          "location" : "/",
          "writable" : false,
          "defaultInputFormat" : null
        },
...
...
{code}
With an added second predicate (dir1 = mindir('dfs.test', 'bigtable/2016')), which evaluates to false (smalltable has no directory '01' under 2016), we end up scanning everything in smalltable. This does not look right to me and I think this is a bug.

{code}
0: jdbc:drill:schema=dfs> explain plan for select * from smalltable where dir0 
= maxdir('dfs.test', 'bigtable') and dir1 = mindir('dfs.test', 'bigtable/2016');
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Project(T15¦¦*=[$0])
00-03          SelectionVectorRemover
00-04            Filter(condition=[AND(=($1, '2016'), =($2, '01'))])
00-05              Project(T15¦¦*=[$0], dir0=[$1], dir1=[$2])
00-06                Scan(groupscan=[EasyGroupScan 
[selectionRoot=/test/smalltable, numFiles=3, columns=[`*`], 
files=[maprfs:/test/smalltable/2015/01/t1.csv, 
maprfs:/test/smalltable/2015/02/t2.csv, maprfs:/test/smalltable/2016/t1.csv]]])
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "fs-scan",
    "@id" : 6,
    "files" : [ "maprfs:/test/smalltable/2015/01/t1.csv", 
"maprfs:/test/smalltable/2015/02/t2.csv", "maprfs:/test/smalltable/2016/t1.csv" 
],
    "storage" : {
      "type" : "file",
      "enabled" : true,
      "connection" : "maprfs:///",
      "workspaces" : {
        "root" : {
          "location" : "/",
          "writable" : false,
          "defaultInputFormat" : null
        },
...
...
{code}
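
One way to narrow this down (a suggestion, not something run here) is to rewrite the folded values as literals and compare the plans; the values can be read off the Filter condition AND(=($1, '2016'), =($2, '01')) in the plan above. If the literal form prunes down to a single file while the maxdir/mindir form scans all three files, the problem is specific to how the constant-folded directory functions interact with pruning.
{code}
-- literal equivalent of the query above (untested sketch; values taken from
-- the Filter condition shown in the plan)
explain plan for select * from smalltable where dir0 = '2016' and dir1 = '01';
{code}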

Here is a similar example with a parquet file, where the predicate "a1 = 11" evaluates to false.

{code}
0: jdbc:drill:schema=dfs> explain plan for select * from bigtable where 
dir0=maxdir('dfs.test','bigtable') and a1 = 11;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Project(T25¦¦*=[$0])
00-03          SelectionVectorRemover
00-04            Filter(condition=[AND(=($1, '2016'), =($2, 11))])
00-05              Project(T25¦¦*=[$0], dir0=[$1], a1=[$2])
00-06                Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=maprfs:/test/bigtable/2016/01/0_0_0.parquet], 
ReadEntryWithPath [path=maprfs:/test/bigtable/2016/parquet.file]], 
selectionRoot=/test/bigtable, numFiles=2, columns=[`*`]]])
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "parquet-scan",
    "@id" : 6,
    "entries" : [ {
      "path" : "maprfs:/test/bigtable/2016/01/0_0_0.parquet"
    }, {
      "path" : "maprfs:/test/bigtable/2016/parquet.file"
    } ],
{code}

And finally, when we use the same table in the from clause and in maxdir/mindir, we scan only one file (to return the schema). I would think that the same should happen in the bug case above.
{code}
0: jdbc:drill:schema=dfs> explain plan for select * from bigtable where dir0 = 
maxdir('dfs.test', 'bigtable') and dir1 = mindir('dfs.test', 'bigtable/2016');
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Project(T29¦¦*=[$0])
00-03          SelectionVectorRemover
00-04            Filter(condition=[AND(=($1, '2016'), =($2, 'parquet.file'))])
00-05              Project(T29¦¦*=[$0], dir0=[$1], dir1=[$2])
00-06                Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath 
[path=maprfs:/test/bigtable/2015/01/4_0_0.parquet]], 
selectionRoot=/test/bigtable, numFiles=1, columns=[`*`]]])
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    },
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ ],
    "queue" : 0,
    "resultMode" : "EXEC"
  },
  "graph" : [ {
    "pop" : "parquet-scan",
    "@id" : 6,
    "entries" : [ {
      "path" : "maprfs:/test/bigtable/2015/01/4_0_0.parquet"
    } ],
{code}
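
For the smalltable bug case, the expected outcome by this analogy would be a plan that reads a single file just to provide the schema, something along the lines of the sketch below (not actual output; the particular file shown for the schema-only read is chosen purely for illustration):
{code}
00-03          Scan(groupscan=[EasyGroupScan [selectionRoot=/test/smalltable,
                 numFiles=1, columns=[`*`], files=[maprfs:/test/smalltable/2015/01/t1.csv]]])
{code}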







