Andries Engelbrecht created DRILL-2157: ------------------------------------------
Summary: Directory pruning on subdirectories only and data type conversions for directory filters Key: DRILL-2157 URL: https://issues.apache.org/jira/browse/DRILL-2157 Project: Apache Drill Issue Type: Improvement Components: Query Planning & Optimization Affects Versions: Future Reporter: Andries Engelbrecht Assignee: Jinfeng Ni Priority: Minor Drill will scan all files and directories when using only a subdirectory as a predicate. Additionally if the data type for the directory filter is not a string and is converted Drill will also first scan all the subdirectories adn files before applying the filter. My current observation is that for a directory structure as listed below, the pruning only works if the full tree is provided. If only a lower level directory is supplied in the filter condition Drill only uses it as a filter. With directory structure as below /2015 /01 /10 /11 /12 /13 /14 Query: select count(id) from `/foo` t where dir0='2015' and dir1='01' and dir2='10' Produces the correct pruning and query plan 01-02 Project(id=[$3]): rowcount = 3670316.0, cumulative cost = {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 28434 01-03 Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]): rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 28433 01-04 Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`] However: select count(id) from `/foo` t where dir2='10' Produces full scan of all sub directories and only applies a filter condition after the fact. Notice the numFiles between the 2, even though it lists columns in the base scan 01-04 Filter(condition=[=($0, '10')]): rowcount = 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 27470 01-05 Project(dir2=[$1], id=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 27469 01-06 Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`] Also using the wrong data type for the filter produces a full scan select count(id) from `/foo` where dir_year=2015 and dir_month=01 and dir_day=14 Produces 01-04 Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), =(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34910 01-05 Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34909 01-06 Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`id`, `dir0`, `dir1`, `dir2`], -- This message was sent by Atlassian JIRA (v6.3.4#6332)