[jira] [Commented] (DRILL-2157) Directory pruning on subdirectories only and data type conversions for directory filters
[ https://issues.apache.org/jira/browse/DRILL-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486406#comment-14486406 ] Andries Engelbrecht commented on DRILL-2157: Can be marked as resolved. Tested with following use cases and shows the correct numFiles being scanned. Casting or string no longer needed for lower level directories. Seems to be resolved in Drill 0.8 for direct DFS access select count(id) from `/nfl` where dir0=2015 and dir1=01 and dir2=20 and dir3 between 00 and 05; select count(id) from `/nfl` where dir0=2015 and dir1=01 and dir2 between 20 and 25 and dir3 between 00 and 05; select count(id) from `/nfl` where dir0=2015 and dir1=01 and dir2 between 20 and 25 and dir3 between 00 and 05; select count(id) from `/nfl` where dir2 between 20 and 25 and dir3 between 00 and 05; select count(id) from `/nfl` where dir225 and dir3 between 00 and 05; For Views also seems to report the correct number of files to be scanned with directory pruning select count(id) from dfs.views.tweet_base where dir_year=2015 and dir_month=01 and dir_day=26 and dir_hour20; select count(id) from dfs.views.tweet_base where dir_day between 20 and 26 and dir_hour20; Directory pruning on subdirectories only and data type conversions for directory filters Key: DRILL-2157 URL: https://issues.apache.org/jira/browse/DRILL-2157 Project: Apache Drill Issue Type: Improvement Components: Query Planning Optimization Affects Versions: Future Reporter: Andries Engelbrecht Assignee: Aman Sinha Priority: Minor Fix For: 0.8.0 Drill will scan all files and directories when using only a subdirectory as a predicate. Additionally if the data type for the directory filter is not a string and is converted Drill will also first scan all the subdirectories adn files before applying the filter. My current observation is that for a directory structure as listed below, the pruning only works if the full tree is provided. If only a lower level directory is supplied in the filter condition Drill only uses it as a filter. With directory structure as below /2015 /01 /10 /11 /12 /13 /14 Query: select count(id) from `/foo` t where dir0='2015' and dir1='01' and dir2='10' Produces the correct pruning and query plan 01-02Project(id=[$3]): rowcount = 3670316.0, cumulative cost = {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 28434 01-03 Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]): rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 28433 01-04Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`] However: select count(id) from `/foo` t where dir2='10' Produces full scan of all sub directories and only applies a filter condition after the fact. Notice the numFiles between the 2, even though it lists columns in the base scan 01-04Filter(condition=[=($0, '10')]): rowcount = 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 27470 01-05 Project(dir2=[$1], id=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 27469 01-06Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`] Also using the wrong data type for the filter produces a full scan select count(id) from `/foo` where dir_year=2015 and dir_month=01 and dir_day=14 Produces 01-04Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), =(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34910 01-05 Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34909 01-06Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`id`, `dir0`, `dir1`, `dir2`], -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-2157) Directory pruning on subdirectories only and data type conversions for directory filters
[ https://issues.apache.org/jira/browse/DRILL-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375041#comment-14375041 ] Aman Sinha commented on DRILL-2157: --- Most of the issues listed here have been fixed by DRILL-2353 and are available in the 0.8 release. This includes filters that have range conditions (between, , etc.) and filters on subdirectories without specifying parent directory. I am not completely sure about the casting issue since I could not reproduce the original issue. I am marking this as fixed for 0.8. [~aengelbrecht] could you please try the 0.8 release and verify ? If there are remaining issues, pls open a new JIRA. Directory pruning on subdirectories only and data type conversions for directory filters Key: DRILL-2157 URL: https://issues.apache.org/jira/browse/DRILL-2157 Project: Apache Drill Issue Type: Improvement Components: Query Planning Optimization Affects Versions: Future Reporter: Andries Engelbrecht Assignee: Aman Sinha Priority: Minor Fix For: 0.8.0 Drill will scan all files and directories when using only a subdirectory as a predicate. Additionally if the data type for the directory filter is not a string and is converted Drill will also first scan all the subdirectories adn files before applying the filter. My current observation is that for a directory structure as listed below, the pruning only works if the full tree is provided. If only a lower level directory is supplied in the filter condition Drill only uses it as a filter. With directory structure as below /2015 /01 /10 /11 /12 /13 /14 Query: select count(id) from `/foo` t where dir0='2015' and dir1='01' and dir2='10' Produces the correct pruning and query plan 01-02Project(id=[$3]): rowcount = 3670316.0, cumulative cost = {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 28434 01-03 Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]): rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 28433 01-04Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`] However: select count(id) from `/foo` t where dir2='10' Produces full scan of all sub directories and only applies a filter condition after the fact. Notice the numFiles between the 2, even though it lists columns in the base scan 01-04Filter(condition=[=($0, '10')]): rowcount = 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 27470 01-05 Project(dir2=[$1], id=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 27469 01-06Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`] Also using the wrong data type for the filter produces a full scan select count(id) from `/foo` where dir_year=2015 and dir_month=01 and dir_day=14 Produces 01-04Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), =(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34910 01-05 Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34909 01-06Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`id`, `dir0`, `dir1`, `dir2`], -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-2157) Directory pruning on subdirectories only and data type conversions for directory filters
[ https://issues.apache.org/jira/browse/DRILL-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375103#comment-14375103 ] Andries Engelbrecht commented on DRILL-2157: Aman I will test the conditions, casting and views with the final 0.8 release. Will file a new JIRA if there are any specific conditions that may still cause issues. Directory pruning on subdirectories only and data type conversions for directory filters Key: DRILL-2157 URL: https://issues.apache.org/jira/browse/DRILL-2157 Project: Apache Drill Issue Type: Improvement Components: Query Planning Optimization Affects Versions: Future Reporter: Andries Engelbrecht Assignee: Aman Sinha Priority: Minor Fix For: 0.8.0 Drill will scan all files and directories when using only a subdirectory as a predicate. Additionally if the data type for the directory filter is not a string and is converted Drill will also first scan all the subdirectories adn files before applying the filter. My current observation is that for a directory structure as listed below, the pruning only works if the full tree is provided. If only a lower level directory is supplied in the filter condition Drill only uses it as a filter. With directory structure as below /2015 /01 /10 /11 /12 /13 /14 Query: select count(id) from `/foo` t where dir0='2015' and dir1='01' and dir2='10' Produces the correct pruning and query plan 01-02Project(id=[$3]): rowcount = 3670316.0, cumulative cost = {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 28434 01-03 Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]): rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 28433 01-04Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`] However: select count(id) from `/foo` t where dir2='10' Produces full scan of all sub directories and only applies a filter condition after the fact. Notice the numFiles between the 2, even though it lists columns in the base scan 01-04Filter(condition=[=($0, '10')]): rowcount = 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 27470 01-05 Project(dir2=[$1], id=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 27469 01-06Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`] Also using the wrong data type for the filter produces a full scan select count(id) from `/foo` where dir_year=2015 and dir_month=01 and dir_day=14 Produces 01-04Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), =(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34910 01-05 Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34909 01-06Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`id`, `dir0`, `dir1`, `dir2`], -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-2157) Directory pruning on subdirectories only and data type conversions for directory filters
[ https://issues.apache.org/jira/browse/DRILL-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305368#comment-14305368 ] Andries Engelbrecht commented on DRILL-2157: In addition the ability to perform directory pruning with and expressions. Also if the ability to support directory pruning in views where there is a predicate filter i.e. the view below is unable to perform directory pruning while views without the predicate filter are. Create or replace view maprfs.views.`retweeted` as select CAST(t.`id` as BIGINT) as `id`, CAST(t.retweeted_status.`id` as BIGINT) as `retweet_id`, t.dir0 as dir_year, t.dir1 as dir_month, t.dir2 as dir_day, t.dir3 as dir_hour, CAST(t.retweeted_status.`created_at` as VARCHAR(40)) as `created_at`, to_date ((concat (substring(t.retweeted_status.`created_at`, 5,6),substring(t.retweeted_status.`created_at`, 26,5))), 'MMM dd ') as `date`, to_timestamp ((concat (substring(t.retweeted_status.`created_at`, 5,6),substring(t.retweeted_status.`created_at`, 26,5),substring(t.retweeted_status.`created_at`, 11,9))), 'MMM dd HH:mm:ss') as `timestamp`, CAST(t.retweeted_status.`text` as VARCHAR(140)) as `tweet`, CAST(t.retweeted_status.`user`.`favorites_count` as INT) as `favorites_count` from maprfs.twitter.`/nfl` t where t.retweeted_status.`user`.`name` is not null; Directory pruning on subdirectories only and data type conversions for directory filters Key: DRILL-2157 URL: https://issues.apache.org/jira/browse/DRILL-2157 Project: Apache Drill Issue Type: Improvement Components: Query Planning Optimization Affects Versions: Future Reporter: Andries Engelbrecht Assignee: Jinfeng Ni Priority: Minor Drill will scan all files and directories when using only a subdirectory as a predicate. Additionally if the data type for the directory filter is not a string and is converted Drill will also first scan all the subdirectories adn files before applying the filter. My current observation is that for a directory structure as listed below, the pruning only works if the full tree is provided. If only a lower level directory is supplied in the filter condition Drill only uses it as a filter. With directory structure as below /2015 /01 /10 /11 /12 /13 /14 Query: select count(id) from `/foo` t where dir0='2015' and dir1='01' and dir2='10' Produces the correct pruning and query plan 01-02Project(id=[$3]): rowcount = 3670316.0, cumulative cost = {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 28434 01-03 Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]): rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 28433 01-04Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`] However: select count(id) from `/foo` t where dir2='10' Produces full scan of all sub directories and only applies a filter condition after the fact. Notice the numFiles between the 2, even though it lists columns in the base scan 01-04Filter(condition=[=($0, '10')]): rowcount = 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 27470 01-05 Project(dir2=[$1], id=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 27469 01-06Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`] Also using the wrong data type for the filter produces a full scan select count(id) from `/foo` where dir_year=2015 and dir_month=01 and dir_day=14 Produces 01-04Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), =(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34910 01-05 Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34909 01-06Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`id`, `dir0`, `dir1`, `dir2`], -- This message was sent by Atlassian JIRA (v6.3.4#6332)