[jira] [Commented] (DRILL-2157) Directory pruning on subdirectories only and data type conversions for directory filters

2015-04-08 Thread Andries Engelbrecht (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486406#comment-14486406
 ] 

Andries Engelbrecht commented on DRILL-2157:


Can be marked as resolved.

Tested with following use cases and shows the correct numFiles being scanned.
Casting or string no longer needed for lower level directories.

Seems to be resolved in Drill 0.8 for direct DFS access

select count(id) from `/nfl` where dir0=2015 and dir1=01 and dir2=20 and dir3 
between 00 and 05;

select count(id) from `/nfl` where dir0=2015 and dir1=01 and dir2 between 20 
and 25 and dir3 between 00 and 05;

select count(id) from `/nfl` where dir0=2015 and dir1=01 and dir2 between 20 
and 25 and dir3 between 00 and 05;

select count(id) from `/nfl` where dir2 between 20 and 25 and dir3 between 00 
and 05;

select count(id) from `/nfl` where dir225 and dir3 between 00 and 05;

For Views also seems to report the correct number of files to be scanned with 
directory pruning

select count(id) from dfs.views.tweet_base where dir_year=2015 and dir_month=01 
and dir_day=26 and dir_hour20;

select count(id) from dfs.views.tweet_base where dir_day between 20 and 26 and 
dir_hour20;

 Directory pruning on subdirectories only and data type conversions for 
 directory filters
 

 Key: DRILL-2157
 URL: https://issues.apache.org/jira/browse/DRILL-2157
 Project: Apache Drill
  Issue Type: Improvement
  Components: Query Planning  Optimization
Affects Versions: Future
Reporter: Andries Engelbrecht
Assignee: Aman Sinha
Priority: Minor
 Fix For: 0.8.0


 Drill will scan all files and directories when using only a subdirectory as a 
 predicate. Additionally if the data type for the directory filter is not a 
 string and is converted Drill will also first scan all the subdirectories adn 
 files before applying the filter.
 My current observation is that for a directory structure as listed below,
 the pruning only works if the full tree is provided. If only a lower level
 directory is supplied in the filter condition Drill only uses it as a
 filter.
 With directory structure as below
 /2015
 /01
/10
/11
/12
/13
/14
 Query:
 select count(id) from `/foo` t where dir0='2015' and dir1='01' and
 dir2='10'
 Produces the correct pruning and query plan
 01-02Project(id=[$3]): rowcount = 3670316.0, cumulative cost =
 {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
 28434
 01-03  Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]):
 rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu,
 0.0 io, 0.0 network, 0.0 memory}, id = 28433
 01-04Scan(groupscan=[EasyGroupScan [selectionRoot=/foo,
 numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
 However:
 select count(id) from `/foo` t where dir2='10'
 Produces full scan of all sub directories and only applies a filter
 condition after the fact. Notice the numFiles between the 2, even though it
 lists columns in the base scan
 01-04Filter(condition=[=($0, '10')]): rowcount =
 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io,
 0.0 network, 0.0 memory}, id = 27470
 01-05  Project(dir2=[$1], id=[$0]): rowcount =
 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0
 io, 0.0 network, 0.0 memory}, id = 27469
 01-06Scan(groupscan=[EasyGroupScan
 [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
 Also using the wrong data type for the filter produces a full scan
 select count(id) from `/foo` where dir_year=2015 and dir_month=01 and 
 dir_day=14
 Produces
 01-04Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), 
 =(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 
 212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 
 io, 0.0 network, 0.0 memory}, id = 34910
 01-05  Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): 
 rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 
 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34909
 01-06Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, 
 numFiles=405, columns=[`id`, `dir0`, `dir1`, `dir2`],



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-2157) Directory pruning on subdirectories only and data type conversions for directory filters

2015-03-22 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375041#comment-14375041
 ] 

Aman Sinha commented on DRILL-2157:
---

Most of the issues listed here have been fixed by DRILL-2353 and are available 
in the 0.8 release.  This includes filters that have range conditions (between, 
,  etc.)  and filters on subdirectories without specifying parent directory.  
I am not completely sure about the casting issue since I could not reproduce 
the original issue.   

I am marking this as fixed for 0.8.  [~aengelbrecht] could you please try the 
0.8 release and verify ?  If there are remaining issues, pls open a new JIRA.  

 Directory pruning on subdirectories only and data type conversions for 
 directory filters
 

 Key: DRILL-2157
 URL: https://issues.apache.org/jira/browse/DRILL-2157
 Project: Apache Drill
  Issue Type: Improvement
  Components: Query Planning  Optimization
Affects Versions: Future
Reporter: Andries Engelbrecht
Assignee: Aman Sinha
Priority: Minor
 Fix For: 0.8.0


 Drill will scan all files and directories when using only a subdirectory as a 
 predicate. Additionally if the data type for the directory filter is not a 
 string and is converted Drill will also first scan all the subdirectories adn 
 files before applying the filter.
 My current observation is that for a directory structure as listed below,
 the pruning only works if the full tree is provided. If only a lower level
 directory is supplied in the filter condition Drill only uses it as a
 filter.
 With directory structure as below
 /2015
 /01
/10
/11
/12
/13
/14
 Query:
 select count(id) from `/foo` t where dir0='2015' and dir1='01' and
 dir2='10'
 Produces the correct pruning and query plan
 01-02Project(id=[$3]): rowcount = 3670316.0, cumulative cost =
 {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
 28434
 01-03  Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]):
 rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu,
 0.0 io, 0.0 network, 0.0 memory}, id = 28433
 01-04Scan(groupscan=[EasyGroupScan [selectionRoot=/foo,
 numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
 However:
 select count(id) from `/foo` t where dir2='10'
 Produces full scan of all sub directories and only applies a filter
 condition after the fact. Notice the numFiles between the 2, even though it
 lists columns in the base scan
 01-04Filter(condition=[=($0, '10')]): rowcount =
 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io,
 0.0 network, 0.0 memory}, id = 27470
 01-05  Project(dir2=[$1], id=[$0]): rowcount =
 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0
 io, 0.0 network, 0.0 memory}, id = 27469
 01-06Scan(groupscan=[EasyGroupScan
 [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
 Also using the wrong data type for the filter produces a full scan
 select count(id) from `/foo` where dir_year=2015 and dir_month=01 and 
 dir_day=14
 Produces
 01-04Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), 
 =(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 
 212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 
 io, 0.0 network, 0.0 memory}, id = 34910
 01-05  Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): 
 rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 
 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34909
 01-06Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, 
 numFiles=405, columns=[`id`, `dir0`, `dir1`, `dir2`],



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-2157) Directory pruning on subdirectories only and data type conversions for directory filters

2015-03-22 Thread Andries Engelbrecht (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375103#comment-14375103
 ] 

Andries Engelbrecht commented on DRILL-2157:


Aman I will test the conditions, casting and views with the final 0.8 release.

Will file a new JIRA if there are any specific conditions that may still cause 
issues.

 Directory pruning on subdirectories only and data type conversions for 
 directory filters
 

 Key: DRILL-2157
 URL: https://issues.apache.org/jira/browse/DRILL-2157
 Project: Apache Drill
  Issue Type: Improvement
  Components: Query Planning  Optimization
Affects Versions: Future
Reporter: Andries Engelbrecht
Assignee: Aman Sinha
Priority: Minor
 Fix For: 0.8.0


 Drill will scan all files and directories when using only a subdirectory as a 
 predicate. Additionally if the data type for the directory filter is not a 
 string and is converted Drill will also first scan all the subdirectories adn 
 files before applying the filter.
 My current observation is that for a directory structure as listed below,
 the pruning only works if the full tree is provided. If only a lower level
 directory is supplied in the filter condition Drill only uses it as a
 filter.
 With directory structure as below
 /2015
 /01
/10
/11
/12
/13
/14
 Query:
 select count(id) from `/foo` t where dir0='2015' and dir1='01' and
 dir2='10'
 Produces the correct pruning and query plan
 01-02Project(id=[$3]): rowcount = 3670316.0, cumulative cost =
 {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
 28434
 01-03  Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]):
 rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu,
 0.0 io, 0.0 network, 0.0 memory}, id = 28433
 01-04Scan(groupscan=[EasyGroupScan [selectionRoot=/foo,
 numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
 However:
 select count(id) from `/foo` t where dir2='10'
 Produces full scan of all sub directories and only applies a filter
 condition after the fact. Notice the numFiles between the 2, even though it
 lists columns in the base scan
 01-04Filter(condition=[=($0, '10')]): rowcount =
 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io,
 0.0 network, 0.0 memory}, id = 27470
 01-05  Project(dir2=[$1], id=[$0]): rowcount =
 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0
 io, 0.0 network, 0.0 memory}, id = 27469
 01-06Scan(groupscan=[EasyGroupScan
 [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
 Also using the wrong data type for the filter produces a full scan
 select count(id) from `/foo` where dir_year=2015 and dir_month=01 and 
 dir_day=14
 Produces
 01-04Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), 
 =(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 
 212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 
 io, 0.0 network, 0.0 memory}, id = 34910
 01-05  Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): 
 rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 
 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34909
 01-06Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, 
 numFiles=405, columns=[`id`, `dir0`, `dir1`, `dir2`],



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-2157) Directory pruning on subdirectories only and data type conversions for directory filters

2015-02-04 Thread Andries Engelbrecht (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305368#comment-14305368
 ] 

Andries Engelbrecht commented on DRILL-2157:


In addition the ability to perform directory pruning with  and  expressions.

Also if the ability to support directory pruning in views where there is a 
predicate filter i.e. the view below is unable to perform directory pruning 
while views without the predicate filter are.

Create or replace view maprfs.views.`retweeted` as
select CAST(t.`id` as BIGINT) as `id`, 
CAST(t.retweeted_status.`id` as BIGINT) as `retweet_id`,
t.dir0 as dir_year,
t.dir1 as dir_month,
t.dir2 as dir_day,
t.dir3 as dir_hour,
CAST(t.retweeted_status.`created_at` as VARCHAR(40)) as `created_at`,
to_date ((concat (substring(t.retweeted_status.`created_at`, 
5,6),substring(t.retweeted_status.`created_at`, 26,5))), 'MMM dd ') as 
`date`,
to_timestamp ((concat (substring(t.retweeted_status.`created_at`, 
5,6),substring(t.retweeted_status.`created_at`, 
26,5),substring(t.retweeted_status.`created_at`, 11,9))), 'MMM dd  
HH:mm:ss') as `timestamp`,
CAST(t.retweeted_status.`text` as VARCHAR(140)) as `tweet`,
CAST(t.retweeted_status.`user`.`favorites_count` as INT) as `favorites_count`
from maprfs.twitter.`/nfl` t where t.retweeted_status.`user`.`name` is not null;


 Directory pruning on subdirectories only and data type conversions for 
 directory filters
 

 Key: DRILL-2157
 URL: https://issues.apache.org/jira/browse/DRILL-2157
 Project: Apache Drill
  Issue Type: Improvement
  Components: Query Planning  Optimization
Affects Versions: Future
Reporter: Andries Engelbrecht
Assignee: Jinfeng Ni
Priority: Minor

 Drill will scan all files and directories when using only a subdirectory as a 
 predicate. Additionally if the data type for the directory filter is not a 
 string and is converted Drill will also first scan all the subdirectories adn 
 files before applying the filter.
 My current observation is that for a directory structure as listed below,
 the pruning only works if the full tree is provided. If only a lower level
 directory is supplied in the filter condition Drill only uses it as a
 filter.
 With directory structure as below
 /2015
 /01
/10
/11
/12
/13
/14
 Query:
 select count(id) from `/foo` t where dir0='2015' and dir1='01' and
 dir2='10'
 Produces the correct pruning and query plan
 01-02Project(id=[$3]): rowcount = 3670316.0, cumulative cost =
 {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
 28434
 01-03  Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]):
 rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu,
 0.0 io, 0.0 network, 0.0 memory}, id = 28433
 01-04Scan(groupscan=[EasyGroupScan [selectionRoot=/foo,
 numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
 However:
 select count(id) from `/foo` t where dir2='10'
 Produces full scan of all sub directories and only applies a filter
 condition after the fact. Notice the numFiles between the 2, even though it
 lists columns in the base scan
 01-04Filter(condition=[=($0, '10')]): rowcount =
 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io,
 0.0 network, 0.0 memory}, id = 27470
 01-05  Project(dir2=[$1], id=[$0]): rowcount =
 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0
 io, 0.0 network, 0.0 memory}, id = 27469
 01-06Scan(groupscan=[EasyGroupScan
 [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
 Also using the wrong data type for the filter produces a full scan
 select count(id) from `/foo` where dir_year=2015 and dir_month=01 and 
 dir_day=14
 Produces
 01-04Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), 
 =(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 
 212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 
 io, 0.0 network, 0.0 memory}, id = 34910
 01-05  Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): 
 rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 
 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34909
 01-06Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, 
 numFiles=405, columns=[`id`, `dir0`, `dir1`, `dir2`],



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)