[jira] [Commented] (DRILL-7038) Queries on partitioned columns scan the entire datasets

2019-04-15 Thread Bridget Bevens (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818564#comment-16818564
 ] 

Bridget Bevens commented on DRILL-7038:
---

Hi [~KazydubB],

I changed the text based on your feedback. I've updated the following page with 
the information: 
https://drill.apache.org/docs/querying-directories/#querying-partitioned-directories
 

Setting doc status to complete, but please let me know if I need to make any 
changes.

Thanks,
Bridget

> Queries on partitioned columns scan the entire datasets
> ---
>
> Key: DRILL-7038
> URL: https://issues.apache.org/jira/browse/DRILL-7038
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.16.0
>
>
> For tables with hive-style partitions like
> {code}
> /table/2018/Q1
> /table/2018/Q2
> /table/2019/Q1
> etc.
> {code}
> if any of the following queries is run:
> {code}
> select distinct dir0 from dfs.`/table`
> {code}
> {code}
> select dir0 from dfs.`/table` group by dir0
> {code}
> it will actually scan every single record in the table rather than just 
> getting a list of directories at the dir0 level. This applies even when 
> cached metadata is available. This is a big penalty especially as the 
> datasets grow.
> To avoid such situations, a logical prune rule can be used to collect 
> partition columns (`dir0`), either from metadata cache (if available) or 
> group scan, and drop unnecessary files from being read. The rule will be 
> applied on following conditions:
> 1) all queried columns are partitoin columns, and
> 2) either {{DISTINCT}} or {{GROUP BY}} operations are performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7038) Queries on partitioned columns scan the entire datasets

2019-04-02 Thread Bohdan Kazydub (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808416#comment-16808416
 ] 

Bohdan Kazydub commented on DRILL-7038:
---

Hi, [~bbevens]. I think it's OK, but I think it is needed to specify that 
additionally for {{DISTINCT}} or {{GROUP BY}} operation the query has to query 
({{SELECT}}) partition columns (dir0, dir1,..., dirN) only.

> Queries on partitioned columns scan the entire datasets
> ---
>
> Key: DRILL-7038
> URL: https://issues.apache.org/jira/browse/DRILL-7038
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.16.0
>
>
> For tables with hive-style partitions like
> {code}
> /table/2018/Q1
> /table/2018/Q2
> /table/2019/Q1
> etc.
> {code}
> if any of the following queries is run:
> {code}
> select distinct dir0 from dfs.`/table`
> {code}
> {code}
> select dir0 from dfs.`/table` group by dir0
> {code}
> it will actually scan every single record in the table rather than just 
> getting a list of directories at the dir0 level. This applies even when 
> cached metadata is available. This is a big penalty especially as the 
> datasets grow.
> To avoid such situations, a logical prune rule can be used to collect 
> partition columns (`dir0`), either from metadata cache (if available) or 
> group scan, and drop unnecessary files from being read. The rule will be 
> applied on following conditions:
> 1) all queried columns are partitoin columns, and
> 2) either {{DISTINCT}} or {{GROUP BY}} operations are performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7038) Queries on partitioned columns scan the entire datasets

2019-03-29 Thread Bridget Bevens (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805567#comment-16805567
 ] 

Bridget Bevens commented on DRILL-7038:
---

Thanks, [~KazydubB]!

Is this okay to add to the document?

Starting in 1.16, Drill uses a Values operator instead of the Scan operator for 
DISTINCT and GROUP BY queries on tables and directories, as shown in the 
following examples:

select distinct dir0 from `/logs`;
+--+
| dir0 |
+--+
| 2015 |
| 2016 |
| 2017 |
+--+


explain plan for select distinct dir0 from `/logs`;
+--+--+
|   text
   |   json 
  |
+--+--+
| 00-00Screen
00-01  Project(dir0=[$0])
00-02StreamAgg(group=[{0}])
00-03  Sort(sort0=[$0], dir0=[ASC])
00-04Values(tuples=[[{ '2015' }, { '2015' }, { '2016' }, { '2015' 
}, { '2017' }, { '2015' }]])



select dir0 from `/logs` group by dir0;
+--+
| dir0 |
+--+
| 2015 |
| 2016 |
| 2017 |
+--+

explain plan for select dir0 from `/logs` group by dir0;
>
+--+--+
|   text
   |   json 
  |
+--+--+
| 00-00Screen
00-01  Project(dir0=[$0])
00-02StreamAgg(group=[{0}])
00-03  Sort(sort0=[$0], dir0=[ASC])
00-04Values(tuples=[[{ '2015' }, { '2015' }, { '2016' }, { '2015' 
}, { '2017' }, { '2015' }]])

Thanks,
Bridget

> Queries on partitioned columns scan the entire datasets
> ---
>
> Key: DRILL-7038
> URL: https://issues.apache.org/jira/browse/DRILL-7038
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.16.0
>
>
> For tables with hive-style partitions like
> {code}
> /table/2018/Q1
> /table/2018/Q2
> /table/2019/Q1
> etc.
> {code}
> if any of the following queries is run:
> {code}
> select distinct dir0 from dfs.`/table`
> {code}
> {code}
> select dir0 from dfs.`/table` group by dir0
> {code}
> it will actually scan every single record in the table rather than just 
> getting a list of directories at the dir0 level. This applies even when 
> cached metadata is available. This is a big penalty especially as the 
> datasets grow.
> To avoid such situations, a logical prune rule can be used to collect 
> partition columns (`dir0`), either from metadata cache (if available) or 
> group scan, and drop unnecessary files from being read. The rule will be 
> applied on following conditions:
> 1) all queried columns are partitoin columns, and
> 2) either {{DISTINCT}} or {{GROUP BY}} operations are performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7038) Queries on partitioned columns scan the entire datasets

2019-03-28 Thread Bohdan Kazydub (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803689#comment-16803689
 ] 

Bohdan Kazydub commented on DRILL-7038:
---

Hi, [~bbevens]. No, it's not like that. Those {{dir0}}, {{dir1}}, ... columns 
refer to directory levels from root directory (see [Querying 
Directories|https://drill.apache.org/docs/querying-directories/]).  For 
example, if {{table1}} had following directory structure:
{code}
/table1/2016/Q1
/table1/2016/Q2
...
{code}
and when querying
{code}
select distinct dir0[, dir1[,...]] from dfs.`/table1`;
select dir0[, dir1[,...]] from dfs.`/table1` group by dir0;
{code}
{{dir0}} references first level directories from `table1` (which is root), i.e. 
'2016' directory, {{dir1}} references second level directories 'Q1' and 'Q2' 
and so on.

Before, Drill was scanning all the *files* in all directories. With this 
optimization, file scanning is discarded and Scan operator is replaced with 
Values operator containing literal values, with this values being collected 
from directory metadata cache file (if exists) or from scan file selection.

> Queries on partitioned columns scan the entire datasets
> ---
>
> Key: DRILL-7038
> URL: https://issues.apache.org/jira/browse/DRILL-7038
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.16.0
>
>
> For tables with hive-style partitions like
> {code}
> /table/2018/Q1
> /table/2018/Q2
> /table/2019/Q1
> etc.
> {code}
> if any of the following queries is run:
> {code}
> select distinct dir0 from dfs.`/table`
> {code}
> {code}
> select dir0 from dfs.`/table` group by dir0
> {code}
> it will actually scan every single record in the table rather than just 
> getting a list of directories at the dir0 level. This applies even when 
> cached metadata is available. This is a big penalty especially as the 
> datasets grow.
> To avoid such situations, a logical prune rule can be used to collect 
> partition columns (`dir0`), either from metadata cache (if available) or 
> group scan, and drop unnecessary files from being read. The rule will be 
> applied on following conditions:
> 1) all queried columns are partitoin columns, and
> 2) either {{DISTINCT}} or {{GROUP BY}} operations are performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7038) Queries on partitioned columns scan the entire datasets

2019-03-27 Thread Bridget Bevens (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803526#comment-16803526
 ] 

Bridget Bevens commented on DRILL-7038:
---

Hi [~KazydubB],

I'm planning to add the following information to this page: 
https://drill.apache.org/docs/querying-directories/ 
Can you please review and let me know if the information is accurate?

Starting in Drill 1.16, if a query on a file system table references 
partitioned columns only and also has a DISTINCT or GROUP BY operation, Drill 
only reads the data partition(s) referenced. For example, if a table, table1, 
is partitioned into the following directories:
{{/table1/2016
/table1/2017
/table1/2018}}

And you run either of the following queries against table1:

{{select distinct dir0 from dfs.`/table`;
select dir0 from dfs.`/table` group by dir0;}}

Drill only scans dir0 (/table1/2016) instead of scanning all three of the 
directories.  

Thanks,
Bridget


> Queries on partitioned columns scan the entire datasets
> ---
>
> Key: DRILL-7038
> URL: https://issues.apache.org/jira/browse/DRILL-7038
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.16.0
>
>
> For tables with hive-style partitions like
> {code}
> /table/2018/Q1
> /table/2018/Q2
> /table/2019/Q1
> etc.
> {code}
> if any of the following queries is run:
> {code}
> select distinct dir0 from dfs.`/table`
> {code}
> {code}
> select dir0 from dfs.`/table` group by dir0
> {code}
> it will actually scan every single record in the table rather than just 
> getting a list of directories at the dir0 level. This applies even when 
> cached metadata is available. This is a big penalty especially as the 
> datasets grow.
> To avoid such situations, a logical prune rule can be used to collect 
> partition columns (`dir0`), either from metadata cache (if available) or 
> group scan, and drop unnecessary files from being read. The rule will be 
> applied on following conditions:
> 1) all queried columns are partitoin columns, and
> 2) either {{DISTINCT}} or {{GROUP BY}} operations are performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)