[jira] [Commented] (DRILL-7038) Queries on partitioned columns scan the entire datasets
[ https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818564#comment-16818564 ] Bridget Bevens commented on DRILL-7038: --- Hi [~KazydubB], I changed the text based on your feedback. I've updated the following page with the information: https://drill.apache.org/docs/querying-directories/#querying-partitioned-directories Setting doc status to complete, but please let me know if I need to make any changes. Thanks, Bridget > Queries on partitioned columns scan the entire datasets > --- > > Key: DRILL-7038 > URL: https://issues.apache.org/jira/browse/DRILL-7038 > Project: Apache Drill > Issue Type: Improvement >Reporter: Bohdan Kazydub >Assignee: Bohdan Kazydub >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.16.0 > > > For tables with hive-style partitions like > {code} > /table/2018/Q1 > /table/2018/Q2 > /table/2019/Q1 > etc. > {code} > if any of the following queries is run: > {code} > select distinct dir0 from dfs.`/table` > {code} > {code} > select dir0 from dfs.`/table` group by dir0 > {code} > it will actually scan every single record in the table rather than just > getting a list of directories at the dir0 level. This applies even when > cached metadata is available. This is a big penalty especially as the > datasets grow. > To avoid such situations, a logical prune rule can be used to collect > partition columns (`dir0`), either from metadata cache (if available) or > group scan, and drop unnecessary files from being read. The rule will be > applied on following conditions: > 1) all queried columns are partitoin columns, and > 2) either {{DISTINCT}} or {{GROUP BY}} operations are performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7038) Queries on partitioned columns scan the entire datasets
[ https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808416#comment-16808416 ] Bohdan Kazydub commented on DRILL-7038: --- Hi, [~bbevens]. I think it's OK, but I think it is needed to specify that additionally for {{DISTINCT}} or {{GROUP BY}} operation the query has to query ({{SELECT}}) partition columns (dir0, dir1,..., dirN) only. > Queries on partitioned columns scan the entire datasets > --- > > Key: DRILL-7038 > URL: https://issues.apache.org/jira/browse/DRILL-7038 > Project: Apache Drill > Issue Type: Improvement >Reporter: Bohdan Kazydub >Assignee: Bohdan Kazydub >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.16.0 > > > For tables with hive-style partitions like > {code} > /table/2018/Q1 > /table/2018/Q2 > /table/2019/Q1 > etc. > {code} > if any of the following queries is run: > {code} > select distinct dir0 from dfs.`/table` > {code} > {code} > select dir0 from dfs.`/table` group by dir0 > {code} > it will actually scan every single record in the table rather than just > getting a list of directories at the dir0 level. This applies even when > cached metadata is available. This is a big penalty especially as the > datasets grow. > To avoid such situations, a logical prune rule can be used to collect > partition columns (`dir0`), either from metadata cache (if available) or > group scan, and drop unnecessary files from being read. The rule will be > applied on following conditions: > 1) all queried columns are partitoin columns, and > 2) either {{DISTINCT}} or {{GROUP BY}} operations are performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7038) Queries on partitioned columns scan the entire datasets
[ https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805567#comment-16805567 ] Bridget Bevens commented on DRILL-7038: --- Thanks, [~KazydubB]! Is this okay to add to the document? Starting in 1.16, Drill uses a Values operator instead of the Scan operator for DISTINCT and GROUP BY queries on tables and directories, as shown in the following examples: select distinct dir0 from `/logs`; +--+ | dir0 | +--+ | 2015 | | 2016 | | 2017 | +--+ explain plan for select distinct dir0 from `/logs`; +--+--+ | text | json | +--+--+ | 00-00Screen 00-01 Project(dir0=[$0]) 00-02StreamAgg(group=[{0}]) 00-03 Sort(sort0=[$0], dir0=[ASC]) 00-04Values(tuples=[[{ '2015' }, { '2015' }, { '2016' }, { '2015' }, { '2017' }, { '2015' }]]) select dir0 from `/logs` group by dir0; +--+ | dir0 | +--+ | 2015 | | 2016 | | 2017 | +--+ explain plan for select dir0 from `/logs` group by dir0; > +--+--+ | text | json | +--+--+ | 00-00Screen 00-01 Project(dir0=[$0]) 00-02StreamAgg(group=[{0}]) 00-03 Sort(sort0=[$0], dir0=[ASC]) 00-04Values(tuples=[[{ '2015' }, { '2015' }, { '2016' }, { '2015' }, { '2017' }, { '2015' }]]) Thanks, Bridget > Queries on partitioned columns scan the entire datasets > --- > > Key: DRILL-7038 > URL: https://issues.apache.org/jira/browse/DRILL-7038 > Project: Apache Drill > Issue Type: Improvement >Reporter: Bohdan Kazydub >Assignee: Bohdan Kazydub >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.16.0 > > > For tables with hive-style partitions like > {code} > /table/2018/Q1 > /table/2018/Q2 > /table/2019/Q1 > etc. > {code} > if any of the following queries is run: > {code} > select distinct dir0 from dfs.`/table` > {code} > {code} > select dir0 from dfs.`/table` group by dir0 > {code} > it will actually scan every single record in the table rather than just > getting a list of directories at the dir0 level. This applies even when > cached metadata is available. This is a big penalty especially as the > datasets grow. > To avoid such situations, a logical prune rule can be used to collect > partition columns (`dir0`), either from metadata cache (if available) or > group scan, and drop unnecessary files from being read. The rule will be > applied on following conditions: > 1) all queried columns are partitoin columns, and > 2) either {{DISTINCT}} or {{GROUP BY}} operations are performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7038) Queries on partitioned columns scan the entire datasets
[ https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803689#comment-16803689 ] Bohdan Kazydub commented on DRILL-7038: --- Hi, [~bbevens]. No, it's not like that. Those {{dir0}}, {{dir1}}, ... columns refer to directory levels from root directory (see [Querying Directories|https://drill.apache.org/docs/querying-directories/]). For example, if {{table1}} had following directory structure: {code} /table1/2016/Q1 /table1/2016/Q2 ... {code} and when querying {code} select distinct dir0[, dir1[,...]] from dfs.`/table1`; select dir0[, dir1[,...]] from dfs.`/table1` group by dir0; {code} {{dir0}} references first level directories from `table1` (which is root), i.e. '2016' directory, {{dir1}} references second level directories 'Q1' and 'Q2' and so on. Before, Drill was scanning all the *files* in all directories. With this optimization, file scanning is discarded and Scan operator is replaced with Values operator containing literal values, with this values being collected from directory metadata cache file (if exists) or from scan file selection. > Queries on partitioned columns scan the entire datasets > --- > > Key: DRILL-7038 > URL: https://issues.apache.org/jira/browse/DRILL-7038 > Project: Apache Drill > Issue Type: Improvement >Reporter: Bohdan Kazydub >Assignee: Bohdan Kazydub >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.16.0 > > > For tables with hive-style partitions like > {code} > /table/2018/Q1 > /table/2018/Q2 > /table/2019/Q1 > etc. > {code} > if any of the following queries is run: > {code} > select distinct dir0 from dfs.`/table` > {code} > {code} > select dir0 from dfs.`/table` group by dir0 > {code} > it will actually scan every single record in the table rather than just > getting a list of directories at the dir0 level. This applies even when > cached metadata is available. This is a big penalty especially as the > datasets grow. > To avoid such situations, a logical prune rule can be used to collect > partition columns (`dir0`), either from metadata cache (if available) or > group scan, and drop unnecessary files from being read. The rule will be > applied on following conditions: > 1) all queried columns are partitoin columns, and > 2) either {{DISTINCT}} or {{GROUP BY}} operations are performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7038) Queries on partitioned columns scan the entire datasets
[ https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803526#comment-16803526 ] Bridget Bevens commented on DRILL-7038: --- Hi [~KazydubB], I'm planning to add the following information to this page: https://drill.apache.org/docs/querying-directories/ Can you please review and let me know if the information is accurate? Starting in Drill 1.16, if a query on a file system table references partitioned columns only and also has a DISTINCT or GROUP BY operation, Drill only reads the data partition(s) referenced. For example, if a table, table1, is partitioned into the following directories: {{/table1/2016 /table1/2017 /table1/2018}} And you run either of the following queries against table1: {{select distinct dir0 from dfs.`/table`; select dir0 from dfs.`/table` group by dir0;}} Drill only scans dir0 (/table1/2016) instead of scanning all three of the directories. Thanks, Bridget > Queries on partitioned columns scan the entire datasets > --- > > Key: DRILL-7038 > URL: https://issues.apache.org/jira/browse/DRILL-7038 > Project: Apache Drill > Issue Type: Improvement >Reporter: Bohdan Kazydub >Assignee: Bohdan Kazydub >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.16.0 > > > For tables with hive-style partitions like > {code} > /table/2018/Q1 > /table/2018/Q2 > /table/2019/Q1 > etc. > {code} > if any of the following queries is run: > {code} > select distinct dir0 from dfs.`/table` > {code} > {code} > select dir0 from dfs.`/table` group by dir0 > {code} > it will actually scan every single record in the table rather than just > getting a list of directories at the dir0 level. This applies even when > cached metadata is available. This is a big penalty especially as the > datasets grow. > To avoid such situations, a logical prune rule can be used to collect > partition columns (`dir0`), either from metadata cache (if available) or > group scan, and drop unnecessary files from being read. The rule will be > applied on following conditions: > 1) all queried columns are partitoin columns, and > 2) either {{DISTINCT}} or {{GROUP BY}} operations are performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)