[ https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808416#comment-16808416 ]
Bohdan Kazydub commented on DRILL-7038:
---------------------------------------

Hi, [~bbevens]. I think it's OK, but it should also specify that, for a {{DISTINCT}} or {{GROUP BY}} operation, the query has to select ({{SELECT}}) partition columns (dir0, dir1, ..., dirN) only.

> Queries on partitioned columns scan the entire datasets
> -------------------------------------------------------
>
>                 Key: DRILL-7038
>                 URL: https://issues.apache.org/jira/browse/DRILL-7038
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Bohdan Kazydub
>            Assignee: Bohdan Kazydub
>            Priority: Major
>              Labels: doc-impacting, ready-to-commit
>             Fix For: 1.16.0
>
>
> For tables with hive-style partitions like
> {code}
> /table/2018/Q1
> /table/2018/Q2
> /table/2019/Q1
> etc.
> {code}
> if either of the following queries is run:
> {code}
> select distinct dir0 from dfs.`/table`
> {code}
> {code}
> select dir0 from dfs.`/table` group by dir0
> {code}
> it will actually scan every single record in the table rather than just
> getting a list of directories at the dir0 level. This applies even when
> cached metadata is available. This is a big penalty, especially as the
> datasets grow.
> To avoid such situations, a logical prune rule can be used to collect
> partition column values (`dir0`), either from the metadata cache (if
> available) or from the group scan, and to drop unnecessary files from being
> read. The rule will be applied under the following conditions:
> 1) all queried columns are partition columns, and
> 2) either a {{DISTINCT}} or a {{GROUP BY}} operation is performed.
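
To illustrate the idea behind the two conditions, here is a minimal Python sketch. It is not Drill's implementation: the function names, the `dirN` name check, and the `.parquet` file names are all hypothetical. It only shows that distinct `dir0` values can be derived from file paths alone, without reading any records, and that the shortcut is valid only when every selected column is a partition column and a DISTINCT or GROUP BY is present.

```python
import re
from pathlib import PurePosixPath

# Hypothetical file listing for the table layout from the issue description.
example_files = [
    "/table/2018/Q1/a.parquet",
    "/table/2018/Q2/b.parquet",
    "/table/2019/Q1/c.parquet",
]


def is_partition_column(name):
    """Assumption: partition columns are named dir0, dir1, ..., dirN."""
    return re.fullmatch(r"dir\d+", name) is not None


def rule_applies(selected_columns, has_distinct_or_group_by):
    """Both conditions from the issue description must hold."""
    only_partition_columns = bool(selected_columns) and all(
        is_partition_column(c) for c in selected_columns
    )
    return only_partition_columns and has_distinct_or_group_by


def partition_values(table_root, file_paths, level=0):
    """Collect distinct dir<level> values from paths, without opening files."""
    root = PurePosixPath(table_root)
    values = set()
    for path in file_paths:
        parts = PurePosixPath(path).relative_to(root).parts
        # The last part is the file name; only the parts before it are directories.
        if level < len(parts) - 1:
            values.add(parts[level])
    return sorted(values)


if rule_applies(["dir0"], has_distinct_or_group_by=True):
    print(partition_values("/table", example_files, level=0))
```

A query that also selects a regular column (for example a hypothetical `amount`) would fail the first check and would still need a normal scan.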