[ https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anton Gozhiy closed DRILL-7038.
-------------------------------

Verified with Drill version 1.16.0-SNAPSHOT (commit 494c2060a385408f27185949e6899a9017b6b7ff)
Cases tested:
# Group by
# Distinct
# Order by
# Presence of files alongside the partition folders
# Different file formats
# Negative cases (aggregate functions, HAVING clause, LIMIT, etc.)

> Queries on partitioned columns scan the entire datasets
> -------------------------------------------------------
>
>                 Key: DRILL-7038
>                 URL: https://issues.apache.org/jira/browse/DRILL-7038
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Bohdan Kazydub
>            Assignee: Bohdan Kazydub
>            Priority: Major
>              Labels: doc-complete, ready-to-commit
>             Fix For: 1.16.0
>
>
> For tables with hive-style partitions like
> {code}
> /table/2018/Q1
> /table/2018/Q2
> /table/2019/Q1
> etc.
> {code}
> if any of the following queries is run:
> {code}
> select distinct dir0 from dfs.`/table`
> {code}
> {code}
> select dir0 from dfs.`/table` group by dir0
> {code}
> it will actually scan every single record in the table rather than just
> getting a list of directories at the dir0 level. This applies even when
> cached metadata is available. This is a big penalty, especially as the
> datasets grow.
> To avoid such situations, a logical prune rule can be used to collect
> partition columns (`dir0`), either from the metadata cache (if available) or
> from the group scan, and to drop unnecessary files from being read. The rule
> will be applied under the following conditions:
> 1) all queried columns are partition columns, and
> 2) either {{DISTINCT}} or {{GROUP BY}} operations are performed.
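
The idea behind the prune rule can be illustrated outside Drill. The following is a minimal Python sketch, not Drill's actual implementation: it assumes the partition values can be derived from file paths alone, and the helper names (`partition_values`, `rule_applies`) and the file names are hypothetical. It shows that `dir0` values for the example layout can be listed without reading any records, and it encodes the two applicability conditions from the issue.

```python
from pathlib import PurePosixPath


def partition_values(table_root, file_paths, level=0):
    """Collect distinct dirN values from file paths alone, without opening any file."""
    root = PurePosixPath(table_root)
    values = set()
    for path in file_paths:
        parts = PurePosixPath(path).relative_to(root).parts
        # The last component is the file name; the ones before it are directory levels.
        dirs = parts[:-1]
        # Files that sit above the requested level are skipped in this sketch.
        if level < len(dirs):
            values.add(dirs[level])
    return sorted(values)


def rule_applies(columns, is_distinct, has_group_by):
    """Check the two conditions from the issue description.

    1) every queried column is a partition column (dir0, dir1, ...), and
    2) the query uses DISTINCT or GROUP BY.
    """
    only_partition_columns = all(
        c.startswith("dir") and c[3:].isdigit() for c in columns
    )
    return bool(columns) and only_partition_columns and (is_distinct or has_group_by)


# Hypothetical file names under the layout shown in the issue.
files = [
    "/table/2018/Q1/a.parquet",
    "/table/2018/Q2/b.parquet",
    "/table/2019/Q1/c.parquet",
]

if rule_applies(["dir0"], is_distinct=True, has_group_by=False):
    print(partition_values("/table", files, level=0))
```

In this sketch the cost depends on the number of files rather than the number of records, which is the saving the issue is after. How Drill represents files that sit directly in the table root is not covered here.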