[jira] [Commented] (DRILL-7038) Queries on partitioned columns scan the entire datasets

Bohdan Kazydub (JIRA) Thu, 28 Mar 2019 01:28:21 -0700


    [ 
https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803689#comment-16803689
 ]


Bohdan Kazydub commented on DRILL-7038:
---------------------------------------

Hi, [~bbevens]. No, it's not like that. The {{dir0}}, {{dir1}}, ... columns 
refer to directory levels below the root directory (see [Querying 
Directories|https://drill.apache.org/docs/querying-directories/]). For 
example, if {{table1}} had the following directory structure:
{code}
/table1/2016/Q1
/table1/2016/Q2
...
{code}
and you ran one of the following queries:
{code}
select distinct dir0[, dir1[,...]] from dfs.`/table1`;
select dir0[, dir1[,...]] from dfs.`/table1` group by dir0[, dir1[,...]];
{code}
then {{dir0}} references the first-level directories under {{table1}} (which 
is the root), i.e. the '2016' directory, {{dir1}} references the second-level 
directories 'Q1' and 'Q2', and so on.
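
To make the mapping concrete, here is a minimal Python sketch of how directory levels correspond to {{dirN}} columns. It is only an illustration, not Drill's implementation; the {{dir_columns}} helper and the {{data.parquet}} file name are made up for the example.

```python
from pathlib import PurePosixPath


def dir_columns(root: str, file_path: str) -> dict:
    """Map each directory level below `root` to dir0, dir1, ...

    Hypothetical helper for illustration only; Drill derives these
    implicit columns internally.
    """
    relative = PurePosixPath(file_path).relative_to(root)
    # Every path component except the last one (the file name)
    # is a directory level below the root.
    levels = relative.parts[:-1]
    return {f"dir{i}": name for i, name in enumerate(levels)}


# The file name below is an assumption; only the directories matter.
print(dir_columns("/table1", "/table1/2016/Q1/data.parquet"))
```

For a file under {{/table1/2016/Q1}}, the sketch maps {{dir0}} to '2016' and {{dir1}} to 'Q1'.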

Previously, Drill scanned all the *files* in all directories. With this 
optimization, the file scan is skipped and the Scan operator is replaced with 
a Values operator containing literal values; these values are collected 
from the directory metadata cache file (if it exists) or from the scan's file 
selection.
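
As a rough illustration of the idea (not the actual planner rule), the following Python sketch derives the distinct partition values from a list of file paths alone, without opening any file. The {{partition_values}} function, the sample paths, and the use of {{None}} for a missing directory level are all assumptions made for this example.

```python
from pathlib import PurePosixPath


def partition_values(root, selection, columns):
    """Return the distinct rows of dirN values found in `selection`.

    `selection` is a list of file paths, standing in for a file
    selection; `columns` is a list such as ["dir0", "dir1"].
    Hypothetical sketch: only the paths are inspected, no file is read.
    """
    rows = set()
    for path in selection:
        # Directory levels below the root, excluding the file name.
        levels = PurePosixPath(path).relative_to(root).parts[:-1]
        row = []
        for column in columns:
            index = int(column[len("dir"):])
            # Assumption: a level that does not exist yields None.
            row.append(levels[index] if index < len(levels) else None)
        rows.add(tuple(row))
    # Sort for a stable result, treating None as an empty string.
    return sorted(rows, key=lambda r: tuple("" if v is None else v for v in r))


selection = [
    "/table1/2016/Q1/a.parquet",
    "/table1/2016/Q1/b.parquet",
    "/table1/2016/Q2/c.parquet",
]
print(partition_values("/table1", selection, ["dir0"]))
print(partition_values("/table1", selection, ["dir0", "dir1"]))
```

The resulting rows play the role of the literal values held by the Values operator: three files collapse to a single row for {{dir0}} and to two rows for {{dir0}}, {{dir1}}.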

> Queries on partitioned columns scan the entire datasets
> -------------------------------------------------------
>
>                 Key: DRILL-7038
>                 URL: https://issues.apache.org/jira/browse/DRILL-7038
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Bohdan Kazydub
>            Assignee: Bohdan Kazydub
>            Priority: Major
>              Labels: doc-impacting, ready-to-commit
>             Fix For: 1.16.0
>
>
> For tables with hive-style partitions like
> {code}
> /table/2018/Q1
> /table/2018/Q2
> /table/2019/Q1
> etc.
> {code}
> if any of the following queries is run:
> {code}
> select distinct dir0 from dfs.`/table`
> {code}
> {code}
> select dir0 from dfs.`/table` group by dir0
> {code}
> it will actually scan every single record in the table rather than just 
> getting a list of directories at the dir0 level. This applies even when 
> cached metadata is available. This is a big penalty especially as the 
> datasets grow.
> To avoid such situations, a logical prune rule can be used to collect 
> partition column values ({{dir0}}), either from the metadata cache (if 
> available) or from the group scan, and avoid reading unnecessary files. 
> The rule will be applied under the following conditions:
> 1) all queried columns are partition columns, and
> 2) either a {{DISTINCT}} or a {{GROUP BY}} operation is performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
