[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259189#comment-15259189 ]

Dechang Gu commented on DRILL-4589:
---

Verified on the UCS cluster with the test case. Comparing runs with and without the fix, we see about a 2x improvement in explain plan time on average, and some queries get a 5x improvement:
https://docs.google.com/spreadsheets/d/1dh7w8yvQ4fHTt0Bcb9xROqeXU80REjBoPZNSyff49Bg/edit#gid=0

UCS 10-node cluster, MapR 4.0 (1 master, 10 data nodes) - 32 cores, 256G RAM, 10 disks, 208K parquet files (each 12KB) in a 3-level directory structure.

Without 4589: Apache Drill 1.7.0 master, GitId 9514cbe. With 4589: Apache Drill 1.7.0 master, GitId 9f4fff8.
All times are Exec Time (ms); "Avg" is Avg Query Time. The two ratio columns are wo4589/with4589 (avg) and wo4589/with4589 (best).

                     |        Without 4589        |         With 4589          | with 4589 vs w/o 4589
Query                | Run 1  Run 2  Run 3    Avg | Run 1  Run 2  Run 3    Avg | diff in avg  avg ratio  best ratio
DRILL4589_EXPPLAN_01 | 21478  15491  14784  17251 |  8634   8431   9673   8913 |        8338       1.94        1.75
DRILL4589_EXPPLAN_02 | 19168  15560  15168  16632 | 10391  10665   8343   9800 |        6832       1.70        1.82
DRILL4589_EXPPLAN_03 | 15478  13606  14506  14530 |  9323   8412   9520   9085 |        5445       1.60        1.62
DRILL4589_EXPPLAN_04 | 18792  15311  14197  16100 |  8562   8525   7720   8269 |        7831       1.95        1.84
DRILL4589_EXPPLAN_05 | 18447  14852  14692  15997 |  9333   8600   7874   8602 |        7395       1.86        1.87
DRILL4589_EXPPLAN_06 | 18249  14619  15113  15994 |  9440   8133   9474   9016 |        6978       1.77        1.80
DRILL4589_EXPPLAN_07 | 17213  15377  14132  15574 |  8196   7850   8066   8037 |        7537       1.94        1.80
DRILL4589_EXPPLAN_08 | 15884  13808  16767  15486 |  8805   8212   7978   8332 |        7155       1.86        1.73
DRILL4589_EXPPLAN_09 | 14810  15947  14151  14969 |  8612   8471   8847   8643 |        6326       1.73        1.67
DRILL4589_EXPPLAN_10 | 15995  15373  16091  15820 |  9541   8879   8203   8874 |        6945       1.78        1.87
DRILL4589_EXPPLAN_11 | 18722  18239  18828  18596 |  9677   8040   7883   8533 |       10063       2.18        2.31
DRILL4589_EXPPLAN_12 | 16725  16246  16772  16581 |  8442   7888   8285   8205 |        8376       2.02        2.06
DRILL4589_EXPPLAN_13 | 17063  13647  15686  15465 |  9284   8050   9015   8783 |        6682       1.76        1.70
DRILL4589_EXPPLAN_14 | 14831  15107  14873  14937 |  8954   9336   8944   9078 |        5859       1.65        1.66
DRILL4589_EXPPLAN_15 | 15170  15548  15166  15295 |  8897   8739   8891   8842 |        6452       1.73        1.74
DRILL4589_EXPPLAN_16 | 44969  41579  41880  42809 | 10238   9777   9029   9681 |       33128       4.42        4.61
DRILL4589_EXPPLAN_17 | 43389
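The two ratio columns in the timing table above can be recomputed from the per-run numbers. The sketch below assumes that "avg" is the ratio of the mean run times and "best" is the ratio of the fastest runs; that reading is consistent with the first row (DRILL4589_EXPPLAN_01), whose timings are used here. The class and method names are illustrative and not part of Drill.

```java
import java.util.Arrays;

public class SpeedupExample {
    // Mean of the run times, in ms.
    public static double average(long[] runsMs) {
        return Arrays.stream(runsMs).average().getAsDouble();
    }

    // Fastest (smallest) run time, in ms.
    public static long best(long[] runsMs) {
        return Arrays.stream(runsMs).min().getAsLong();
    }

    // wo4589/with4589 (avg): ratio of the mean run times.
    public static double ratioOfAverages(long[] without, long[] with) {
        return average(without) / average(with);
    }

    // wo4589/with4589 (best): ratio of the fastest runs.
    public static double ratioOfBest(long[] without, long[] with) {
        return (double) best(without) / best(with);
    }

    public static void main(String[] args) {
        // Run 1..3 for DRILL4589_EXPPLAN_01, taken from the table.
        long[] without = {21478, 15491, 14784};
        long[] with = {8634, 8431, 9673};
        System.out.printf("avg ratio:  %.2f%n", ratioOfAverages(without, with));
        System.out.printf("best ratio: %.2f%n", ratioOfBest(without, with));
    }
}
```

Comparing the fastest runs as well as the means gives a rough check that a single slow first run (for example a cold cache) is not responsible for the reported improvement.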
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241661#comment-15241661 ]

Khurram Faraaz commented on DRILL-4589:
---

The following tests will be executed to verify this change.

{noformat}
There are 25 directories (1990 THROUGH 2015), each directory has 4 subdirectories (Q1, Q2, Q3 and Q4),
and each of those subdirectories has 2000 parquet files (each ~2KB in size).

REFRESH TABLE METADATA `DRILL_4589` will be executed over the root directory, and tests similar to those listed below (and more) will be executed.

explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 IS NOT NULL;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 IS NULL;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 25 AND c1 <= 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 53;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 <= 97;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 25 AND c1 < 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 > 25 AND c1 <= 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 > 25 AND c1 < 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 LIKE 'orb%';
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 LIKE 'orb%' AND c7 = '1958-04-24';
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 IN (...)
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND LENGTH(c5) >= 1 AND LENGTH(c5) <= 172;
{noformat}

> Reduce planning time for file system partition pruning by reducing filter evaluation overhead
> ---
>
>                 Key: DRILL-4589
>                 URL: https://issues.apache.org/jira/browse/DRILL-4589
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> When Drill is used to query hundreds of thousands, or even millions, of files
> organized into multi-level directories, users typically provide a
> partition filter like: dir0 = something and dir1 = something2 and ...
> For such queries, we saw that the query planning time could be unacceptably long,
> due to three main overheads: 1) expanding and getting the list of files, 2)
> evaluating the partition filter, 3) getting the metadata, in the case of parquet
> files for which a metadata cache file is not available.
> DRILL-2517 targets the third overhead. As follow-up work to
> DRILL-2517, we plan to reduce the filter evaluation overhead. For now, the
> partition filter evaluation is applied at the file level. In many cases, we saw
> that the number of leaf subdirectories is significantly lower than the number of
> files. Since all the files under the same leaf subdirectory share the same
> directory metadata, we should apply the filter evaluation at the leaf
> subdirectory level. By doing that, we could reduce the CPU overhead of evaluating the
> filter, and the memory overhead as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
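The leaf-directory approach described in the issue text above can be sketched as follows. This is a minimal illustration, not the code from the patch: the class name, the method names, and the use of a `Predicate` over directory names are assumptions made for the example. The idea is to group files by their parent directory, evaluate the `dir0`/`dir1` filter once per directory, and keep or drop all files of that directory together. With the test layout described above (25 x 4 = 100 leaf directories of 2000 files each), the filter would then be evaluated about 100 times rather than 200,000 times.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class DirLevelPruningSketch {

    // Groups file paths (relative to the table root) by their parent directory.
    public static Map<String, List<String>> groupByLeafDir(List<String> files) {
        Map<String, List<String>> dirToFiles = new LinkedHashMap<>();
        for (String file : files) {
            int slash = file.lastIndexOf('/');
            String dir = slash < 0 ? "" : file.substring(0, slash);
            dirToFiles.computeIfAbsent(dir, k -> new ArrayList<>()).add(file);
        }
        return dirToFiles;
    }

    // Evaluates the partition filter once per leaf directory, then keeps or
    // drops all files under that directory together.
    public static List<String> prune(List<String> files, Predicate<String[]> dirFilter) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : groupByLeafDir(files).entrySet()) {
            String[] dirs = entry.getKey().split("/"); // dirs[0] = dir0, dirs[1] = dir1, ...
            if (dirFilter.test(dirs)) {
                selected.addAll(entry.getValue());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "2000/Q1/a.parquet", "2000/Q2/a.parquet",
            "2000/Q2/b.parquet", "2001/Q2/a.parquet");
        // Equivalent of: WHERE dir0='2000' AND dir1='Q2'
        Predicate<String[]> filter =
            dirs -> dirs.length >= 2 && dirs[0].equals("2000") && dirs[1].equals("Q2");
        System.out.println(prune(files, filter));
    }
}
```

In this toy input the filter runs three times (once per leaf directory) for four files; the saving grows with the number of files per directory.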
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238311#comment-15238311 ]

ASF GitHub Bot commented on DRILL-4589:
---

Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/468
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236181#comment-15236181 ]

Jinfeng Ni commented on DRILL-4589:
---

Copied comment from pull request.

This patch does not reduce the cost of filter evaluation per row. Rather, it reduces the number of rows on which the filter evaluation is performed.
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233843#comment-15233843 ]

ASF GitHub Bot commented on DRILL-4589:
---

Github user hsuanyi commented on the pull request:

    https://github.com/apache/drill/pull/468#issuecomment-207904379

    LGTM +1
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231304#comment-15231304 ]

ASF GitHub Bot commented on DRILL-4589:
---

Github user amansinha100 commented on the pull request:

    https://github.com/apache/drill/pull/468#issuecomment-207132298

    LGTM. +1. Since this fix is not directly changing the cost of per-row filter evaluation itself (it is reducing the number of rows on which filter evaluation is performed), you might want to clarify that in the commit message or the JIRA.
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231233#comment-15231233 ]

ASF GitHub Bot commented on DRILL-4589:
---

Github user jinfengni commented on the pull request:

    https://github.com/apache/drill/pull/468#issuecomment-207115616

    @amansinha100 and @hsuanyi, I revised the PR based on your comments. Can you take another look? Thanks.
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231195#comment-15231195 ]

ASF GitHub Bot commented on DRILL-4589:
---

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/468#discussion_r58953148

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSDirPartitionLocation.java ---
    @@ -0,0 +1,70 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +/**
    + * Class defines a single partition corresponding to a directory in a DFS table.
    + */
    +package org.apache.drill.exec.planner;
    +
    +
    +import com.google.common.collect.Lists;
    +
    +import java.util.Collection;
    +import java.util.List;
    +
    +public class DFSDirPartitionLocation implements PartitionLocation {
    +  private final Collection subPartitions;
    +  private final String[] dirs;
    +
    +  public DFSDirPartitionLocation(String[] dirs, Collection subPartitions) {
    +    this.subPartitions = subPartitions;
    +    this.dirs = dirs;
    +  }
    +
    +  @Override
    +  public String getPartitionValue(int index) {
    +    assert index < dirs.length;
    +    return dirs[index];
    +  }
    +
    +  @Override
    +  public String getEntirePartitionLocation() {
    +    throw new UnsupportedOperationException("Should not call getEntirePartitionLocation for composite partition location!");
    +  }
    +
    +  @Override
    +  public List getPartitionLocationRecursive() {
    --- End diff --

    I changed this method so that it now returns a list of SimplePartitionLocation. The method returns all the SimplePartitionLocations that the directory consists of. In your example, it would return 4 DFSFilePartitionLocations if it is called at the DFSDirPartitionLocation corresponding to '2016'.

    This method is used when we construct a GroupScan after pruning, since only SimplePartitionLocation keeps track of the entire path, which is required by a GroupScan specification.

    The file partition location keeps track of the full path (which is used when creating the GroupScan) and the partition keys. The dir partition location keeps track of the nested partitions and the common partition keys.
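The recursive flattening described in the review reply above can be sketched as a small composite structure. This is an illustrative reconstruction, not the code from the pull request: the type names `Location`, `FileLocation` and `DirLocation`, the `/tbl` table root, and the `example2016` helper are hypothetical, and the generic types are assumptions because the quoted diff shows raw types. The example mirrors the reviewer's directory layout (2016/Q1/Jan and 2016/Q1/Feb with two files each), for which the call at the '2016' level yields four file locations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class PartitionLocationSketch {

    public interface Location {
        String getPartitionValue(int index);
        String getEntirePartitionLocation();
        List<FileLocation> getPartitionLocationRecursive();
    }

    // Leaf: a single file, with its partition keys and its full path.
    public static final class FileLocation implements Location {
        private final String[] dirs;
        private final String fullPath;

        public FileLocation(String[] dirs, String fullPath) {
            this.dirs = dirs;
            this.fullPath = fullPath;
        }

        @Override public String getPartitionValue(int index) { return dirs[index]; }
        @Override public String getEntirePartitionLocation() { return fullPath; }
        @Override public List<FileLocation> getPartitionLocationRecursive() {
            return Collections.singletonList(this);
        }
    }

    // Composite: a directory holding nested directories and/or files,
    // plus the partition keys common to everything below it.
    public static final class DirLocation implements Location {
        private final String[] dirs;
        private final Collection<Location> subPartitions;

        public DirLocation(String[] dirs, Collection<Location> subPartitions) {
            this.dirs = dirs;
            this.subPartitions = subPartitions;
        }

        @Override public String getPartitionValue(int index) { return dirs[index]; }
        @Override public String getEntirePartitionLocation() {
            throw new UnsupportedOperationException("composite location has no single path");
        }
        @Override public List<FileLocation> getPartitionLocationRecursive() {
            List<FileLocation> result = new ArrayList<>();
            for (Location sub : subPartitions) {
                result.addAll(sub.getPartitionLocationRecursive());
            }
            return result;
        }
    }

    // Builds 2016/Q1/{Jan,Feb}/{1,2}.parquet under a hypothetical root /tbl.
    public static DirLocation example2016() {
        String[] jan = {"2016", "Q1", "Jan"};
        String[] feb = {"2016", "Q1", "Feb"};
        DirLocation janDir = new DirLocation(jan, Arrays.<Location>asList(
            new FileLocation(jan, "/tbl/2016/Q1/Jan/1.parquet"),
            new FileLocation(jan, "/tbl/2016/Q1/Jan/2.parquet")));
        DirLocation febDir = new DirLocation(feb, Arrays.<Location>asList(
            new FileLocation(feb, "/tbl/2016/Q1/Feb/1.parquet"),
            new FileLocation(feb, "/tbl/2016/Q1/Feb/2.parquet")));
        DirLocation q1 = new DirLocation(new String[] {"2016", "Q1"},
            Arrays.<Location>asList(janDir, febDir));
        return new DirLocation(new String[] {"2016"}, Arrays.<Location>asList(q1));
    }

    public static void main(String[] args) {
        for (FileLocation file : example2016().getPartitionLocationRecursive()) {
            System.out.println(file.getEntirePartitionLocation());
        }
    }
}
```

Keeping the full path only on the leaf type means a directory node stays small regardless of how many files it covers, while the scan can still be rebuilt from the flattened leaves after pruning.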
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231114#comment-15231114 ]

ASF GitHub Bot commented on DRILL-4589:
---

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/468#discussion_r58949303

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSDirPartitionLocation.java ---
    @@ -0,0 +1,70 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +/**
    + * Class defines a single partition corresponding to a directory in a DFS table.
    + */
    +package org.apache.drill.exec.planner;
    +
    +
    +import com.google.common.collect.Lists;
    +
    +import java.util.Collection;
    +import java.util.List;
    +
    +public class DFSDirPartitionLocation implements PartitionLocation {
    +  private final Collection subPartitions;
    --- End diff --

    Yes, it could be a mix of directory partition locations and file partition locations, similar to a directory / file structure. Add comments to explain.
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230836#comment-15230836 ]

ASF GitHub Bot commented on DRILL-4589:
---

Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/468#discussion_r58926583

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSDirPartitionLocation.java ---
    @@ -0,0 +1,70 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +/**
    + * Class defines a single partition corresponding to a directory in a DFS table.
    + */
    +package org.apache.drill.exec.planner;
    +
    +
    +import com.google.common.collect.Lists;
    +
    +import java.util.Collection;
    +import java.util.List;
    +
    +public class DFSDirPartitionLocation implements PartitionLocation {
    +  private final Collection subPartitions;
    +  private final String[] dirs;
    +
    +  public DFSDirPartitionLocation(String[] dirs, Collection subPartitions) {
    +    this.subPartitions = subPartitions;
    +    this.dirs = dirs;
    +  }
    +
    +  @Override
    +  public String getPartitionValue(int index) {
    +    assert index < dirs.length;
    +    return dirs[index];
    +  }
    +
    +  @Override
    +  public String getEntirePartitionLocation() {
    +    throw new UnsupportedOperationException("Should not call getEntirePartitionLocation for composite partition location!");
    +  }
    +
    +  @Override
    +  public List getPartitionLocationRecursive() {
    --- End diff --

    It's not completely clear to me what the expected output of this method is for a directory structure such as:
       2016/Q1/Jan/1.parquet, 2.parquet
       2016/Q1/Feb/1.parquet, 2.parquet
       ...
    If called at each nesting level, does it create the full path or the relative path of the directory?
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230778#comment-15230778 ]

ASF GitHub Bot commented on DRILL-4589:
---

Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/468#discussion_r58921685

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/FileSystemPartitionDescriptor.java ---
    @@ -148,13 +139,41 @@ public String getName(int index) {
         return partitionLabel + index;
       }

    -  private String getBaseTableLocation() {
    +  protected String getBaseTableLocation() {
         final FormatSelection origSelection = (FormatSelection) table.getSelection();
         return origSelection.getSelection().selectionRoot;
       }

       @Override
       protected void createPartitionSublists() {
    +    final Collection fileLocations = getFileLocations();
    +    List locations = new LinkedList<>();
    +
    +    final String selectionRoot = getBaseTableLocation();
    +
    +    HashMapdirToFileMap = new HashMap<>();
    --- End diff --

    Can you add a comment here with an example pair?
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230775#comment-15230775 ]

ASF GitHub Bot commented on DRILL-4589:
---

Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/468#discussion_r58921279

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSDirPartitionLocation.java ---
    @@ -0,0 +1,70 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +/**
    + * Class defines a single partition corresponding to a directory in a DFS table.
    + */
    +package org.apache.drill.exec.planner;
    +
    +
    +import com.google.common.collect.Lists;
    +
    +import java.util.Collection;
    +import java.util.List;
    +
    +public class DFSDirPartitionLocation implements PartitionLocation {
    +  private final Collection subPartitions;
    --- End diff --

    Can this collection be a mix of directory partition locations and file partition locations? It has become a little confusing to keep track of the distinction between the two, since the term PartitionLocation is overloaded. Can you add appropriate javadoc to clarify?
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230667#comment-15230667 ] ASF GitHub Bot commented on DRILL-4589: --- Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/468#discussion_r58912269 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSDirPartitionLocation.java --- @@ -0,0 +1,70 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/** + * Class defines a single partition corresponding to a directory in a DFS table. + */ +package org.apache.drill.exec.planner; + + +import com.google.common.collect.Lists; + +import java.util.Collection; +import java.util.List; + +public class DFSDirPartitionLocation implements PartitionLocation { + private final Collection subPartitions; + private final String[] dirs; + + public DFSDirPartitionLocation(String[] dirs, Collection subPartitions) { +this.subPartitions = subPartitions; +this.dirs = dirs; + } + + @Override + public String getPartitionValue(int index) { +assert index < dirs.length; --- End diff -- this one actually is copied from [1]. 
I think it makes sense to change both to throw an exception instead of relying on an assertion check. Will update the patch.
[1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSPartitionLocation.java#L58
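For context on the change discussed above: Java `assert` statements are skipped unless the JVM is started with assertions enabled (`-ea`), so an explicit check behaves the same way in every run. The sketch below shows one possible shape of such a check. The class `DirPartitionSketch` is a hypothetical stand-in for the class under review, and the choice of `IllegalArgumentException` is an assumption; the thread does not say which exception the updated patch throws.

```java
// Hypothetical stand-in for the class under review; only the bounds check is the point.
final class DirPartitionSketch {
  private final String[] dirs;

  DirPartitionSketch(String[] dirs) {
    this.dirs = dirs;
  }

  /** Returns the directory name at the given level, rejecting an out-of-range index. */
  String getPartitionValue(int index) {
    // Explicit check: unlike `assert`, this runs whether or not -ea is set.
    if (index < 0 || index >= dirs.length) {
      throw new IllegalArgumentException(
          "Partition index " + index + " is out of range [0, " + dirs.length + ")");
    }
    return dirs[index];
  }

  public static void main(String[] args) {
    DirPartitionSketch partition = new DirPartitionSketch(new String[] {"1990", "Q1"});
    System.out.println(partition.getPartitionValue(1));
    try {
      partition.getPartitionValue(2);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```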
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230627#comment-15230627 ] ASF GitHub Bot commented on DRILL-4589: --- Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/468#discussion_r58911205 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/FileSystemPartitionDescriptor.java --- @@ -148,13 +139,41 @@ public String getName(int index) { return partitionLabel + index; } - private String getBaseTableLocation() { + protected String getBaseTableLocation() { --- End diff -- You are right. It should remain private. Originally, I intended to extend this class, but I decided to remove that part of the code from this PR. Will update the patch.
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229785#comment-15229785 ] ASF GitHub Bot commented on DRILL-4589: --- Github user hsuanyi commented on a diff in the pull request: https://github.com/apache/drill/pull/468#discussion_r58824977 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSDirPartitionLocation.java --- @@ -0,0 +1,70 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/** + * Class defines a single partition corresponding to a directory in a DFS table. + */ +package org.apache.drill.exec.planner; + + +import com.google.common.collect.Lists; + +import java.util.Collection; +import java.util.List; + +public class DFSDirPartitionLocation implements PartitionLocation { + private final Collection subPartitions; + private final String[] dirs; + + public DFSDirPartitionLocation(String[] dirs, Collection subPartitions) { +this.subPartitions = subPartitions; +this.dirs = dirs; + } + + @Override + public String getPartitionValue(int index) { +assert index < dirs.length; --- End diff -- I think the next line will throw IOOB if this line is not satisfied. (But this is minor thing). 
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229782#comment-15229782 ] ASF GitHub Bot commented on DRILL-4589: --- Github user hsuanyi commented on a diff in the pull request: https://github.com/apache/drill/pull/468#discussion_r58824883 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/FileSystemPartitionDescriptor.java --- @@ -148,13 +139,41 @@ public String getName(int index) { return partitionLabel + index; } - private String getBaseTableLocation() { + protected String getBaseTableLocation() { --- End diff -- I do not find this method being used outside this class. Is it intentional?
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229300#comment-15229300 ]

ASF GitHub Bot commented on DRILL-4589:
---

GitHub user jinfengni opened a pull request:

    https://github.com/apache/drill/pull/468

    DRILL-4589: Reduce planning time for file system partition pruning by…

    … reducing filter evaluation overhead

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinfengni/incubator-drill DRILL-4589

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/468.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #468

commit e207a926e65cd788700229de3ae47cf4e876
Author: Jinfeng Ni
Date: 2016-02-25T18:13:43Z

    DRILL-4589: Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229302#comment-15229302 ] ASF GitHub Bot commented on DRILL-4589: --- Github user jinfengni commented on the pull request: https://github.com/apache/drill/pull/468#issuecomment-206611168 @amansinha100 , could you please review this PR? thanks!
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229298#comment-15229298 ]

Jinfeng Ni commented on DRILL-4589:
---

I have a patch for this JIRA. Using the same dataset used in the comparison done in DRILL-2517 (115k parquet files in total, organized in 25 directories (1990, 1991, ...), each of which has four subdirectories (Q1, Q2, Q3, Q4)), here is the query planning time measured on a Mac laptop.

{code}
explain plan for select * from dfs.`/drill/testdata/tpch-sf10/lineitem115k` where dir0 = '1990' and dir1 = 'Q1';
{code}

Without the patch (on today's master branch):
{code}
1 row selected (8.084 seconds)
{code}

With the patch:
{code}
1 row selected (4.306 seconds)
{code}

If the partition filter contains a complex expression, the relative improvement is even larger. For the following query, planning time drops from 24.951 seconds to 4.393 seconds:
{code}
explain plan for select * from dfs.`/drill/testdata/tpch-sf10/lineitem115k` where concat(substr(dir0, 1, 4), substr(dir1, 1, 2)) = '1990Q1';
{code}
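To put the timings above in proportion, the short calculation below derives the speedup ratios and the number of leaf directories from the figures quoted in the comment (25 directories with four subdirectories each; 8.084 s vs. 4.306 s; 24.951 s vs. 4.393 s). The class and method names are made up for this sketch; only the input numbers come from the comment.

```java
// Hypothetical helper; the inputs are the figures quoted in the comment above.
final class PlanningSpeedupSketch {

  /** Number of leaf directories in a two-level layout. */
  static int leafDirectories(int topLevelDirs, int subDirsPerDir) {
    return topLevelDirs * subDirsPerDir;
  }

  /** Ratio of planning time without the patch to planning time with it. */
  static double speedup(double withoutPatchSeconds, double withPatchSeconds) {
    return withoutPatchSeconds / withPatchSeconds;
  }

  public static void main(String[] args) {
    // 25 year directories, each with Q1..Q4
    System.out.printf("leaf directories: %d%n", leafDirectories(25, 4));
    System.out.printf("simple filter speedup: %.2fx%n", speedup(8.084, 4.306));
    System.out.printf("complex filter speedup: %.2fx%n", speedup(24.951, 4.393));
  }
}
```

The ratios work out to about 1.9x for the simple equality filter and about 5.7x for the `concat`/`substr` filter. With roughly 115k files in 100 leaf directories, a per-directory evaluation runs the filter about 100 times where a per-file evaluation would run it about 115,000 times.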
[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead
[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229107#comment-15229107 ] Jinfeng Ni commented on DRILL-4589: --- This is related to DRILL-3759, which targets multi-phased partition pruning. Both aim to improve the efficiency of partition pruning in Drill's query planner.