[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-26 Thread Dechang Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259189#comment-15259189
 ] 

Dechang Gu commented on DRILL-4589:
---

Verified on the UCS cluster with the test case. Comparing runs with and 
without the fix, explain plan time improves by about 2x on average, and some 
queries improve by about 5x:

https://docs.google.com/spreadsheets/d/1dh7w8yvQ4fHTt0Bcb9xROqeXU80REjBoPZNSyff49Bg/edit#gid=0


UCS 10-node cluster, MapR 4.0 (1 master, 10 data nodes): 32 cores, 256 GB RAM, 
10 disks. 208K parquet files (12 KB each) in a 3-level directory structure.



Exec Time (ms)
  Without 4589: Apache Drill 1.7.0 master, GitId 9514cbe
  With 4589:    Apache Drill 1.7.0 master, GitId 9f4fff8
  Avg = Avg Query Time; wo/with = wo4589/with4589

                     | Without 4589                  | With 4589                     | with 4589 vs w/o 4589
Query                | Run 1   Run 2   Run 3     Avg | Run 1   Run 2   Run 3     Avg | diff in avg   wo/with (avg)   wo/with (best)
DRILL4589_EXPPLAN_01 | 21478   15491   14784   17251 |  8634    8431    9673    8913 |        8338            1.94             1.75
DRILL4589_EXPPLAN_02 | 19168   15560   15168   16632 | 10391   10665    8343    9800 |        6832            1.70             1.82
DRILL4589_EXPPLAN_03 | 15478   13606   14506   14530 |  9323    8412    9520    9085 |        5445            1.60             1.62
DRILL4589_EXPPLAN_04 | 18792   15311   14197   16100 |  8562    8525    7720    8269 |        7831            1.95             1.84
DRILL4589_EXPPLAN_05 | 18447   14852   14692   15997 |  9333    8600    7874    8602 |        7395            1.86             1.87
DRILL4589_EXPPLAN_06 | 18249   14619   15113   15994 |  9440    8133    9474    9016 |        6978            1.77             1.80
DRILL4589_EXPPLAN_07 | 17213   15377   14132   15574 |  8196    7850    8066    8037 |        7537            1.94             1.80
DRILL4589_EXPPLAN_08 | 15884   13808   16767   15486 |  8805    8212    7978    8332 |        7155            1.86             1.73
DRILL4589_EXPPLAN_09 | 14810   15947   14151   14969 |  8612    8471    8847    8643 |        6326            1.73             1.67
DRILL4589_EXPPLAN_10 | 15995   15373   16091   15820 |  9541    8879    8203    8874 |        6945            1.78             1.87
DRILL4589_EXPPLAN_11 | 18722   18239   18828   18596 |  9677    8040    7883    8533 |       10063            2.18             2.31
DRILL4589_EXPPLAN_12 | 16725   16246   16772   16581 |  8442    7888    8285    8205 |        8376            2.02             2.06
DRILL4589_EXPPLAN_13 | 17063   13647   15686   15465 |  9284    8050    9015    8783 |        6682            1.76             1.70
DRILL4589_EXPPLAN_14 | 14831   15107   14873   14937 |  8954    9336    8944    9078 |        5859            1.65             1.66
DRILL4589_EXPPLAN_15 | 15170   15548   15166   15295 |  8897    8739    8891    8842 |        6452            1.73             1.74
DRILL4589_EXPPLAN_16 | 44969   41579   41880   42809 | 10238    9777    9029    9681 |       33128            4.42             4.61
DRILL4589_EXPPLAN_17 | 43389
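
The two ratio columns appear to be derived from the run times as mean-over-mean and fastest-over-fastest; that reading is an inference from the rows shown, not something stated in the comment. A minimal sketch that recomputes both ratios for DRILL4589_EXPPLAN_01 (runs of 21478, 15491, 14784 ms without the fix and 8634, 8431, 9673 ms with it); the class and method names are hypothetical:

```java
import java.util.Arrays;

// Hypothetical helper for checking the table; not part of Drill or of the test suite.
final class SpeedupSketch {
    /** Mean time without the fix divided by mean time with the fix. */
    static double avgRatio(long[] withoutFix, long[] withFix) {
        return Arrays.stream(withoutFix).average().getAsDouble()
                / Arrays.stream(withFix).average().getAsDouble();
    }

    /** Fastest run without the fix divided by fastest run with the fix. */
    static double bestRatio(long[] withoutFix, long[] withFix) {
        return (double) Arrays.stream(withoutFix).min().getAsLong()
                / Arrays.stream(withFix).min().getAsLong();
    }

    public static void main(String[] args) {
        long[] withoutFix = {21478, 15491, 14784}; // DRILL4589_EXPPLAN_01, runs 1-3 without the fix
        long[] withFix = {8634, 8431, 9673};       // DRILL4589_EXPPLAN_01, runs 1-3 with the fix
        System.out.printf("avg ratio %.2f, best ratio %.2f%n",
                avgRatio(withoutFix, withFix), bestRatio(withoutFix, withFix));
    }
}
```

For this row the two values come out at roughly 1.94 and 1.75, which agrees with the reported ratios.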

[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-14 Thread Khurram Faraaz (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241661#comment-15241661
 ] 

Khurram Faraaz commented on DRILL-4589:
---

The following tests will be executed to verify this change.

{noformat}
There are 25 directories (1990 through 2015), each directory has 4 
subdirectories (Q1, Q2, Q3 and Q4), and each of those subdirectories has 2000 
parquet files (each ~2 KB in size).

REFRESH TABLE METADATA `DRILL_4589`
will be executed over the root directory, and tests similar to those listed 
below (and more) will be executed.

explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 IS NOT NULL;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 IS NULL;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 25 AND c1 <= 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 53;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 <= 97;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 25 AND c1 < 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 > 25 AND c1 <= 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 > 25 AND c1 < 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 LIKE 'orb%';
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 LIKE 'orb%' AND c7 = '1958-04-24';
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 IN (...)
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND LENGTH(c5) >= 1 AND LENGTH(c5) <= 172;
{noformat}

> Reduce planning time for file system partition pruning by reducing filter 
> evaluation overhead
> -
>
>                 Key: DRILL-4589
>                 URL: https://issues.apache.org/jira/browse/DRILL-4589
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> When Drill is used to query hundreds of thousands, or even millions, of files 
> organized into multi-level directories, users typically provide a 
> partition filter such as: dir0 = something and dir1 = something2 and ...
> For such queries, we saw that query planning time could be unacceptably long, 
> due to three main overheads: 1) expanding and getting the list of files, 2) 
> evaluating the partition filter, 3) getting the metadata, in the case of 
> parquet files for which a metadata cache file is not available. 
> DRILL-2517 targets the third overhead. As follow-up work after 
> DRILL-2517, we plan to reduce the filter evaluation overhead. For now, 
> partition filter evaluation is applied at the file level. In many cases, we saw 
> that the number of leaf subdirectories is significantly lower than the number 
> of files. Since all the files under the same leaf subdirectory share the same 
> directory metadata, we should apply the filter evaluation at the leaf 
> subdirectory level. By doing that, we can reduce the CPU overhead of 
> evaluating the filter, and the memory overhead as well.
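
The idea in the description can be sketched as follows: group the files by leaf directory, evaluate the partition filter once per directory, and keep every file of a matching directory. This is only an illustration under stated assumptions. The class, the method names and the use of a `Predicate<String[]>` as the filter are hypothetical and are not Drill's planner API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative only: these names are hypothetical and are not Drill's planner classes.
final class LeafDirPruningSketch {
    /** Groups file paths (relative to the table root) by their parent directory. */
    static Map<String, List<String>> groupByLeafDir(List<String> files) {
        Map<String, List<String>> byDir = new LinkedHashMap<>();
        for (String file : files) {
            int slash = file.lastIndexOf('/');
            String dir = slash < 0 ? "" : file.substring(0, slash);
            byDir.computeIfAbsent(dir, k -> new ArrayList<>()).add(file);
        }
        return byDir;
    }

    /** Evaluates the filter once per leaf directory; files of a matching directory are all kept. */
    static List<String> prune(List<String> files, Predicate<String[]> dirFilter) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : groupByLeafDir(files).entrySet()) {
            String[] dirs = entry.getKey().split("/"); // dirs[0] stands for dir0, dirs[1] for dir1
            if (dirFilter.test(dirs)) {
                selected.addAll(entry.getValue());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "2000/Q1/a.parquet", "2000/Q1/b.parquet",
                "2000/Q2/a.parquet", "2000/Q2/b.parquet",
                "2001/Q2/a.parquet");
        // Stand-in for: WHERE dir0 = '2000' AND dir1 = 'Q2'
        Predicate<String[]> filter =
                d -> d.length == 2 && d[0].equals("2000") && d[1].equals("Q2");
        System.out.println(prune(files, filter)); // [2000/Q2/a.parquet, 2000/Q2/b.parquet]
    }
}
```

Here the filter runs 3 times (once per leaf directory) instead of 5 times (once per file). For a layout with 25 x 4 leaf directories of 2000 files each, the same approach would mean 100 evaluations instead of 200,000.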



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238311#comment-15238311
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/468




[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-11 Thread Jinfeng Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236181#comment-15236181
 ] 

Jinfeng Ni commented on DRILL-4589:
---

Copied comment from pull request.

This patch does not reduce the cost of filter evaluation per row. Rather, it 
reduces the number of rows on which the filter evaluation is performed.




[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233843#comment-15233843
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user hsuanyi commented on the pull request:

https://github.com/apache/drill/pull/468#issuecomment-207904379
  
LGTM +1




[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231304#comment-15231304
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user amansinha100 commented on the pull request:

https://github.com/apache/drill/pull/468#issuecomment-207132298
  
LGTM. +1. Since this fix does not directly change the cost of per-row 
filter evaluation itself (it reduces the number of rows on which filter 
evaluation is performed), you might want to clarify that in the commit message 
or the JIRA. 




[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231233#comment-15231233
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user jinfengni commented on the pull request:

https://github.com/apache/drill/pull/468#issuecomment-207115616
  
@amansinha100 and @hsuanyi, I revised the PR based on your comments. Can you 
take another look? thx.






[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231195#comment-15231195
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/468#discussion_r58953148
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSDirPartitionLocation.java
 ---
@@ -0,0 +1,70 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Class defines a single partition corresponding to a directory in a DFS table.
+ */
+package org.apache.drill.exec.planner;
+
+
+import com.google.common.collect.Lists;
+
+import java.util.Collection;
+import java.util.List;
+
+public class DFSDirPartitionLocation implements PartitionLocation {
+  private final Collection subPartitions;
+  private final String[] dirs;
+
+  public DFSDirPartitionLocation(String[] dirs, Collection subPartitions) {
+    this.subPartitions = subPartitions;
+    this.dirs = dirs;
+  }
+
+  @Override
+  public String getPartitionValue(int index) {
+    assert index < dirs.length;
+    return dirs[index];
+  }
+
+  @Override
+  public String getEntirePartitionLocation() {
+    throw new UnsupportedOperationException("Should not call getEntirePartitionLocation for composite partition location!");
+  }
+
+  @Override
+  public List getPartitionLocationRecursive() {
--- End diff --

I changed this method so that it now returns a list of 
SimplePartitionLocation, that is, all the SimplePartitionLocations this 
location consists of. In your example, it would return 4 
DFSFilePartitionLocations if it is called on the DFSDirPartitionLocation 
corresponding to '2016'. This method is used when we construct a GroupScan 
after pruning, since only SimplePartitionLocation keeps track of the entire 
path, which is required by a GroupScan specification.

The file partition location keeps track of the full path (which is used 
when creating the GroupScan) and the partition keys. The directory partition 
location keeps track of the nested partitions and the common partition keys.
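
The relationship described here is a composite: a directory location holds nested locations, a file location holds the full path, and flattening a directory yields its files. A minimal sketch of that shape, applied to the 2016/Q1/Jan and 2016/Q1/Feb layout raised in the review. The types below are simplified, hypothetical stand-ins (not the PartitionLocation classes from the pull request), and the root "/tbl" is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Simplified, hypothetical stand-ins for the classes under review; not Drill's actual code.
final class CompositeLocationSketch {
    interface Location {
        /** Appends every file-level location at or below this node to {@code out}. */
        void collectFiles(List<FileLocation> out);
    }

    /** Leaf: remembers the full path (needed to build a scan) and its partition keys. */
    static final class FileLocation implements Location {
        final String fullPath;
        final String[] dirs;

        FileLocation(String fullPath, String... dirs) {
            this.fullPath = fullPath;
            this.dirs = dirs;
        }

        @Override
        public void collectFiles(List<FileLocation> out) {
            out.add(this);
        }
    }

    /** Composite: holds the partition keys shared by its children, which may be files or directories. */
    static final class DirLocation implements Location {
        final String[] dirs;
        final Collection<? extends Location> children;

        DirLocation(String[] dirs, Collection<? extends Location> children) {
            this.dirs = dirs;
            this.children = children;
        }

        @Override
        public void collectFiles(List<FileLocation> out) {
            for (Location child : children) {
                child.collectFiles(out);
            }
        }
    }

    /** Builds 2016/Q1/{Jan,Feb}/{1,2}.parquet under the made-up root and flattens it from "2016". */
    static List<String> example() {
        Location jan = new DirLocation(new String[] {"2016", "Q1", "Jan"}, Arrays.asList(
                new FileLocation("/tbl/2016/Q1/Jan/1.parquet", "2016", "Q1", "Jan"),
                new FileLocation("/tbl/2016/Q1/Jan/2.parquet", "2016", "Q1", "Jan")));
        Location feb = new DirLocation(new String[] {"2016", "Q1", "Feb"}, Arrays.asList(
                new FileLocation("/tbl/2016/Q1/Feb/1.parquet", "2016", "Q1", "Feb"),
                new FileLocation("/tbl/2016/Q1/Feb/2.parquet", "2016", "Q1", "Feb")));
        Location q1 = new DirLocation(new String[] {"2016", "Q1"}, Arrays.asList(jan, feb));
        Location year = new DirLocation(new String[] {"2016"}, Arrays.asList(q1));

        List<FileLocation> files = new ArrayList<>();
        year.collectFiles(files);
        List<String> paths = new ArrayList<>();
        for (FileLocation file : files) {
            paths.add(file.fullPath);
        }
        return paths;
    }

    public static void main(String[] args) {
        System.out.println(example().size()); // prints 4
    }
}
```

Flattening from the node for '2016' yields the four file-level locations with their full paths, which is the information a scan needs after pruning.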






[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231114#comment-15231114
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/468#discussion_r58949303
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSDirPartitionLocation.java
 ---
@@ -0,0 +1,70 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Class defines a single partition corresponding to a directory in a DFS table.
+ */
+package org.apache.drill.exec.planner;
+
+
+import com.google.common.collect.Lists;
+
+import java.util.Collection;
+import java.util.List;
+
+public class DFSDirPartitionLocation implements PartitionLocation {
+  private final Collection subPartitions;
--- End diff --

Yes, it can be a mix of directory partition locations and file partition 
locations, similar to a directory / file structure. 

Add comments to explain. 




[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230836#comment-15230836
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/468#discussion_r58926583
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSDirPartitionLocation.java
 ---
@@ -0,0 +1,70 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Class defines a single partition corresponding to a directory in a DFS table.
+ */
+package org.apache.drill.exec.planner;
+
+
+import com.google.common.collect.Lists;
+
+import java.util.Collection;
+import java.util.List;
+
+public class DFSDirPartitionLocation implements PartitionLocation {
+  private final Collection subPartitions;
+  private final String[] dirs;
+
+  public DFSDirPartitionLocation(String[] dirs, Collection subPartitions) {
+    this.subPartitions = subPartitions;
+    this.dirs = dirs;
+  }
+
+  @Override
+  public String getPartitionValue(int index) {
+    assert index < dirs.length;
+    return dirs[index];
+  }
+
+  @Override
+  public String getEntirePartitionLocation() {
+    throw new UnsupportedOperationException("Should not call getEntirePartitionLocation for composite partition location!");
+  }
+
+  @Override
+  public List getPartitionLocationRecursive() {
--- End diff --

It's not completely clear to me what the expected output of this method is 
for a directory structure such as: 
2016/Q1/Jan/1.parquet, 2.parquet
2016/Q1/Feb/1.parquet, 2.parquet
...
If called at each nesting level, does it create the full path or the relative 
path of the directory? 




[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230778#comment-15230778
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/468#discussion_r58921685
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/FileSystemPartitionDescriptor.java
 ---
@@ -148,13 +139,41 @@ public String getName(int index) {
     return partitionLabel + index;
   }
 
-  private String getBaseTableLocation() {
+  protected String getBaseTableLocation() {
     final FormatSelection origSelection = (FormatSelection) table.getSelection();
     return origSelection.getSelection().selectionRoot;
   }
 
   @Override
   protected void createPartitionSublists() {
+    final Collection fileLocations = getFileLocations();
+    List locations = new LinkedList<>();
+
+    final String selectionRoot = getBaseTableLocation();
+
+    HashMap dirToFileMap = new HashMap<>();
--- End diff --

Can you add a comment here with an example <key, value> pair?




[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230775#comment-15230775
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/468#discussion_r58921279
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSDirPartitionLocation.java
 ---
@@ -0,0 +1,70 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Class defines a single partition corresponding to a directory in a DFS table.
+ */
+package org.apache.drill.exec.planner;
+
+
+import com.google.common.collect.Lists;
+
+import java.util.Collection;
+import java.util.List;
+
+public class DFSDirPartitionLocation implements PartitionLocation {
+  private final Collection subPartitions;
--- End diff --

Can this collection be a mix of directory partition locations and 
file partition locations? It has become a little confusing to keep track of 
the distinction between the two, since the term PartitionLocation is overloaded. 
Can you add appropriate javadoc to clarify?


> Reduce planning time for file system partition pruning by reducing filter 
> evaluation overhead
> -
>
> Key: DRILL-4589
> URL: https://issues.apache.org/jira/browse/DRILL-4589
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Reporter: Jinfeng Ni
>Assignee: Jinfeng Ni
>
> When Drill is used to query hundreds of thousands, or even millions, of files 
> organized into multi-level directories, users typically provide a 
> partition filter like: dir0 = something and dir1 = something2 and ...
> For such queries, we saw that query planning time could be unacceptably long, 
> due to three main overheads: 1) expanding and getting the list of files, 2) 
> evaluating the partition filter, and 3) getting the metadata, in the case of 
> parquet files for which a metadata cache file is not available. 
> DRILL-2517 targets the third overhead. As follow-up work after 
> DRILL-2517, we plan to reduce the filter evaluation overhead. For now, the 
> partition filter is evaluated at the file level. In many cases, we saw 
> that the number of leaf subdirectories is significantly lower than the number 
> of files. Since all the files under the same leaf subdirectory share the same 
> directory metadata, we should evaluate the filter at the leaf 
> subdirectory level. By doing that, we could reduce the CPU overhead of 
> evaluating the filter, and the memory overhead as well.
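
A minimal sketch of the idea in the description above, not the actual patch: files are grouped by leaf directory and the partition filter is evaluated once per directory instead of once per file. The class name LeafDirPruningSketch, its helper methods, and the use of a Predicate over the directory components are assumptions made for illustration only; Drill's real planner code is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration class; not part of Drill.
final class LeafDirPruningSketch {

  /** Returns the parent directory of a relative file path, e.g. "1990/Q1/a.parquet" -> "1990/Q1". */
  public static String leafDir(String filePath) {
    int slash = filePath.lastIndexOf('/');
    return slash < 0 ? "" : filePath.substring(0, slash);
  }

  /**
   * Groups files by leaf directory, evaluates the filter once per directory,
   * and keeps every file of each directory that passes.
   */
  public static List<String> prune(List<String> files, Predicate<String[]> dirFilter) {
    Map<String, List<String>> byDir = new LinkedHashMap<>();
    for (String file : files) {
      byDir.computeIfAbsent(leafDir(file), k -> new ArrayList<>()).add(file);
    }
    List<String> selected = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : byDir.entrySet()) {
      // dirs[0] plays the role of dir0, dirs[1] of dir1, and so on.
      String[] dirs = entry.getKey().split("/");
      if (dirFilter.test(dirs)) {
        selected.addAll(entry.getValue());
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList(
        "1990/Q1/a.parquet", "1990/Q1/b.parquet", "1990/Q2/c.parquet", "1991/Q1/d.parquet");
    int[] evaluations = {0};
    Predicate<String[]> filter = d -> {
      evaluations[0]++;  // count how often the filter runs
      return d.length >= 2 && d[0].equals("1990") && d[1].equals("Q1");
    };
    System.out.println(prune(files, filter));
    System.out.println("filter evaluations: " + evaluations[0]);
  }
}
```

With the four sample paths in main, the filter runs three times (once per leaf directory) rather than four times. The gap grows with the number of files per directory, which is the case the description targets.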





[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230667#comment-15230667
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/468#discussion_r58912269
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSDirPartitionLocation.java
 ---
@@ -0,0 +1,70 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Class defines a single partition corresponding to a directory in a DFS table.
+ */
+package org.apache.drill.exec.planner;
+
+
+import com.google.common.collect.Lists;
+
+import java.util.Collection;
+import java.util.List;
+
+public class DFSDirPartitionLocation implements PartitionLocation {
+  private final Collection subPartitions;
+  private final String[] dirs;
+
+  public DFSDirPartitionLocation(String[] dirs, Collection subPartitions) {
+    this.subPartitions = subPartitions;
+    this.dirs = dirs;
+  }
+
+  @Override
+  public String getPartitionValue(int index) {
+    assert index < dirs.length;
--- End diff --

This one is actually copied from [1]. I think it makes sense to change both 
to throw an exception instead of relying on an assertion check. Will update the 
patch. 

[1] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSPartitionLocation.java#L58
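
For illustration, a sketch of what replacing the assertion with an explicit check could look like. Java assertions are skipped unless the JVM is started with -ea, so an explicit check behaves the same in production and in tests. The class name DirPartitionValueSketch, the choice of IndexOutOfBoundsException, and the message text are assumptions; the updated patch may do this differently.

```java
import java.util.Arrays;

// Hypothetical stand-in for the class under review; not the actual patch.
final class DirPartitionValueSketch {
  private final String[] dirs;

  public DirPartitionValueSketch(String[] dirs) {
    // Defensive copy so later changes to the caller's array do not leak in.
    this.dirs = Arrays.copyOf(dirs, dirs.length);
  }

  /** Returns the directory name at the given partition level, or throws if out of range. */
  public String getPartitionValue(int index) {
    if (index < 0 || index >= dirs.length) {
      throw new IndexOutOfBoundsException(
          "partition index " + index + " is outside [0, " + dirs.length + ")");
    }
    return dirs[index];
  }

  public static void main(String[] args) {
    DirPartitionValueSketch location =
        new DirPartitionValueSketch(new String[] {"1990", "Q1"});
    System.out.println(location.getPartitionValue(1));
    try {
      location.getPartitionValue(2);
    } catch (IndexOutOfBoundsException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The explicit check mainly buys a clearer error message, since the array access on the following line would already throw for an out-of-range index.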




[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230627#comment-15230627
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/468#discussion_r58911205
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/FileSystemPartitionDescriptor.java
 ---
@@ -148,13 +139,41 @@ public String getName(int index) {
 return partitionLabel + index;
   }
 
-  private String getBaseTableLocation() {
+  protected String getBaseTableLocation() {
--- End diff --

You are right. It should remain private. Originally, I intended to 
extend this class, but I decided to remove that part of the code from this PR. 
Will update the patch. 





[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229785#comment-15229785
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user hsuanyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/468#discussion_r58824977
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/DFSDirPartitionLocation.java
 ---
@@ -0,0 +1,70 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Class defines a single partition corresponding to a directory in a DFS table.
+ */
+package org.apache.drill.exec.planner;
+
+
+import com.google.common.collect.Lists;
+
+import java.util.Collection;
+import java.util.List;
+
+public class DFSDirPartitionLocation implements PartitionLocation {
+  private final Collection subPartitions;
+  private final String[] dirs;
+
+  public DFSDirPartitionLocation(String[] dirs, Collection subPartitions) {
+    this.subPartitions = subPartitions;
+    this.dirs = dirs;
+  }
+
+  @Override
+  public String getPartitionValue(int index) {
+    assert index < dirs.length;
--- End diff --

I think the next line will throw an IndexOutOfBoundsException (IOOB) anyway 
if this assertion does not hold. (But this is a minor thing.) 




[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229782#comment-15229782
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user hsuanyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/468#discussion_r58824883
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/FileSystemPartitionDescriptor.java
 ---
@@ -148,13 +139,41 @@ public String getName(int index) {
 return partitionLabel + index;
   }
 
-  private String getBaseTableLocation() {
+  protected String getBaseTableLocation() {
--- End diff --

I do not see this method being used outside this class. Is the change intentional?




[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229300#comment-15229300
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

GitHub user jinfengni opened a pull request:

https://github.com/apache/drill/pull/468

DRILL-4589: Reduce planning time for file system partition pruning by…

… reducing filter evaluation overhead

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jinfengni/incubator-drill DRILL-4589

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/468.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #468


commit e207a926e65cd788700229de3ae47cf4e876
Author: Jinfeng Ni 
Date:   2016-02-25T18:13:43Z

DRILL-4589: Reduce planning time for file system partition pruning by 
reducing filter evaluation overhead






[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229302#comment-15229302
 ] 

ASF GitHub Bot commented on DRILL-4589:
---

Github user jinfengni commented on the pull request:

https://github.com/apache/drill/pull/468#issuecomment-206611168
  
@amansinha100, could you please review this PR? Thanks!





[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-06 Thread Jinfeng Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229298#comment-15229298
 ] 

Jinfeng Ni commented on DRILL-4589:
---

I have a patch for this JIRA. I used the same dataset as the comparison 
done in DRILL-2517: 115k parquet files in total, organized in 25 
directories (1990, 1991, ...), each with four subdirectories (Q1, 
Q2, Q3, Q4). Here is the query planning time measured on a Mac laptop. 

{code}
explain plan for select * from dfs.`/drill/testdata/tpch-sf10/lineitem115k` 
where dir0 = '1990' and dir1 = 'Q1';
{code}

Without the patch (on today's master branch):
{code}
1 row selected (8.084 seconds)
{code}

With the patch:
{code}
1 row selected (4.306 seconds)
{code}

If the partition filter contains a complex expression, the relative 
improvement is even higher. For this query, planning takes 24.951 seconds 
without the patch vs. 4.393 seconds with it:
{code}
explain plan for select * from dfs.`/drill/testdata/tpch-sf10/lineitem115k` 
where concat(substr(dir0, 1, 4), substr(dir1, 1, 2)) = '1990Q1';
{code} 
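
As a quick check of the numbers above, the sketch below computes the speedup ratio and the percentage reduction for both queries. It uses only the four timings quoted in this comment; the class and method names are made up for illustration.

```java
import java.util.Locale;

// Hypothetical helper for the arithmetic only; not part of Drill.
final class PlanningSpeedupSketch {

  /** Ratio of the old planning time to the new one (2.0 means twice as fast). */
  public static double speedup(double beforeSeconds, double afterSeconds) {
    return beforeSeconds / afterSeconds;
  }

  /** Percentage of the original planning time that was removed. */
  public static double reductionPercent(double beforeSeconds, double afterSeconds) {
    return 100.0 * (beforeSeconds - afterSeconds) / beforeSeconds;
  }

  private static void report(String label, double before, double after) {
    System.out.println(String.format(Locale.ROOT, "%s: %.2fx faster, %.1f%% less planning time",
        label, speedup(before, after), reductionPercent(before, after)));
  }

  public static void main(String[] args) {
    report("simple filter (dir0 = '1990' and dir1 = 'Q1')", 8.084, 4.306);
    report("complex filter (concat/substr)", 24.951, 4.393);
  }
}
```

This works out to roughly 1.9x for the simple filter and roughly 5.7x for the complex one.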






[jira] [Commented] (DRILL-4589) Reduce planning time for file system partition pruning by reducing filter evaluation overhead

2016-04-06 Thread Jinfeng Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229107#comment-15229107
 ] 

Jinfeng Ni commented on DRILL-4589:
---

This is related to DRILL-3759, which targets multi-phase partition 
pruning. Both aim to improve the efficiency of partition pruning in 
Drill's query planner.

 
