[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/519


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-15 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r71059510
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java
 ---
@@ -143,6 +144,11 @@ public static final RelOptRule getDirFilterOnScan(OptimizerRulesContext optimize
  }
 
  protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectRel, TableScan scanRel) {
+if (wasAllPartitionsPruned) {
--- End diff --

Please see commit #ce509af for the changes.  Since 'all partitions pruned' is a 
special case, I had to keep track of this state and propagate it during the new 
FileSelection creation after partition pruning.  Thus, if the first-phase 
directory pruning eliminated all partitions and we re-added one for schema 
purposes, this directory selection will eventually be seen by the 
ParquetGroupScan, which will examine the 'wasAllPartitionsPruned' flag and decide 
to retrieve only one file from that directory.




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-15 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r71059388
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/FileSystemPartitionDescriptor.java
 ---
@@ -212,8 +223,12 @@ public TableScan createTableScan(List newPartitionLocation) t
 }
 
 if (scanRel instanceof DrillScanRel) {
-  final FileSelection newFileSelection = new FileSelection(null, newFiles, getBaseTableLocation());
-  final FileGroupScan newGroupScan = ((FileGroupScan)((DrillScanRel)scanRel).getGroupScan()).clone(newFileSelection);
+//  final FormatSelection formatSelection = (FormatSelection)((DynamicDrillTable)scanRel.getTable()).getSelection();
+  final FormatSelection formatSelection = (FormatSelection)table.getSelection();
+  final FileSelection newFileSelection = new FileSelection(null, newFiles, getBaseTableLocation(),
+  cacheFileRoot, formatSelection.getSelection().getDirStatus());
+  final FileGroupScan newGroupScan =
+  ((FileGroupScan)((DrillScanRel)scanRel).getGroupScan()).clone(newFileSelection, cacheFileRoot);
--- End diff --

Fixed in commit #7ae8f7a




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-15 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r71059395
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -211,9 +217,76 @@ public void testNoSupportedError() throws Exception {
 .go();
   }
 
+  @Test // DRILL-4530
+  public void testDrill4530_1() throws Exception {
+// create metadata cache
+test(String.format("refresh table metadata dfs_test.`%s/%s`", getDfsTestTmpSchemaLocation(), tableName2));
+checkForMetadataFile(tableName2);
+
+// run query and check correctness
+String query1 = String.format("select dir0, dir1, o_custkey, o_orderdate from dfs_test.`%s/%s` " +
+" where dir0=1995 and dir1='Q3'",
+getDfsTestTmpSchemaLocation(), tableName2);
+int expectedRowCount = 20;
+int expectedNumFiles = 2;
+
+int actualRowCount = testSql(query1);
+assertEquals(expectedRowCount, actualRowCount);
+String numFilesPattern = "numFiles=" + expectedNumFiles;
+String usedMetaPattern = "usedMetadataFile=true";
+String cacheFileRootPattern = String.format("%s/%s/1995/Q3", getDfsTestTmpSchemaLocation(), tableName2);
--- End diff --

I have changed the test pattern to match the 'cacheFileRoot=' string. 




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-15 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r71059368
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -607,7 +607,14 @@ public long getRowCount() {
   fileSet = Sets.newHashSet(fileNames);
 }
 
-if (fileNames.isEmpty()) {
+List finalFileNames;
+if (fileSet != null) {
+  finalFileNames = Lists.newArrayList(fileSet);
--- End diff --

Commit #ef37b77 addresses this comment. 




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-13 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70706252
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java
 ---
@@ -143,6 +144,11 @@ public static final RelOptRule getDirFilterOnScan(OptimizerRulesContext optimize
  }
 
  protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectRel, TableScan scanRel) {
+if (wasAllPartitionsPruned) {
--- End diff --

Agree that this should not be internal state.  I had added it to overcome an 
issue where the prune-scan rule gets re-applied even when there are no 
qualifying partitions (this happens because we always populate the 
newPartitions list with at least one entry to produce a valid schema during 
execution).

I tried removing this flag, but that will need some associated changes to 
prevent the rule from being reapplied without terminating.  The old code 
(without my patch) terminates the pruning because 
FileSelection.supportDirPruning() returns FALSE when the selection has 
directories, whereas for this optimization I want to prune on directories 
first.  So let me think of a way to address that. 




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-12 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70548891
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -607,7 +607,14 @@ public long getRowCount() {
   fileSet = Sets.newHashSet(fileNames);
 }
 
-if (fileNames.isEmpty()) {
+List finalFileNames;
+if (fileSet != null) {
+  finalFileNames = Lists.newArrayList(fileSet);
--- End diff --

Yes, I can try to keep only the fileSet and get rid of fileNames (although I 
would still need the finalFileNames list, since FileSelection only accepts a 
List, not a Set).
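That dedup-then-convert step can be sketched minimally as follows. The class and method names are hypothetical; only the Set-to-List conversion mirrors what the comment describes:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Sketch only: directory expansion can produce the same file name twice, so
// the names are collected into a set first, then converted back into the List
// that FileSelection's constructor expects.
public class FileNameDedup {

    public static List<String> toFinalFileNames(List<String> fileNames) {
        // LinkedHashSet drops duplicates while preserving encounter order
        return new ArrayList<>(new LinkedHashSet<>(fileNames));
    }
}
```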




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-12 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70537956
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -607,7 +607,14 @@ public long getRowCount() {
   fileSet = Sets.newHashSet(fileNames);
 }
 
-if (fileNames.isEmpty()) {
+List finalFileNames;
+if (fileSet != null) {
+  finalFileNames = Lists.newArrayList(fileSet);
--- End diff --

I see the need to remove duplicates; the confusion comes from the old code. 
Is it reasonable to change the existing code and keep only fileSet in all 
cases?  Keeping both fileNames and fileSet seems a bit confusing. 




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-12 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70536259
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/FileSystemPartitionDescriptor.java
 ---
@@ -212,8 +223,12 @@ public TableScan createTableScan(List newPartitionLocation) t
 }
 
 if (scanRel instanceof DrillScanRel) {
-  final FileSelection newFileSelection = new FileSelection(null, newFiles, getBaseTableLocation());
-  final FileGroupScan newGroupScan = ((FileGroupScan)((DrillScanRel)scanRel).getGroupScan()).clone(newFileSelection);
+//  final FormatSelection formatSelection = (FormatSelection)((DynamicDrillTable)scanRel.getTable()).getSelection();
+  final FormatSelection formatSelection = (FormatSelection)table.getSelection();
+  final FileSelection newFileSelection = new FileSelection(null, newFiles, getBaseTableLocation(),
+  cacheFileRoot, formatSelection.getSelection().getDirStatus());
+  final FileGroupScan newGroupScan =
+  ((FileGroupScan)((DrillScanRel)scanRel).getGroupScan()).clone(newFileSelection, cacheFileRoot);
--- End diff --

I agree the second parameter seems redundant.  I don't recall why I needed it 
at some point, but let me see if I can get rid of it. 




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-12 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70532442
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -211,9 +217,76 @@ public void testNoSupportedError() throws Exception {
 .go();
   }
 
+  @Test // DRILL-4530
+  public void testDrill4530_1() throws Exception {
+// create metadata cache
+test(String.format("refresh table metadata dfs_test.`%s/%s`", getDfsTestTmpSchemaLocation(), tableName2));
+checkForMetadataFile(tableName2);
+
+// run query and check correctness
+String query1 = String.format("select dir0, dir1, o_custkey, o_orderdate from dfs_test.`%s/%s` " +
+" where dir0=1995 and dir1='Q3'",
+getDfsTestTmpSchemaLocation(), tableName2);
+int expectedRowCount = 20;
+int expectedNumFiles = 2;
+
+int actualRowCount = testSql(query1);
+assertEquals(expectedRowCount, actualRowCount);
+String numFilesPattern = "numFiles=" + expectedNumFiles;
+String usedMetaPattern = "usedMetadataFile=true";
+String cacheFileRootPattern = String.format("%s/%s/1995/Q3", getDfsTestTmpSchemaLocation(), tableName2);
--- End diff --

The verification of cacheFileRootPattern probably needs to put "cacheFileRoot=" 
as the prefix.  Otherwise, the list of files in the GroupScan will always 
produce a match for the cacheFileRoot path, right?
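The point being made here can be sketched with a small helper. The class and method names are hypothetical illustrations; only the "cacheFileRoot=" prefix comes from the comment:

```java
// Sketch only: a plan fragment that lists files under the pruned directory will
// contain the bare path regardless of whether the cache root was narrowed, so
// matching on the path alone proves nothing. Anchoring the pattern to the
// "cacheFileRoot=" attribute pins the check to the intended plan property.
public class PlanPatternCheck {

    public static boolean bareMatch(String planText, String root) {
        return planText.contains(root);
    }

    public static boolean attributeMatch(String planText, String root) {
        return planText.contains("cacheFileRoot=" + root);
    }
}
```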





[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-12 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70529269
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java
 ---
@@ -143,6 +144,11 @@ public static final RelOptRule getDirFilterOnScan(OptimizerRulesContext optimize
  }
 
  protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectRel, TableScan scanRel) {
+if (wasAllPartitionsPruned) {
--- End diff --

I feel this flag may lead to scenarios where pruning does not happen when it 
should.  The flag is PruneScanRule's internal variable, and the same rule 
instance can be applied to prune multiple tables.  So if PruneScanRule turns on 
this flag for T1, we might skip the pruning logic for T2 later on, when the 
rule fires again.




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-12 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70527515
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/ParquetPartitionDescriptor.java
 ---
@@ -130,13 +131,13 @@ protected void createPartitionSublists() {
   }
 
   @Override
-  public TableScan createTableScan(List newPartitionLocation) throws Exception {
+  public TableScan createTableScan(List newPartitionLocation, String cacheFileRoot) throws Exception {
--- End diff --

ParquetPartitionDescriptor does not support the single-partition optimization. 
Should it throw an UnsupportedOperationException here and implement only 
createTableScan(newPartitionLocation)?




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-12 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70524663
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/FileSystemPartitionDescriptor.java
 ---
@@ -212,8 +223,12 @@ public TableScan createTableScan(List newPartitionLocation) t
 }
 
 if (scanRel instanceof DrillScanRel) {
-  final FileSelection newFileSelection = new FileSelection(null, newFiles, getBaseTableLocation());
-  final FileGroupScan newGroupScan = ((FileGroupScan)((DrillScanRel)scanRel).getGroupScan()).clone(newFileSelection);
+//  final FormatSelection formatSelection = (FormatSelection)((DynamicDrillTable)scanRel.getTable()).getSelection();
+  final FormatSelection formatSelection = (FormatSelection)table.getSelection();
+  final FileSelection newFileSelection = new FileSelection(null, newFiles, getBaseTableLocation(),
+  cacheFileRoot, formatSelection.getSelection().getDirStatus());
+  final FileGroupScan newGroupScan =
+  ((FileGroupScan)((DrillScanRel)scanRel).getGroupScan()).clone(newFileSelection, cacheFileRoot);
--- End diff --

This newFileSelection already has cacheFileRoot.  Do you think clone() only 
needs to take the FileSelection, since the FileSelection carries the 
cacheFileRoot info?






[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-12 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70524553
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -607,7 +607,14 @@ public long getRowCount() {
   fileSet = Sets.newHashSet(fileNames);
 }
 
-if (fileNames.isEmpty()) {
+List finalFileNames;
+if (fileSet != null) {
+  finalFileNames = Lists.newArrayList(fileSet);
--- End diff --

Actually, fileNames can contain duplicates, whereas fileSet holds a unique set 
of names.  fileSet was already present in the code and used by 
removeUnneededRowGroups().  I am leveraging fileSet because, when expanding 
directories, it is possible to end up with duplicate files, which we don't 
want.  I could perhaps add a comment here. 




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-12 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70514652
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java
 ---
@@ -269,13 +283,54 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
 int recordCount = 0;
 int qualifiedCount = 0;
 
-// Inner loop: within each batch iterate over the PartitionLocations
-for(PartitionLocation part: partitions){
-  if(!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1){
-newPartitions.add(part);
-qualifiedCount++;
+if (checkForSingle &&
+partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) {
+  // Inner loop: within each batch iterate over the PartitionLocations
+  for (PartitionLocation part : partitions) {
--- End diff --

Yes, it's ok to put the refactoring effort into an enhancement JIRA. 




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-12 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70514381
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -607,7 +607,14 @@ public long getRowCount() {
   fileSet = Sets.newHashSet(fileNames);
 }
 
-if (fileNames.isEmpty()) {
+List finalFileNames;
+if (fileSet != null) {
+  finalFileNames = Lists.newArrayList(fileSet);
--- End diff --

Can we use fileNames only?  From lines 589 / 607, it seems fileSet is assigned 
from fileNames; they appear to be the same under the two ELSE branches. 




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-07 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70002787
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java
 ---
@@ -269,13 +283,54 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
 int recordCount = 0;
 int qualifiedCount = 0;
 
-// Inner loop: within each batch iterate over the PartitionLocations
-for(PartitionLocation part: partitions){
-  if(!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1){
-newPartitions.add(part);
-qualifiedCount++;
+if (checkForSingle &&
+partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) {
+  // Inner loop: within each batch iterate over the PartitionLocations
+  for (PartitionLocation part : partitions) {

@jinfengni I started doing the refactoring to move this particular 
optimization into a partition-descriptor-specific class.  However, I was having 
to propagate several pieces of internal state to the doSinglePartOpt() method, 
such as the partition map, the referenced-dirs bitset, the value vector array, 
and others.  Also, since this optimization occurs inside an outer loop, it is 
not straightforward without a broader restructuring of the 
PruneScanRule.doOnMatch() method.  Would you be ok with deferring this 
refactoring to an enhancement JIRA?




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-07 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70002368
  
--- Diff: 
contrib/storage-hive/core/src/main/java/org/apache/drill/exec/planner/sql/HivePartitionDescriptor.java
 ---
@@ -151,7 +152,7 @@ protected void createPartitionSublists() {
   }
 
   @Override
-  public TableScan createTableScan(List newPartitions) throws Exception {
+  public TableScan createTableScan(List newPartitions, String cacheFileRoot) throws Exception {
--- End diff --

Refactored this interface method.  Updated the PR with a separate commit.




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-07-07 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r70002215
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java
 ---
@@ -269,13 +283,54 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
 int recordCount = 0;
 int qualifiedCount = 0;
 
-// Inner loop: within each batch iterate over the PartitionLocations
-for(PartitionLocation part: partitions){
-  if(!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1){
-newPartitions.add(part);
-qualifiedCount++;
+if (checkForSingle &&
+partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) {
+  // Inner loop: within each batch iterate over the PartitionLocations
+  for (PartitionLocation part : partitions) {
+assert part.isCompositePartition();
+if(!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1) {
+  newPartitions.add(part);
+  if (isSinglePartition) { // only need to do this if we are already single partition
+// compose the array of partition values for the directories that are referenced by filter:
+// e.g suppose the dir hierarchy is year/quarter/month and the query is:
+// SELECT * FROM T WHERE dir0=2015 AND dir1 = 'Q1',
+// then for 2015/Q1/Feb, this will have ['2015', 'Q1', null]
--- End diff --

I have fixed this case and updated the pull request with a separate commit 
(not squashed yet, so you can review it separately). 




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-28 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r68814702
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java 
---
@@ -47,16 +47,25 @@
   private List statuses;
 
   public List files;
+  /**
+   * root path for the selections
+   */
   public final String selectionRoot;
+  /**
+   * root path for the metadata cache file (if any)
+   */
+  public final String cacheFileRoot;
--- End diff --

Thanks for the explanation. It makes sense and I do see why updating 
selectionRoot would cause problems for the queries you mentioned. 




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-27 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r68678168
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java 
---
@@ -47,16 +47,25 @@
   private List statuses;
 
   public List files;
+  /**
+   * root path for the selections
+   */
   public final String selectionRoot;
+  /**
+   * root path for the metadata cache file (if any)
+   */
+  public final String cacheFileRoot;
--- End diff --

That was my initial approach (updating the selectionRoot without keeping a 
separate cacheFileRoot).  However, I ran into a few issues.  The main one I 
recall is that the dir0, dir1, etc. columns are associated with the 
selectionRoot.  Suppose I run the following query: 
SELECT dir0, dir1 FROM dfs.tmp.t2 WHERE dir0=2015 AND dir1='Q1' 
If the selectionRoot gets modified to point to '2015/Q1', then we have lost the 
context of the original dir0, dir1, because everything becomes relative to the 
new selectionRoot.  This produces wrong results.  The same problem occurred 
with a SELECT * query, where the directory columns were not showing up 
correctly. 
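The failure mode described here can be sketched minimally. The helper below is hypothetical, but it captures the relationship the comment relies on: the dirN columns are simply the path components of each file relative to selectionRoot, so moving the root down the tree changes what dir0 and dir1 mean:

```java
import java.util.Arrays;

// Sketch only: derive the dirN partition columns of a file from a selection root.
public class DirColumnsSketch {

    public static String[] dirColumns(String selectionRoot, String filePath) {
        String rel = filePath.substring(selectionRoot.length());
        if (rel.startsWith("/")) {
            rel = rel.substring(1);
        }
        String[] parts = rel.split("/");
        // the last component is the file name; the rest are dir0, dir1, ...
        return Arrays.copyOf(parts, parts.length - 1);
    }
}
```

With the original root, `/tmp/t2/2015/Q1/f.parquet` yields dir0=2015 and dir1=Q1; once the root is moved to `/tmp/t2/2015/Q1`, the same file yields no directory columns at all, which is exactly the lost context the comment describes.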




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-27 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r68673292
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java
 ---
@@ -269,13 +283,54 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
 int recordCount = 0;
 int qualifiedCount = 0;
 
-// Inner loop: within each batch iterate over the PartitionLocations
-for(PartitionLocation part: partitions){
-  if(!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1){
-newPartitions.add(part);
-qualifiedCount++;
+if (checkForSingle &&
+partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) {
+  // Inner loop: within each batch iterate over the PartitionLocations
+  for (PartitionLocation part : partitions) {
+assert part.isCompositePartition();
+if(!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1) {
+  newPartitions.add(part);
+  if (isSinglePartition) { // only need to do this if we are already single partition
+// compose the array of partition values for the directories that are referenced by filter:
+// e.g suppose the dir hierarchy is year/quarter/month and the query is:
+// SELECT * FROM T WHERE dir0=2015 AND dir1 = 'Q1',
+// then for 2015/Q1/Feb, this will have ['2015', 'Q1', null]
--- End diff --

Good point.  The path [2015, null, Jan] should *not* qualify for the 
single-partition optimization, because in the general case there could be 
multiple 'Jan' subdirectories under the dir1 directory.  We would still use the 
metadata cache, but at the level of dir0, so ideally one should get 
cacheFileRoot=/tmp/t2/2015.  I have a fix for this and will update the PR. 




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-20 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67716686
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ---
@@ -47,16 +47,25 @@
   private List<FileStatus> statuses;

   public List<String> files;
+  /**
+   * root path for the selections
+   */
   public final String selectionRoot;
+  /**
+   * root path for the metadata cache file (if any)
+   */
+  public final String cacheFileRoot;
--- End diff --
--- End diff --

When singlePartitionOpt is applied, is it possible to update selectionRoot
to be cacheFileRoot? That is, we would not maintain cacheFileRoot separately;
instead, a FileSelection with an updated selectionRoot would be used when
singlePartitionOpt is applied.




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-20 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67711244
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java ---
@@ -269,13 +283,54 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
 int recordCount = 0;
 int qualifiedCount = 0;

-// Inner loop: within each batch iterate over the PartitionLocations
-for (PartitionLocation part : partitions) {
-  if (!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1) {
-    newPartitions.add(part);
-    qualifiedCount++;
+if (checkForSingle &&
+    partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) {
+  // Inner loop: within each batch iterate over the PartitionLocations
+  for (PartitionLocation part : partitions) {
+    assert part.isCompositePartition();
+    if (!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1) {
+      newPartitions.add(part);
+      if (isSinglePartition) { // only need to do this if we are already single partition
+        // compose the array of partition values for the directories that are referenced by filter:
+        // e.g. suppose the dir hierarchy is year/quarter/month and the query is:
+        //   SELECT * FROM T WHERE dir0=2015 AND dir1 = 'Q1',
+        // then for 2015/Q1/Feb, this will have ['2015', 'Q1', null]
--- End diff --

For the WHERE condition dir0=2015 AND dir2 = 'Jan', if the dataset happens to
have only one 'Jan' under the '2015' directory, will this qualify for
singlePartitionOpt?





[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-20 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67709299
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java ---
@@ -320,7 +377,17 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
   condition = condition.accept(reverseVisitor);
   pruneCondition = pruneCondition.accept(reverseVisitor);

-  RelNode inputRel = descriptor.createTableScan(newPartitions);
+  String cacheFileRoot = null;
+  if (checkForSingle && isSinglePartition) {
+    // if metadata cache file could potentially be used, then assign a proper cacheFileRoot
+    String path = "";
+    for (int j = 0; j <= maxIndex; j++) {
+      path += "/" + spInfo[j];
--- End diff --

Related to line 313: here we do not check spInfo[j] == null?
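The question above is about composing the cache-file root path from the pinned directory values. A hedged, self-contained sketch (hypothetical names, not the PR's actual code) shows why a null element matters during composition: without a guard, a null would be appended literally into the path.

```java
// Hypothetical sketch of composing a cacheFileRoot from pinned directory
// values. A null element means no filter pinned that dir level, so the
// composition must stop there rather than emit a literal "null" segment.
public class CacheRootCompose {
    static String composeCacheFileRoot(String selectionRoot, String[] spInfo, int maxIndex) {
        StringBuilder path = new StringBuilder(selectionRoot);
        for (int j = 0; j <= maxIndex; j++) {
            if (spInfo[j] == null) {
                break; // nothing pinned at this level; stop descending
            }
            path.append('/').append(spInfo[j]);
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // fully pinned prefix: descend all the way
        if (!composeCacheFileRoot("/tmp/t2", new String[]{"2015", "Q1"}, 1)
                .equals("/tmp/t2/2015/Q1")) {
            throw new AssertionError();
        }
        // gap at dir1: stop at the dir0 level
        if (!composeCacheFileRoot("/tmp/t2", new String[]{"2015", null, "Jan"}, 2)
                .equals("/tmp/t2/2015")) {
            throw new AssertionError();
        }
    }
}
```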




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-18 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67602442
  
--- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/planner/sql/HivePartitionDescriptor.java ---
@@ -151,7 +152,7 @@ protected void createPartitionSublists() {
   }

   @Override
-  public TableScan createTableScan(List<PartitionLocation> newPartitions) throws Exception {
+  public TableScan createTableScan(List<PartitionLocation> newPartitions, String cacheFileRoot) throws Exception {
--- End diff --
--- End diff --

I suppose I could add a second createTableScan method to the interface that
takes the second parameter, and keep the old method as-is.




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-18 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67602385
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java ---
@@ -269,13 +283,54 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
 int recordCount = 0;
 int qualifiedCount = 0;

-// Inner loop: within each batch iterate over the PartitionLocations
-for (PartitionLocation part : partitions) {
-  if (!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1) {
-    newPartitions.add(part);
-    qualifiedCount++;
+if (checkForSingle &&
+    partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) {
+  // Inner loop: within each batch iterate over the PartitionLocations
+  for (PartitionLocation part : partitions) {
--- End diff --

Yes, the rule has become too complex.  I will go ahead and at least 
refactor this specific optimization. 




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-18 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67602350
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java ---
@@ -269,13 +283,54 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
 int recordCount = 0;
 int qualifiedCount = 0;

-// Inner loop: within each batch iterate over the PartitionLocations
-for (PartitionLocation part : partitions) {
-  if (!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1) {
-    newPartitions.add(part);
-    qualifiedCount++;
+if (checkForSingle &&
+    partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) {
+  // Inner loop: within each batch iterate over the PartitionLocations
+  for (PartitionLocation part : partitions) {
+    assert part.isCompositePartition();
+    if (!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1) {
+      newPartitions.add(part);
+      if (isSinglePartition) { // only need to do this if we are already single partition
+        // compose the array of partition values for the directories that are referenced by filter:
+        // e.g. suppose the dir hierarchy is year/quarter/month and the query is:
+        //   SELECT * FROM T WHERE dir0=2015 AND dir1 = 'Q1',
+        // then for 2015/Q1/Feb, this will have ['2015', 'Q1', null]
+        // Note that we are not using the PartitionLocation here but composing a different list because
+        // we are only interested in the directory columns that are referenced in the filter condition, not
+        // the SELECT list or other parts of the query.
+        Pair<String[], Integer> p = composePartition(referencedDirsBitSet, partitionMap, vectors, recordCount);
+        String[] parts = p.getLeft();
+        int tmpIndex = p.getRight();
+        if (spInfo == null) {
+          spInfo = parts;
+          maxIndex = tmpIndex;
+        } else if (maxIndex != tmpIndex) {
+          isSinglePartition = false;
+          break;
+        } else {
+          // we only want to compare until the maxIndex inclusive since subsequent values would be null
+          for (int j = 0; j <= maxIndex; j++) {
+            if (spInfo[j] == null // prefixes should be non-null
--- End diff --

If the query has, for example, WHERE dir2 = 'January' and no condition on
dir0 or dir1, then the spInfo array itself will be non-null but will contain
null elements: [null, null, 'January'] with maxIndex = 2.  In this case the
single-partition optimization should not be applied. I can add a comment here
to explain.
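The [null, null, 'January'] case above can be reproduced with a small standalone sketch. This is illustrative only (the names and signature do not match the PR's actual composePartition): it builds a per-row value array containing only the dir levels referenced by the filter, which is how a gapped array like [null, null, 'January'] arises.

```java
import java.util.BitSet;

// Hypothetical sketch of building the per-row partition-value array from
// only the dir columns referenced in the filter. Unreferenced levels stay
// null, which is why WHERE dir2='January' alone yields [null, null, "January"].
public class ComposePartition {
    /**
     * dirValues: actual values for dir0..dirN on this row;
     * referenced: bits set for the dir levels that appear in the filter.
     */
    static String[] compose(String[] dirValues, BitSet referenced) {
        String[] out = new String[dirValues.length];
        for (int i = referenced.nextSetBit(0); i >= 0; i = referenced.nextSetBit(i + 1)) {
            out[i] = dirValues[i];
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet refs = new BitSet();
        refs.set(2); // filter references only dir2
        String[] sp = compose(new String[]{"2015", "Q1", "January"}, refs);
        if (sp[0] != null || sp[1] != null || !"January".equals(sp[2])) {
            throw new AssertionError("expected [null, null, January]");
        }
    }
}
```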




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-17 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67567993
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java ---
@@ -269,13 +283,54 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
 int recordCount = 0;
 int qualifiedCount = 0;

-// Inner loop: within each batch iterate over the PartitionLocations
-for (PartitionLocation part : partitions) {
-  if (!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1) {
-    newPartitions.add(part);
-    qualifiedCount++;
+if (checkForSingle &&
+    partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) {
+  // Inner loop: within each batch iterate over the PartitionLocations
+  for (PartitionLocation part : partitions) {
+    assert part.isCompositePartition();
+    if (!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1) {
+      newPartitions.add(part);
+      if (isSinglePartition) { // only need to do this if we are already single partition
+        // compose the array of partition values for the directories that are referenced by filter:
+        // e.g. suppose the dir hierarchy is year/quarter/month and the query is:
+        //   SELECT * FROM T WHERE dir0=2015 AND dir1 = 'Q1',
+        // then for 2015/Q1/Feb, this will have ['2015', 'Q1', null]
+        // Note that we are not using the PartitionLocation here but composing a different list because
+        // we are only interested in the directory columns that are referenced in the filter condition, not
+        // the SELECT list or other parts of the query.
+        Pair<String[], Integer> p = composePartition(referencedDirsBitSet, partitionMap, vectors, recordCount);
+        String[] parts = p.getLeft();
+        int tmpIndex = p.getRight();
+        if (spInfo == null) {
+          spInfo = parts;
+          maxIndex = tmpIndex;
+        } else if (maxIndex != tmpIndex) {
+          isSinglePartition = false;
+          break;
+        } else {
+          // we only want to compare until the maxIndex inclusive since subsequent values would be null
+          for (int j = 0; j <= maxIndex; j++) {
+            if (spInfo[j] == null // prefixes should be non-null
--- End diff --

From lines 305-306, spInfo and maxIndex are in sync. Why would we have
spInfo[j] == null when j <= maxIndex? I thought maxIndex is obtained such that
every element of spInfo up to it is non-null.




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-17 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67567703
  
--- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/planner/sql/HivePartitionDescriptor.java ---
@@ -151,7 +152,7 @@ protected void createPartitionSublists() {
   }

   @Override
-  public TableScan createTableScan(List<PartitionLocation> newPartitions) throws Exception {
+  public TableScan createTableScan(List<PartitionLocation> newPartitions, String cacheFileRoot) throws Exception {
--- End diff --

From the comment, cacheFileRoot could be null. But is it possible to not add
this parameter to HivePartitionDescriptor? After all, cacheFileRoot does not
apply to Hive partitions.




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-17 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67567343
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java ---
@@ -269,13 +283,54 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
 int recordCount = 0;
 int qualifiedCount = 0;

-// Inner loop: within each batch iterate over the PartitionLocations
-for (PartitionLocation part : partitions) {
-  if (!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1) {
-    newPartitions.add(part);
-    qualifiedCount++;
+if (checkForSingle &&
+    partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) {
+  // Inner loop: within each batch iterate over the PartitionLocations
+  for (PartitionLocation part : partitions) {
--- End diff --

Do you think it's possible to refactor this part of the code so that only the
file-system prune scan has this logic? The single-partition optimization
applies only to file-system pruning. Also, the doOnMatch() method is getting
bigger. Something like: the descriptor has supportsSinglePartOptimization(),
and the prune scan rule has doSinglePartOpt(), which is a no-op by default and
has an implementation for the file-system prune rule.




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-16 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67392462
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ---
@@ -208,8 +209,18 @@ public DrillTable isReadable(DrillFileSystem fs, FileSelection selection,
     FileSystemPlugin fsPlugin, String storageEngineName, String userName)
     throws IOException {
   // TODO: we only check the first file for directory reading.
-  if (selection.containsDirectories(fs)) {
-    if (isDirReadable(fs, selection.getFirstPath(fs))) {
+  if (selection.containsDirectories(fs)) {
+    Path dirMetaPath = new Path(selection.getSelectionRoot(), Metadata.METADATA_DIRECTORIES_FILENAME);
+    if (fs.exists(dirMetaPath)) {
+      ParquetTableMetadataDirs mDirs = Metadata.readMetadataDirs(fs, dirMetaPath.toString());
+      if (mDirs.getDirectories().size() > 0) {
+        FileSelection dirSelection = FileSelection.createFromDirectories(mDirs.getDirectories(), selection);
+        dirSelection.setExpandedPartial();
+        return new DynamicDrillTable(fsPlugin, storageEngineName, userName,
--- End diff --

Makes sense. I will add a comment. Thanks for reviewing.




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-16 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67387102
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ---
@@ -208,8 +209,18 @@ public DrillTable isReadable(DrillFileSystem fs, FileSelection selection,
     FileSystemPlugin fsPlugin, String storageEngineName, String userName)
     throws IOException {
   // TODO: we only check the first file for directory reading.
-  if (selection.containsDirectories(fs)) {
-    if (isDirReadable(fs, selection.getFirstPath(fs))) {
+  if (selection.containsDirectories(fs)) {
+    Path dirMetaPath = new Path(selection.getSelectionRoot(), Metadata.METADATA_DIRECTORIES_FILENAME);
+    if (fs.exists(dirMetaPath)) {
+      ParquetTableMetadataDirs mDirs = Metadata.readMetadataDirs(fs, dirMetaPath.toString());
+      if (mDirs.getDirectories().size() > 0) {
+        FileSelection dirSelection = FileSelection.createFromDirectories(mDirs.getDirectories(), selection);
+        dirSelection.setExpandedPartial();
+        return new DynamicDrillTable(fsPlugin, storageEngineName, userName,
--- End diff --

A comment then, perhaps.




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-15 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67289117
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ---
@@ -208,8 +209,18 @@ public DrillTable isReadable(DrillFileSystem fs, FileSelection selection,
     FileSystemPlugin fsPlugin, String storageEngineName, String userName)
     throws IOException {
   // TODO: we only check the first file for directory reading.
-  if (selection.containsDirectories(fs)) {
-    if (isDirReadable(fs, selection.getFirstPath(fs))) {
+  if (selection.containsDirectories(fs)) {
+    Path dirMetaPath = new Path(selection.getSelectionRoot(), Metadata.METADATA_DIRECTORIES_FILENAME);
+    if (fs.exists(dirMetaPath)) {
+      ParquetTableMetadataDirs mDirs = Metadata.readMetadataDirs(fs, dirMetaPath.toString());
+      if (mDirs.getDirectories().size() > 0) {
+        FileSelection dirSelection = FileSelection.createFromDirectories(mDirs.getDirectories(), selection);
+        dirSelection.setExpandedPartial();
+        return new DynamicDrillTable(fsPlugin, storageEngineName, userName,
--- End diff --

I intentionally don't call isDirReadable() here because that method returns
true if a metadata cache file exists, and I am doing a similar check for the
directories file here with fs.exists(dirMetaPath).  If this check fails, we
will fall through to the old code path (line 223), which does check
isDirReadable().
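The check order described above can be sketched in isolation. This is a hedged, standalone illustration (the filename and helper are placeholders, not Drill's actual constants): prefer the directories metadata file when it exists, otherwise fall through to the legacy directory-readability path.

```java
import java.util.Set;

// Hypothetical sketch of the dispatch order discussed above: if the
// directories metadata file exists under the selection root, take the
// expanded-partial directory-selection path; otherwise fall through to
// the legacy isDirReadable() path. "dir_metadata_file" is a placeholder
// name, not Drill's actual METADATA_DIRECTORIES_FILENAME value.
public class DirSelectionOrder {
    // Stand-in for fs.exists(); a real implementation queries the FileSystem.
    static boolean exists(Set<String> fakeFs, String path) {
        return fakeFs.contains(path);
    }

    static String choosePath(Set<String> fakeFs, String selectionRoot) {
        String dirMetaPath = selectionRoot + "/dir_metadata_file";
        if (exists(fakeFs, dirMetaPath)) {
            return "expanded-partial directory selection";
        }
        return "legacy isDirReadable() path";
    }

    public static void main(String[] args) {
        if (!choosePath(Set.of("/tmp/t2/dir_metadata_file"), "/tmp/t2")
                .equals("expanded-partial directory selection")) {
            throw new AssertionError();
        }
        if (!choosePath(Set.of(), "/tmp/t2").equals("legacy isDirReadable() path")) {
            throw new AssertionError();
        }
    }
}
```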




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-15 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/519#discussion_r67272839
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ---
@@ -208,8 +209,18 @@ public DrillTable isReadable(DrillFileSystem fs, FileSelection selection,
     FileSystemPlugin fsPlugin, String storageEngineName, String userName)
     throws IOException {
   // TODO: we only check the first file for directory reading.
-  if (selection.containsDirectories(fs)) {
-    if (isDirReadable(fs, selection.getFirstPath(fs))) {
+  if (selection.containsDirectories(fs)) {
+    Path dirMetaPath = new Path(selection.getSelectionRoot(), Metadata.METADATA_DIRECTORIES_FILENAME);
+    if (fs.exists(dirMetaPath)) {
+      ParquetTableMetadataDirs mDirs = Metadata.readMetadataDirs(fs, dirMetaPath.toString());
+      if (mDirs.getDirectories().size() > 0) {
+        FileSelection dirSelection = FileSelection.createFromDirectories(mDirs.getDirectories(), selection);
+        dirSelection.setExpandedPartial();
+        return new DynamicDrillTable(fsPlugin, storageEngineName, userName,
--- End diff --

Are you missing the isDirReadable() check here?




[GitHub] drill pull request #519: DRILL-4530: Optimize partition pruning with metadat...

2016-06-12 Thread amansinha100
GitHub user amansinha100 opened a pull request:

https://github.com/apache/drill/pull/519

DRILL-4530: Optimize partition pruning with metadata caching for the …

…single partition case.

 - Enhance PruneScanRule to detect single partitions based on referenced dirs in the filter.
 - Keep a new status of EXPANDED_PARTIAL for FileSelection.
 - Create separate .directories metadata file to prune directories first before files.
 - Introduce cacheFileRoot attribute to keep track of the parent directory of the cache file after partition pruning.
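The "prune directories first before files" idea in the bullets above can be sketched as a two-phase filter. This is a hedged illustration with hypothetical names (not the PR's code): phase one prunes the cheap directory list, phase two keeps only the files under surviving directories.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hedged sketch of two-phase pruning: first prune directories (cheap,
// e.g. from a directories-only metadata file), then prune files only
// within the directories that survived. Names are illustrative.
public class TwoPhasePrune {
    static List<String> pruneDirs(List<String> dirs, Predicate<String> keep) {
        return dirs.stream().filter(keep).collect(Collectors.toList());
    }

    static List<String> pruneFiles(List<String> files, List<String> keptDirs) {
        return files.stream()
                .filter(f -> keptDirs.stream().anyMatch(f::startsWith))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> dirs = List.of("/t/2015/Q1", "/t/2015/Q2", "/t/2014/Q1");
        List<String> files = List.of("/t/2015/Q1/a.parquet", "/t/2014/Q1/b.parquet");
        // phase 1: directory pruning on a dir0=2015 style predicate
        List<String> kept = pruneDirs(dirs, d -> d.startsWith("/t/2015"));
        // phase 2: file pruning restricted to surviving directories
        List<String> survivors = pruneFiles(files, kept);
        if (!survivors.equals(List.of("/t/2015/Q1/a.parquet"))) {
            throw new AssertionError("expected only the 2015/Q1 file");
        }
    }
}
```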

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/amansinha100/incubator-drill DRILL-4530-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/519.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #519


commit 9c9687e804fa05c8f4b7b065738c458cb88bf5c4
Author: Aman Sinha 
Date:   2016-03-25T19:55:59Z

DRILL-4530: Optimize partition pruning with metadata caching for the single 
partition case.

 - Enhance PruneScanRule to detect single partitions based on referenced dirs in the filter.
 - Keep a new status of EXPANDED_PARTIAL for FileSelection.
 - Create separate .directories metadata file to prune directories first before files.
 - Introduce cacheFileRoot attribute to keep track of the parent directory of the cache file after partition pruning.



