GitHub user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/156#discussion_r39474948
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/ParquetPartitionDescriptor.java ---
    @@ -125,4 +117,16 @@ private String getBaseTableLocation() {
         final FormatSelection origSelection = (FormatSelection) scanRel.getDrillTable().getSelection();
         return origSelection.getSelection().selectionRoot;
       }
    +
    +  @Override
    +  protected void createPartitionSublists() {
    +    Set<String> fileLocations = ((ParquetGroupScan) scanRel.getGroupScan()).getFileSet();
    +    List<PartitionLocation> locations = new LinkedList<>();
    +    for (String file: fileLocations) {
    +      locations.add(new DFSPartitionLocation(MAX_NESTED_SUBDIRS, getBaseTableLocation(), file));
    --- End diff ---
    
    Actually, this patch was not about reducing memory footprint per se; it was to eliminate the 64K-file limit for partition pruning. The logic of the function above is the same as the previous getPartitions(), plus the new splitting of the list into sublists. Long filenames seem to be less of an issue for JVM heap usage: suppose we have 100K files, each with a 200-byte name; that is only 20MB, which is relatively low compared to the heap size. However, we should try to build a better framework for propagating the filenames throughout the planning process. Right now, methods such as FormatSelection.getAsFiles() populate all the filenames at once. Ideally, these could also expose an iterator model; a rough sketch of that idea follows.
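    
    To make the idea concrete, here is a minimal sketch (not part of this patch) of consuming filenames through an iterator and grouping them into bounded-size sublists using Guava's Iterators.partition, which Drill already has on the classpath. The class name, method name, and batch-size constant below are hypothetical, purely for illustration:
    
    import java.util.Iterator;
    import java.util.List;
    
    import com.google.common.collect.Iterators;
    
    public class SublistSplitExample {
      // Hypothetical batch size, for illustration only.
      private static final int PARTITION_BATCH_SIZE = 1000;
    
      // Consume filenames lazily and process them in bounded-size sublists.
      public static void processFileNames(Iterator<String> fileNames) {
        // Iterators.partition lazily groups the underlying iterator into
        // sublists of at most PARTITION_BATCH_SIZE elements each.
        Iterator<List<String>> sublists = Iterators.partition(fileNames, PARTITION_BATCH_SIZE);
        while (sublists.hasNext()) {
          List<String> batch = sublists.next();
          // Each batch could be converted to PartitionLocation objects and
          // pruned independently; only one batch of names is held in memory
          // at a time.
          System.out.println("processing batch of " + batch.size() + " files");
        }
      }
    }
    
    With a shape like this, peak memory during pruning would be bounded by the batch size rather than by the total file count.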

