Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/156#discussion_r39474948

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/ParquetPartitionDescriptor.java ---
    @@ -125,4 +117,16 @@ private String getBaseTableLocation() {
         final FormatSelection origSelection = (FormatSelection) scanRel.getDrillTable().getSelection();
         return origSelection.getSelection().selectionRoot;
       }
    +
    +  @Override
    +  protected void createPartitionSublists() {
    +    Set<String> fileLocations = ((ParquetGroupScan) scanRel.getGroupScan()).getFileSet();
    +    List<PartitionLocation> locations = new LinkedList<>();
    +    for (String file: fileLocations) {
    +      locations.add(new DFSPartitionLocation(MAX_NESTED_SUBDIRS, getBaseTableLocation(), file));
    --- End diff --

Actually, this patch was not about reducing the memory footprint per se; it was about eliminating the 64K-file limit for partition pruning. The logic of the function above is the same as the previous getPartitions(), plus the new step of splitting the list into sublists. Long filenames seem to be less of an issue for JVM heap usage: suppose we have 100K files, each with a 200-byte name; that is only 20 MB, which is small relative to the heap size. However, we should try to build a better framework for propagating filenames throughout the planning process. Right now, methods such as FormatSelection.getAsFiles() materialize all the filenames at once. Ideally, these could also expose an iterator model.
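For illustration, here is a minimal sketch of the sublist-splitting idea, under stated assumptions: `PartitionBatcher`, `toSublists()`, and `BATCH_SIZE` are hypothetical names for this example, not Drill's actual implementation or constant.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionBatcher {
  // Assumed batch size for illustration only; Drill's real batch size may differ.
  static final int BATCH_SIZE = 10_000;

  /** Split one large list of partition locations into fixed-size sublists. */
  static <T> List<List<T>> toSublists(List<T> locations) {
    List<List<T>> superList = new ArrayList<>();
    for (int start = 0; start < locations.size(); start += BATCH_SIZE) {
      int end = Math.min(start + BATCH_SIZE, locations.size());
      superList.add(locations.subList(start, end)); // view, no copy
    }
    return superList;
  }
}
```

Pruning can then be driven one sublist at a time, so no single list is bounded by the old 64K-entry limit.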
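And a sketch of what the suggested iterator model might look like; `FileSource`, `getFileIterable()`, and `totalNameBytes()` are hypothetical stand-ins, the point being that a method like FormatSelection.getAsFiles() could hand back a lazy Iterable rather than a fully materialized list.

```java
// Hypothetical lazy alternative to returning List<String> in one shot.
interface FileSource {
  /** Lazily enumerate file paths under the selection root. */
  Iterable<String> getFileIterable();
}

class FileNameStats {
  /** Consumers pull one filename at a time; the whole list is never resident. */
  static long totalNameBytes(FileSource source) {
    long bytes = 0;
    for (String file : source.getFileIterable()) {
      bytes += file.length(); // roughly one byte per char for ASCII paths
    }
    return bytes;
  }
}
```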