Github user amansinha100 commented on a diff in the pull request:
https://github.com/apache/drill/pull/156#discussion_r39474948
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/ParquetPartitionDescriptor.java ---
@@ -125,4 +117,16 @@ private String getBaseTableLocation() {
     final FormatSelection origSelection = (FormatSelection) scanRel.getDrillTable().getSelection();
     return origSelection.getSelection().selectionRoot;
   }
+
+  @Override
+  protected void createPartitionSublists() {
+    Set<String> fileLocations = ((ParquetGroupScan) scanRel.getGroupScan()).getFileSet();
+    List<PartitionLocation> locations = new LinkedList<>();
+    for (String file: fileLocations) {
+      locations.add(new DFSPartitionLocation(MAX_NESTED_SUBDIRS, getBaseTableLocation(), file));
--- End diff --
Actually, this patch was not about reducing the memory footprint per se; it was to eliminate the 64K-file limit for partition pruning. The function above has the same logic we had before in getPartitions(), plus the new splitting of the list into sublists. The long filenames seem to be less of an issue for JVM heap usage: with 100K files and an average name length of 200 bytes, that is only 20MB, which is small relative to the heap size.
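
For illustration, here is a minimal sketch of the sublist-splitting idea, assuming Guava's Lists.partition and a hypothetical PARTITION_BATCH_SIZE constant (the names and batch size are illustrative, not necessarily what the patch uses):

```java
import com.google.common.collect.Lists;
import java.util.List;

public class SublistSketch {
  // Assumed batch size; the real constant (if any) may differ.
  static final int PARTITION_BATCH_SIZE = 10_000;

  // Split the flat location list into fixed-size batches so that
  // downstream pruning only ever works on one batch at a time.
  public static List<List<String>> toSublists(List<String> locations) {
    // Lists.partition returns consecutive sublist views backed by the
    // original list, so the split itself copies no file names.
    return Lists.partition(locations, PARTITION_BATCH_SIZE);
  }
}
```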
However, we should try to build a better framework for propagating the filenames throughout the planning process. Right now, methods such as FormatSelection.getAsFiles() populate all the filenames at once; ideally, these could expose an iterator model instead.
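
As a rough illustration of that iterator model (a hypothetical interface, not part of this patch), something like:

```java
import java.util.Iterator;

// Hypothetical sketch: a selection that hands out file names lazily
// instead of materializing the full list the way getAsFiles() does.
public interface LazyFileSelection extends Iterable<String> {
  // Implementations can stream names from the underlying metadata
  // without ever buffering all of them in memory at once.
  @Override
  Iterator<String> iterator();
}
```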