[ https://issues.apache.org/jira/browse/NIFI-11178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tamas Palfy updated NIFI-11178:
-------------------------------
    Fix Version/s: 2.0.0
                   1.23.0
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> Improve memory efficiency of ListHDFS
> -------------------------------------
>
>                 Key: NIFI-11178
>                 URL: https://issues.apache.org/jira/browse/NIFI-11178
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Lehel Boér
>            Priority: Major
>             Fix For: 2.0.0, 1.23.0
>
>         Attachments: image-2023-06-28-20-48-40-182.png, image-2023-06-28-20-48-57-012.png,
>                      image-2023-06-28-20-53-21-210.png, image-2023-06-28-20-54-00-887.png,
>                      image-2023-06-28-20-54-32-188.png
>
>          Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> ListHDFS is used extremely commonly. Typically, a listing consists of several hundred files or fewer. However, there are times (especially when performing the first listing) when the Processor is configured to recurse into subdirectories and creates a listing containing millions of files.
> Currently, performing a listing containing millions of files can occupy several GB of heap space. Analyzing a recent heap dump, it was found that a listing of 6.7 million files in HDFS occupied approximately 12 GB of heap space in NiFi. The heap usage can be traced back to the fact that we hold a HashSet<FileStatus> in a local variable, and these FileStatus objects occupy approximately 1-2 KB of heap each.
> There are several improvements that can be made here; some small changes will yield significant memory savings. There are also larger changes that could reduce heap utilization dramatically, to nearly nothing, but these would require complex rewrites of the Processor that would be much more difficult to maintain.
> As a simple analysis, I built a small test to see how many FileStatus objects could be kept in memory:
> {code:java}
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.permission.FsPermission;
> import java.util.HashSet;
> import java.util.Set;
>
> final Set<FileStatus> statuses = new HashSet<>();
> for (int i = 0; i < 10_000_000; i++) {
>     if (i % 10_000 == 0) {
>         System.out.println(i); // progress indicator, so we can see roughly where the OOME occurs
>     }
>
>     final FsPermission fsPermission = new FsPermission((short) 0777);
>     final FileStatus status = new FileStatus(2, true, 1, 4_000_000, System.currentTimeMillis(), 0L, fsPermission,
>         "owner-" + i, "group-" + i, null, new Path("/path/to/my/file-" + i + ".txt"), true, false, false);
>     statuses.add(status);
> }
> {code}
> This gives us a way to see how many FileStatus objects can be added to our set before we encounter an OutOfMemoryError. With a 512 MB heap I reached approximately 1.13 million FileStatus objects. Note that the Paths here are very small, which occupies less memory than would normally be the case.
> Making one small change, replacing the {{Set<FileStatus>}} with an {{ArrayList<FileStatus>}}, yielded 1.21 million objects instead of 1.13 million. This is reasonable, since we don't expect any duplicates anyway.
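> As a minimal sketch, the only change to the test harness above is the collection type. The per-entry overhead figure is an assumption based on the standard HashMap node layout, and createFileStatus is a hypothetical helper wrapping the construction shown above:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> // Accumulate into a List instead of a Set. Each listing returns a given path at
> // most once, so deduplication buys nothing, while every HashSet entry costs an
> // extra HashMap.Node (roughly 32 bytes on a 64-bit JVM with compressed oops)
> // on top of the FileStatus itself.
> final List<FileStatus> statuses = new ArrayList<>();
> for (int i = 0; i < 10_000_000; i++) {
>     statuses.add(createFileStatus(i)); // hypothetical helper: builds the FileStatus as above
> }
> {code}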
> Another small change that was made was to introduce a new class that keeps only the fields we care about from the FileStatus:
> {code:java}
> // Slimmed-down replacement for FileStatus, keeping only the fields the listing needs.
> private static class HdfsEntity {
>     private final String path;
>     private final long length;
>     private final boolean isdir;
>     private final long timestamp;
>     private final boolean canRead;
>     private final boolean canWrite;
>     private final boolean canExecute;
>     private final boolean encrypted;
>
>     public HdfsEntity(final FileStatus status) {
>         this.path = status.getPath().getName();
>         this.length = status.getLen();
>         this.isdir = status.isDirectory();
>         this.timestamp = status.getModificationTime();
>         this.canRead = status.getPermission().getGroupAction().implies(FsAction.READ);
>         this.canWrite = status.getPermission().getGroupAction().implies(FsAction.WRITE);
>         this.canExecute = status.getPermission().getGroupAction().implies(FsAction.EXECUTE);
>         this.encrypted = status.isEncrypted();
>     }
> }
> {code}
> This introduced significant savings, allowing a {{List<HdfsEntity>}} to store 5.2 million objects instead of 1.13 million.
> It is worth noting here that HdfsEntity doesn't store the 'group' and 'owner' that are part of the FileStatus. Interestingly, though, these values are extremely repetitive, yet each FileStatus carries its own copy as a distinct String on the heap. So a full solution here would mean tracking the owner and group, but doing so in a way that uses a Map<String, String> or something similar to reuse identical Strings on the heap (in much the same way that String.intern() does, but without actually interning the String, since it is only reusable within a small context).
> These changes will yield significant memory improvements.
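> As a sketch of that per-listing String reuse (the class and method names here are illustrative, not from the merged code), a plain HashMap is enough to act as the canonicalizing cache:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> // Reuses identical Strings (owner/group names) within the scope of one listing,
> // so millions of entries share a single String instance per distinct value.
> // Unlike String.intern(), the cache is simply discarded when the listing ends.
> private static class StringCache {
>     private final Map<String, String> canonical = new HashMap<>();
>
>     public String canonicalize(final String value) {
>         if (value == null) {
>             return null;
>         }
>         // First occurrence maps the value to itself; later occurrences reuse it.
>         final String existing = canonical.putIfAbsent(value, value);
>         return existing == null ? value : existing;
>     }
> }
> {code}
> With this in place, HdfsEntity could also track owner and group by storing {{cache.canonicalize(status.getOwner())}} and {{cache.canonicalize(status.getGroup())}} rather than the raw values.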
> However, more penetrating changes can make even larger improvements:
>  * Rather than keeping a Set or a List of entities at all, we could instead iterate over each of the FileStatus objects returned from HDFS and determine whether or not the file should be included in the listing. If so, create the FlowFile or write the Record to the RecordWriter, eliminating the collection altogether. This would likely result in code similar to ListS3, which has an internal {{S3ObjectWriter}} class providing a single interface that can be used regardless of whether or not a RecordWriter is configured (a minimal sketch of that shape follows this list). This would provide very significant memory improvements; in the case of using a Record Writer, it may even reduce the footprint to nearly zero.
>  * However, when not using a Record Writer, we will still have the FlowFiles kept in memory until the session is committed. One way of dealing with this would be to change our algorithm so that, instead of performing a single listing of the directory and all sub-directories, we sort the sub-directories by name and then commit the session and update state for each sub-directory individually. This would introduce significant risk, though, in ensuring that we neither duplicate messages upon restart nor lose data. So it's perhaps not the best option.
>  * Alternatively, we could document the concern when not using a Record Writer and provide an additionalDetails.html that shows the preferred way to use the processor when many files are expected: ListHDFS connected to a SplitRecord that splits into chunks of, say, 10,000 Records, followed by a PartitionRecord that creates attributes for the fields. This would give us the same result as outputting without the Record Writer, but in a way that is much more memory efficient.
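> To make the first option concrete, here is a minimal sketch of what a ListS3-style writer abstraction might look like in ListHDFS. The interface name and method signatures are illustrative assumptions modeled on the S3ObjectWriter description above, not the actual merged implementation:
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.fs.FileStatus;
>
> // Hypothetical abstraction: each FileStatus is handed off as it streams in from
> // HDFS, so the listing never accumulates a collection of entities on the heap.
> interface HdfsObjectWriter {
>     void beginListing() throws IOException;
>     void addToListing(FileStatus status) throws IOException;
>     void finishListing() throws IOException;
> }
> {code}
> The Processor would then supply two implementations, one creating a FlowFile per accepted FileStatus and one writing each as a Record to the configured RecordWriter, and the listing loop would simply call {{addToListing}} for each file that passes the filters.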