Mark Payne created NIFI-11178:
---------------------------------

             Summary: Improve memory efficiency of ListHDFS
                 Key: NIFI-11178
                 URL: https://issues.apache.org/jira/browse/NIFI-11178
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Extensions
            Reporter: Mark Payne


ListHDFS is used extremely commonly. Typically, a listing consists of several 
hundred files or fewer. However, there are times (especially when performing the 
first listing) when the Processor is configured to recurse into subdirectories 
and produces a listing containing millions of files.

Currently, performing a listing containing millions of files can occupy several 
GB of heap space. Analyzing a recent heap dump, it was found that a listing of 
6.7 million files in HDFS occupied approximately 12 GB of heap space in NiFi. 
The heap usage can be traced back to the fact that we hold a HashSet<FileStatus> 
in a local variable, and these FileStatus objects occupy approximately 1-2 KB 
of heap each.

There are several improvements that can be made here: some small changes will 
yield significant memory savings. There are also larger changes that could 
dramatically reduce heap utilization, to nearly nothing, but these would 
require complex rewrites of the Processor that would be much more difficult to 
maintain.

As a simple analysis, I built a small test to see how many FileStatus objects 
could be kept in memory:
{code:java}
final Set<FileStatus> statuses = new HashSet<>();
for (int i = 0; i < 10_000_000; i++) {
    // Print progress every 10,000 entries so we can see roughly where the OOME occurs
    if (i % 10_000 == 0) {
        System.out.println(i);
    }

    final FsPermission fsPermission = new FsPermission((short) 0777);
    final FileStatus status = new FileStatus(2, true, 1, 4_000_000, System.currentTimeMillis(), 0L, fsPermission,
        "owner-" + i, "group-" + i, null, new Path("/path/to/my/file-" + i + ".txt"), true, false, false);
    statuses.add(status);
}
{code}
This gives us a way to see how many FileStatus objects can be added to the set 
before we encounter an OutOfMemoryError (OOME). With a 512 MB heap, I reached 
approximately 1.13 million FileStatus objects. Note that the Paths here are very 
small, so they occupy less memory than would normally be the case.

One small change, replacing the {{Set<FileStatus>}} with an 
{{ArrayList<FileStatus>}}, yielded 1.21 million objects instead of 1.13 million. 
This is reasonable, since we don't expect any duplicates anyway.
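
Concretely, that change in the test above is just the first line, with everything else unchanged:
{code:java}
final List<FileStatus> statuses = new ArrayList<>();
{code}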

Another small change was to introduce a new class that keeps only the fields we 
care about from the FileStatus:
{code:java}
private static class HdfsEntity {
    private final String path;
    private final long length;
    private final boolean isdir;
    private final long timestamp;
    private final boolean canRead;
    private final boolean canWrite;
    private final boolean canExecute;
    private final boolean encrypted;

    public HdfsEntity(final FileStatus status) {
        this.path = status.getPath().getName();
        this.length = status.getLen();
        this.isdir = status.isDirectory();
        this.timestamp = status.getModificationTime();
        this.canRead = status.getPermission().getGroupAction().implies(FsAction.READ);
        this.canWrite = status.getPermission().getGroupAction().implies(FsAction.WRITE);
        this.canExecute = status.getPermission().getGroupAction().implies(FsAction.EXECUTE);
        this.encrypted = status.isEncrypted();
    }
}
{code}
This introduced significant savings, allowing a {{List<HdfsEntity>}} to store 
5.2 million objects instead of 1.13 million.

It is worth noting here that HdfsEntity doesn't store the 'group' and 'owner' 
that are part of the FileStatus. Interestingly, though, these values are 
extremely repetitive, yet each FileStatus holds its own copy as a distinct 
String on the heap. So a full solution here would mean tracking the owner and 
group, but doing so in a way that uses a Map<String, String> or something 
similar to reuse identical Strings on the heap (in much the same way that 
String.intern() does, but without interning the String, as it's only reusable 
within a small context).
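
As a rough sketch, that de-duplication could be as simple as the following (the {{stringCache}} field and {{cached}} method are illustrative names, not existing code); the map would be scoped to a single listing so the cached Strings can be garbage collected once the listing completes:
{code:java}
// Reuses a single String instance per distinct owner/group value within one listing,
// rather than holding a separate copy for every file.
private final Map<String, String> stringCache = new HashMap<>();

private String cached(final String value) {
    if (value == null) {
        return null;
    }
    // Return the previously cached instance if we've seen this value before
    final String existing = stringCache.putIfAbsent(value, value);
    return existing == null ? value : existing;
}

// In the HdfsEntity constructor:
//   this.owner = cached(status.getOwner());
//   this.group = cached(status.getGroup());
{code}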

These changes will yield significant memory improvements.

However, more invasive changes could yield even larger improvements:
 * Rather than keeping a Set or a List of entities at all, we could simply 
iterate over each of the FileStatus objects returned from HDFS and determine 
whether or not the file should be included in the listing. If so, create the 
FlowFile or write the Record to the RecordWriter, eliminating the collection 
altogether. This would likely result in code similar to ListS3, which has an 
internal {{S3ObjectWriter}} class that provides a single interface regardless 
of whether or not a RecordWriter is being used (see the sketch after this 
list). This would provide very significant memory improvements. In the case of 
using a Record Writer, it may even reduce the heap usage to nearly 0.
 * However, when not using a Record Writer, we will still have the FlowFiles 
kept in memory until the session is committed. One way of dealing with this 
would be to change our algorithm so that instead of performing a single listing 
of the directory and all sub-directories, we sort the sub-directories by name 
and then commit the session and update state for each sub-directory 
individually. This would introduce significant risk, though, in ensuring that 
we don't duplicate data upon restart and that we don't lose data. So it's 
perhaps not the best option.
 * Alternatively, we could document the concern when not using a Record Writer 
and provide info in an additionalDetails.html that shows the preferred method 
for using the processor when expecting many files: ListHDFS would be connected 
to SplitRecord, which would split the listing into chunks of, say, 10,000 
Records. These would then go to PartitionRecord, which would create attributes 
for the fields. This would give us the same result as outputting without the 
Record Writer, but in a way that is much more memory efficient.
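
As a very rough sketch of the first option, loosely modeled on the writer abstraction in ListS3 (the {{HdfsObjectWriter}} name and method signatures below are illustrative assumptions, not existing code):
{code:java}
// Abstraction over "emit one listed file". The listing loop calls addToListing()
// as it iterates over the FileStatus objects returned from HDFS, so it never
// accumulates them in a collection. One implementation would create a FlowFile
// per entry; another would write each entry as a Record to the configured RecordWriter.
private interface HdfsObjectWriter {
    void beginListing() throws IOException;
    void addToListing(FileStatus status) throws IOException;
    void finishListing() throws IOException;
    void finishListingExceptionally(Exception cause);
}
{code}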


