Mark Payne created NIFI-15570:
---------------------------------

             Summary: Partial defragmentation of Content Repository via 
tail-claim truncation
                 Key: NIFI-15570
                 URL: https://issues.apache.org/jira/browse/NIFI-15570
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Core Framework
            Reporter: Mark Payne
            Assignee: Mark Payne


h3. Problem

NiFi's FileSystemRepository uses a slab-allocation strategy for storing 
FlowFile content: multiple FlowFiles are written sequentially into a single 
ResourceClaim file on disk. This is efficient because it avoids the overhead of 
creating and deleting huge numbers of small files. However, it introduces a 
fragmentation problem.

When any FlowFile still references a ResourceClaim, the entire file must be 
kept on disk — even if the vast majority of its bytes belong to FlowFiles that 
have already been removed. Consider a ResourceClaim that contains five 
ContentClaims of sizes 1 KB, 2 KB, 4 KB, 3 KB, and 1 GB. If only the 1 KB 
FlowFile remains, the full ~1 GB file stays on disk. At scale, this leads to 
disk exhaustion.

A full defragmentation (rewriting live claims into new ResourceClaim files, 
updating all references, and deleting the originals) would be extremely complex 
and expensive. But it turns out we can solve the vast majority of the problem 
without it.
h3. Key Insight

With NiFi's slab allocation, there are three possible positions for a large 
ContentClaim within a ResourceClaim:
 
 
{{<> = Small FlowFile}}
{{[................] = Large FlowFile}}

{{1. Beginning:   [................]<><><><><><><>}}
{{2. Middle:      <><><><>[................]<><><>}}
{{3. End:         <><><><><><><><>[................]}}

 
NiFi already prevents cases 1 and 2. The nifi.content.claim.max.appendable.size 
property (default: 50 KB) causes the repository to stop appending to a 
ResourceClaim once it exceeds that threshold. Since a "large" ContentClaim is 
by definition larger than this threshold, the act of writing it will push the 
ResourceClaim past the (soft) limit, causing the ResourceClaim to be closed for 
further appending. No additional ContentClaims can be written after the large 
one.

This means a large ContentClaim can only ever appear at the tail of a 
ResourceClaim. And truncating a file from the tail requires no data movement — 
it is a single FileChannel.truncate() call.
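
To make that concrete, here is a minimal, self-contained sketch (not the actual 
NiFi code) of what tail truncation amounts to: the file is simply shortened to 
the offset at which the large claim begins.

{code:java}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TailTruncationSketch {

    /**
     * Shortens the ResourceClaim file so that everything at or beyond the large
     * claim's offset is discarded. No bytes are copied or rewritten; the file
     * system simply updates the file's length.
     */
    public static void truncateTail(final Path resourceClaimFile, final long largeClaimOffset) throws IOException {
        try (FileChannel channel = FileChannel.open(resourceClaimFile, StandardOpenOption.WRITE)) {
            channel.truncate(largeClaimOffset);
        }
    }
}
{code}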

h3. Solution

This change implements "partial defragmentation" by truncating ResourceClaim 
files from the tail when the last (large) ContentClaim is removed. The approach 
consists of several coordinated components:

*Marking truncation candidates at write time* — When a ContentClaim is closed 
in FileSystemRepository, the repository checks whether it is both (a) large 
(exceeding a threshold) and (b) at a non-zero offset (i.e., not the only claim 
in the file). If both conditions hold, the claim is flagged as a truncation 
candidate via StandardContentClaim.setTruncationCandidate(true). If the claim 
is later cloned (claimant count incremented), the flag is cleared, since 
truncation is only safe when the claim has a single owner.
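
As a rough illustration (only setTruncationCandidate comes from this proposal; 
the surrounding method and threshold name are hypothetical), the write-time 
check boils down to:

{code:java}
// Hypothetical sketch of the check performed when a ContentClaim is closed.
// getOffset()/getLength() are existing ContentClaim accessors; the threshold is illustrative.
void onContentClaimClosed(final StandardContentClaim claim, final long truncationThresholdBytes) {
    final boolean isLarge = claim.getLength() >= truncationThresholdBytes;
    final boolean hasPredecessors = claim.getOffset() > 0; // not the only claim in the ResourceClaim file
    if (isLarge && hasPredecessors) {
        claim.setTruncationCandidate(true);
    }
}
{code}
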
*Routing truncatable claims through the FlowFile Repository* — When a FlowFile 
is deleted or its content is replaced, 
WriteAheadFlowFileRepository.updateContentClaims() checks if the released 
ContentClaim is a truncation candidate. If so (and the ResourceClaim itself is 
not already fully destructable), the claim is queued in 
claimsAwaitingTruncation. On the next WAL checkpoint or sync, these claims are 
drained to ResourceClaimManager.markTruncatable().
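
A sketch of how that routing could look (the queue field and helper names here 
are hypothetical; markTruncatable is the new ResourceClaimManager method 
described above):

{code:java}
// Hypothetical sketch: released claims flagged as truncation candidates are queued,
// then handed to the ResourceClaimManager on the next checkpoint/sync.
private final Queue<ContentClaim> claimsAwaitingTruncation = new ConcurrentLinkedQueue<>();

void onClaimReleased(final ContentClaim claim) {
    if (isTruncationCandidate(claim) && !isFullyDestructable(claim.getResourceClaim())) {
        claimsAwaitingTruncation.add(claim);
    }
}

void onCheckpoint() {
    ContentClaim claim;
    while ((claim = claimsAwaitingTruncation.poll()) != null) {
        resourceClaimManager.markTruncatable(claim);
    }
}
{code}
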
*Background truncation in FileSystemRepository* — A scheduled TruncateClaims 
task periodically drains truncatable claims from the ResourceClaimManager. 
Before truncating, it checks whether truncation is active for the claim's 
container (the archive must be cleared on the last cleanup run and disk usage 
must exceed the configured threshold). If the conditions are met, the file is 
truncated to the claim's offset via FileChannel.truncate(). If they are not, 
the claims are saved in a TruncationClaimManager and retried on subsequent 
runs, ensuring no truncation opportunity is lost.
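
The background task could be sketched roughly as follows (drainTruncatableClaims, 
isTruncationActive, and the retry hook are placeholder names for the behavior 
described above):

{code:java}
// Hypothetical sketch of the scheduled TruncateClaims task.
final Runnable truncateClaims = () -> {
    final List<ContentClaim> claims = new ArrayList<>();
    resourceClaimManager.drainTruncatableClaims(claims); // placeholder for the drain step

    for (final ContentClaim claim : claims) {
        final ResourceClaim resourceClaim = claim.getResourceClaim();
        if (!isTruncationActive(resourceClaim.getContainer())) {
            truncationClaimManager.retryLater(claim); // keep the opportunity for a later run
            continue;
        }

        final Path file = getPath(resourceClaim);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            channel.truncate(claim.getOffset()); // drop everything from the large claim's offset onward
        } catch (final IOException e) {
            logger.warn("Failed to truncate {}", file, e);
        }
    }
};
{code}
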
*Recovery* — On restart, WriteAheadFlowFileRepository.restoreFlowFiles() 
re-derives truncation eligibility by scanning all recovered FlowFiles, 
identifying large claims at non-zero offsets that are at the tail of their 
ResourceClaim and are not shared by multiple FlowFiles.
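
A possible shape for that recovery pass (everything here other than 
setTruncationCandidate and the standard ContentClaim accessors is illustrative):

{code:java}
// Hypothetical sketch: after restoring FlowFiles, group their content claims by
// ResourceClaim and re-flag the tail claim when it is large, not at offset zero,
// and referenced by exactly one FlowFile.
final Map<ResourceClaim, List<StandardContentClaim>> claimsByResource = recoveredContentClaims.stream()
    .collect(Collectors.groupingBy(StandardContentClaim::getResourceClaim));

for (final List<StandardContentClaim> claims : claimsByResource.values()) {
    final StandardContentClaim tail = Collections.max(claims, Comparator.comparingLong(StandardContentClaim::getOffset));
    final long referencesToTail = claims.stream().filter(c -> c.getOffset() == tail.getOffset()).count();
    if (tail.getOffset() > 0 && tail.getLength() >= truncationThresholdBytes && referencesToTail == 1) {
        tail.setTruncationCandidate(true);
    }
}
{code}
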
h3. Example

Before truncation — a 1 GB FlowFile was removed but the ResourceClaim persists 
because small FlowFiles still reference it:

{{ResourceClaim file (1,000,010 KB on disk):}}
{{  [1 KB] [2 KB] [4 KB] [3 KB] [1,000,000 KB (removed)]}}

After truncation — the file is truncated at the offset where the large claim 
began:

{{ResourceClaim file (10 KB on disk):}}
{{  [1 KB] [2 KB] [4 KB] [3 KB]}}

 
The small FlowFiles remain fully readable. The 1 GB of wasted space is 
reclaimed instantly with a single syscall.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
