Mark Payne created NIFI-15570:
---------------------------------
Summary: Partial defragmentation of Content Repository via
tail-claim truncation
Key: NIFI-15570
URL: https://issues.apache.org/jira/browse/NIFI-15570
Project: Apache NiFi
Issue Type: Improvement
Components: Core Framework
Reporter: Mark Payne
Assignee: Mark Payne
h3. Problem
NiFi's FileSystemRepository uses a slab-allocation strategy for storing
FlowFile content: multiple FlowFiles are written sequentially into a single
ResourceClaim file on disk. This is efficient because it avoids the overhead of
creating and deleting huge numbers of small files. However, it introduces a
fragmentation problem.

When any FlowFile still references a ResourceClaim, the entire file must be
kept on disk, even if the vast majority of its bytes belong to FlowFiles that
have already been removed. Consider a ResourceClaim that contains five
ContentClaims of sizes 1 KB, 2 KB, 4 KB, 3 KB, and 1 GB. If only the 1 KB
FlowFile remains, the full ~1 GB file stays on disk. At scale, this leads to
disk exhaustion.

A full defragmentation (rewriting live claims into new ResourceClaim files,
updating all references, and deleting the originals) would be extremely complex
and expensive. But it turns out we can solve the vast majority of the problem
without it.
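
To make the arithmetic of the five-claim example concrete, here is a minimal sketch that sums the bytes retained on disk but no longer referenced. The class and method names are hypothetical and not part of NiFi, and 1 GB is taken as 1,000,000 KB for round numbers.

```java
final class FragmentationSketch {

    /** Returns the number of KB kept on disk that no live FlowFile references. */
    static long wastedKb(long[] claimSizesKb, boolean[] stillReferenced) {
        long totalKb = 0;
        long liveKb = 0;
        for (int i = 0; i < claimSizesKb.length; i++) {
            totalKb += claimSizesKb[i];
            if (stillReferenced[i]) {
                liveKb += claimSizesKb[i];
            }
        }
        return totalKb - liveKb;
    }

    public static void main(String[] args) {
        // The five claims from the text; only the 1 KB claim is still referenced.
        long[] sizesKb = {1, 2, 4, 3, 1_000_000};
        boolean[] live = {true, false, false, false, false};
        System.out.println("wasted KB: " + wastedKb(sizesKb, live));
    }
}
```

With these inputs, 1,000,009 KB of the 1,000,010 KB file is dead weight held in place by a single 1 KB claim.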
h3. Key Insight
With NiFi's slab allocation, there are three possible positions for a large
ContentClaim within a ResourceClaim:
{{<> = Small FlowFile}}
{{[................] = Large FlowFile}}
1. Beginning: [................]<><><><><><><>
2. Middle: <><><><>[................]<><><>
3. End: <><><><><><><><>[................]
NiFi already prevents cases 1 and 2. The nifi.content.claim.max.appendable.size
property (default: 50 KB) causes the repository to stop appending to a
ResourceClaim once it exceeds that threshold. Since a "large" ContentClaim is
by definition larger than this threshold, the act of writing it will push the
ResourceClaim past the (soft) limit, causing the ResourceClaim to be closed for
further appending. No additional ContentClaims can be written after the large
one.

This means a large ContentClaim can only ever appear at the tail of a
ResourceClaim. And truncating a file from the tail requires no data movement:
it is a single FileChannel.truncate() call.
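
The soft-limit behavior can be illustrated with a small simulation. This is only a sketch of the rule as described here, not NiFi's actual write path: the Slab class and write() method are hypothetical stand-ins, and the 50 KB constant simply mirrors the default quoted above.

```java
import java.util.ArrayList;
import java.util.List;

final class SlabAppendSketch {

    // Mirrors the default of nifi.content.claim.max.appendable.size quoted in the text.
    static final long MAX_APPENDABLE_BYTES = 50 * 1024;

    /** Simplified stand-in for a ResourceClaim: claim lengths written back to back. */
    static final class Slab {
        final List<Long> claimLengths = new ArrayList<>();
        long size;
        boolean closedForAppend;
    }

    /** Appends each claim to the current slab, opening a new slab once the current one is closed. */
    static List<Slab> write(long... claimLengths) {
        List<Slab> slabs = new ArrayList<>();
        Slab current = null;
        for (long length : claimLengths) {
            if (current == null || current.closedForAppend) {
                current = new Slab();
                slabs.add(current);
            }
            current.claimLengths.add(length);
            current.size += length;
            // Soft limit: checked only after the claim has been written in full.
            if (current.size > MAX_APPENDABLE_BYTES) {
                current.closedForAppend = true;
            }
        }
        return slabs;
    }

    public static void main(String[] args) {
        List<Slab> slabs = write(1024, 2048, 1024 * 1024, 3072);
        for (Slab slab : slabs) {
            System.out.println(slab.claimLengths + " closed=" + slab.closedForAppend);
        }
    }
}
```

In this model, any claim larger than the limit closes its slab as soon as it is written, so it is either the last claim in the slab or the only one. Nothing can follow it.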
h3. Solution
This change implements "partial defragmentation" by truncating ResourceClaim
files from the tail when the last (large) ContentClaim is removed. The approach
consists of several coordinated components:

1. Marking truncation candidates at write time: When a ContentClaim is closed
in FileSystemRepository, the repository checks whether it is both (a) large
(exceeding a threshold) and (b) at a non-zero offset (i.e., not the only claim
in the file). If both conditions hold, the claim is flagged as a truncation
candidate via StandardContentClaim.setTruncationCandidate(true). If the claim
is later cloned (claimant count incremented), the flag is cleared, since
truncation is only safe when the claim has a single owner.

2. Routing truncatable claims through the FlowFile Repository: When a FlowFile
is deleted or its content is replaced,
WriteAheadFlowFileRepository.updateContentClaims() checks whether the released
ContentClaim is a truncation candidate. If so (and the ResourceClaim itself is
not already fully destructable), the claim is queued in
claimsAwaitingTruncation. On the next WAL checkpoint or sync, these claims are
drained to ResourceClaimManager.markTruncatable().

3. Background truncation in FileSystemRepository: A scheduled TruncateClaims
task periodically drains truncatable claims from the ResourceClaimManager.
Before truncating, it checks whether truncation is active for the claim's
container (the archive must have been cleared on the last cleanup run, and disk
usage must exceed the configured threshold). If the conditions are met, the
file is truncated to the claim's offset via FileChannel.truncate(). If they are
not met, the claims are saved in a TruncationClaimManager and retried on
subsequent runs, ensuring no truncation opportunity is lost.

4. Recovery: On restart, WriteAheadFlowFileRepository.restoreFlowFiles()
re-derives truncation eligibility by scanning all recovered FlowFiles,
identifying large claims at non-zero offsets that are at the tail of their
ResourceClaim and are not shared by multiple FlowFiles.
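
The write-time marking rule and the clearing of the flag on clone can be sketched as follows. This is an illustrative model only: the Claim class is a hypothetical stand-in rather than NiFi's StandardContentClaim, and the 1 MB threshold is an assumption, since this issue does not state the actual value.

```java
final class TruncationCandidateSketch {

    // Assumed threshold for a "large" claim; the real value is not given in this issue.
    static final long LARGE_CLAIM_THRESHOLD_BYTES = 1024 * 1024;

    /** Simplified stand-in for a content claim inside a ResourceClaim file. */
    static final class Claim {
        final long offset;
        final long length;
        private int claimantCount = 1;
        private boolean truncationCandidate;

        Claim(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        boolean isTruncationCandidate() {
            return truncationCandidate;
        }
    }

    /** Flags the claim only if it is large and not the first claim in its file. */
    static void onClaimClosed(Claim claim) {
        boolean large = claim.length > LARGE_CLAIM_THRESHOLD_BYTES;
        boolean nonZeroOffset = claim.offset > 0;
        claim.truncationCandidate = large && nonZeroOffset;
    }

    /** A clone adds a second owner, so truncation is no longer safe. */
    static void onClaimCloned(Claim claim) {
        claim.claimantCount++;
        claim.truncationCandidate = false;
    }

    public static void main(String[] args) {
        Claim tail = new Claim(10 * 1024, 2L * 1024 * 1024);
        onClaimClosed(tail);
        System.out.println("after close: " + tail.isTruncationCandidate());
        onClaimCloned(tail);
        System.out.println("after clone: " + tail.isTruncationCandidate());
    }
}
```

A large claim at offset 0 is never flagged in this model: once it is released, no other claim references the file, so the whole file can be removed instead of truncated.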
h3. Example
Before truncation: a 1 GB FlowFile was removed, but the ResourceClaim persists
because small FlowFiles still reference it:

{{ResourceClaim file (1,000,010 KB on disk):}}
{{ [1 KB] [2 KB] [4 KB] [3 KB] [1,000,000 KB (removed)]}}

After truncation: the file is truncated at the offset where the large claim
began:

{{ResourceClaim file (10 KB on disk):}}
{{ [1 KB] [2 KB] [4 KB] [3 KB]}}

The small FlowFiles remain fully readable. The 1 GB of wasted space is
reclaimed instantly with a single syscall.
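
The truncation step itself can be tried on a temporary file. The sketch below is a standalone demonstration of FileChannel.truncate(), not the code from this change. The class and method names are hypothetical, and the large claim is scaled down to 1,000 KB to keep the temp file small.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class TailTruncationSketch {

    /** Truncates the file to {@code offset} bytes and returns its resulting size. */
    static long truncateAt(Path file, long offset) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            // Bytes before the offset are left untouched; everything after is discarded.
            channel.truncate(offset);
            return channel.size();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("resource-claim", ".bin");
        try {
            int smallBytes = (1 + 2 + 4 + 3) * 1024; // the four small claims
            int largeBytes = 1_000 * 1024;           // scaled-down stand-in for the 1 GB claim
            Files.write(file, new byte[smallBytes]);
            Files.write(file, new byte[largeBytes], StandardOpenOption.APPEND);
            System.out.println("before: " + Files.size(file) + " bytes");
            // The large claim begins where the small claims end.
            System.out.println("after:  " + truncateAt(file, smallBytes) + " bytes");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The truncation offset is the sum of the lengths of the preceding claims, which is the large claim's own offset within the file.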