[ https://issues.apache.org/jira/browse/OAK-12212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joerg Hoh updated OAK-12212:
----------------------------
Description:
h3. Problem
Page rendering on an AEM instance became very slow. Inspection of the metrics
showed that the element count in the SegmentDiskCache was consistently close to
0 (instead of the usual tens of thousands), and there was a very high rate of
evictions. Thread dumps show that the rendering requests were constantly
reaching out to the blobstore.
h3. Observation
A heap dump of a long-running instance shows:
* PersistentDiskCache.maxCacheSizeBytes ≈ 20 GiB (matches the configured value)
* AbstractPersistentCache.cacheSize (an AtomicLong, inherited) ≈ 80 GiB —
roughly 4× the configured maximum
The actual cache directory on disk stays at or below the configured limit; only
the in-memory counter has run away.
h3. Root cause
{{PersistentDiskCache.writeSegment(...)}} adds {{fileSize}} to the in-memory
{{cacheSize}} on every invocation that reaches the write body, but the
corresponding file on disk is replaced — not added — when the same segment id
is written more than once. The writesPending guard inside {{writeSegment}} only
prevents concurrently running tasks for the same id; it does not prevent
sequentially submitted tasks. On POSIX file systems, {{Files.move(...,
ATOMIC_MOVE)}} maps to rename(2) and silently replaces the destination, so the
second (and subsequent) writes leave the directory unchanged in size while
still incrementing the counter.
The eviction loop ({{cleanUpInternal}}) walks the directory and subtracts the
actual length of each deleted file once. The "phantom" bytes contributed by
redundant writes are therefore never repaid and accumulate monotonically over
the lifetime of the JVM.
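The accounting mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the Oak implementation: the class {{CounterDriftDemo}}, its simplified {{writeSegment}} and {{evict}} methods, and the helper {{phantomBytes}} are hypothetical stand-ins that only mirror the behaviour described in this issue (add the size on every write, subtract the actual file length once on eviction). It assumes a file system where {{ATOMIC_MOVE}} replaces an existing target, as on POSIX.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.atomic.AtomicLong;

public class CounterDriftDemo {

    // Hypothetical stand-in for the inherited cacheSize counter.
    private final AtomicLong cacheSize = new AtomicLong();
    private final Path dir;

    CounterDriftDemo(Path dir) {
        this.dir = dir;
    }

    // Simplified write path: write a temp file, rename it over the target, add its size.
    void writeSegment(String segmentId, byte[] data) throws IOException {
        Path tmp = Files.createTempFile(dir, segmentId, ".part");
        Files.write(tmp, data);
        // Assumption: the move replaces an existing target (rename(2) on POSIX).
        Files.move(tmp, dir.resolve(segmentId), StandardCopyOption.ATOMIC_MOVE);
        cacheSize.addAndGet(data.length); // added on every write, including rewrites
    }

    // Simplified eviction: delete the file and subtract its actual length once.
    void evict(String segmentId) throws IOException {
        Path file = dir.resolve(segmentId);
        long length = Files.size(file);
        Files.delete(file);
        cacheSize.addAndGet(-length);
    }

    // Writes one segment `writes` times, evicts it, and returns what is left in the counter.
    public static long phantomBytes(int writes, int segmentBytes) {
        try {
            Path dir = Files.createTempDirectory("cache-drift");
            CounterDriftDemo cache = new CounterDriftDemo(dir);
            for (int i = 0; i < writes; i++) {
                cache.writeSegment("segment-1", new byte[segmentBytes]);
            }
            cache.evict("segment-1");
            Files.delete(dir); // the directory is empty again at this point
            return cache.cacheSize.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(phantomBytes(1, 1024)); // single write
        System.out.println(phantomBytes(3, 1024)); // two redundant rewrites
    }
}
```

With a single write the counter returns to 0 after eviction. With three writes of a 1024-byte segment, the directory is empty again but 2048 bytes remain in the counter, one segment size per redundant write.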
In addition, two smaller contributing factors keep the drift unidirectional
(upward):
* cacheSize is initialized to 0 and is never reconciled against the existing
cache directory at startup; it relies entirely on incremental accounting being
correct.
* The error branch of {{writeSegment}} deletes segmentFile on any
{{Files.move}} failure but does not decrement the counter for whatever
contribution that file previously made.
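To illustrate what reconciling the counter against the cache directory could look like, here is a hedged sketch. {{CacheSizeReconciler}}, {{measure}}, {{reconcile}} and {{demo}} are hypothetical names, not part of Oak, and the sketch assumes a flat cache directory whose regular files are the cached segments. It shows one possible approach only and is not the fix chosen for this issue.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

public class CacheSizeReconciler {

    // Sums the lengths of all regular files directly inside the cache directory.
    public static long measure(Path cacheDir) {
        try (Stream<Path> entries = Files.list(cacheDir)) {
            return entries.filter(Files::isRegularFile)
                          .mapToLong(CacheSizeReconciler::sizeOrZero)
                          .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static long sizeOrZero(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            return 0L; // file disappeared between listing and sizing, e.g. it was evicted
        }
    }

    // Overwrites the in-memory counter with the measured on-disk size.
    public static void reconcile(AtomicLong cacheSize, Path cacheDir) {
        cacheSize.set(measure(cacheDir));
    }

    // Small self-contained usage example with a deliberately drifted counter.
    public static long demo() {
        try {
            Path dir = Files.createTempDirectory("cache-reconcile");
            Files.write(dir.resolve("segment-a"), new byte[10]);
            Files.write(dir.resolve("segment-b"), new byte[20]);
            AtomicLong counter = new AtomicLong(999); // drifted value
            reconcile(counter, dir);
            return counter.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch {{demo()}} returns 30, the sum of the two file lengths, regardless of the drifted starting value. Running such a reconciliation at startup would remove the dependence on a counter that starts at 0, but it does not by itself prevent drift from redundant writes during the lifetime of the JVM.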
h3. Triggering workloads
Any workload that produces multiple writes for the same segment id over time,
typically through concurrent cache misses on the same segment (e.g. during
compaction, online GC, indexing, mass traversal, standby replication, or warm-up
after restart). How often a workload produces such redundant writes determines
the rate at which the counter diverges; instances that run for weeks or months
will drift by tens of GiB regardless of how the workload looks at any given
moment.
was:
h2. Observation
A heap dump of a long-running instance shows:
* PersistentDiskCache.maxCacheSizeBytes ≈ 20 GiB (matches the configured value)
* AbstractPersistentCache.cacheSize (an AtomicLong, inherited) ≈ 80 GiB —
roughly 4× the configured maximum
The actual cache directory on disk stays at or below the configured limit; only
the in-memory counter has run away.
h2. Root cause
{{PersistentDiskCache.writeSegment(...)}} adds {{fileSize}} to the in-memory
{{cacheSize}} on every invocation that reaches the write body, but the
corresponding file on disk is replaced — not added — when the same segment id
is written more than once. The writesPending guard inside {{writeSegment}} only
prevents concurrently running tasks for the same id; it does not prevent
sequentially submitted tasks. On POSIX file systems, {{Files.move(...,
ATOMIC_MOVE)}} maps to rename(2) and silently replaces the destination, so the
second (and subsequent) writes leave the directory unchanged in size while
still incrementing the counter.
The eviction loop ({{cleanUpInternal}}) walks the directory and subtracts the
actual length of each deleted file once. The "phantom" bytes contributed by
redundant writes are therefore never repaid and accumulate monotonically over
the lifetime of the JVM.
In addition, two smaller contributing factors keep the drift unidirectional
(upward):
* cacheSize is initialized to 0 and is never reconciled against the existing
cache directory at startup; it relies entirely on incremental accounting being
correct.
* The error branch of {{writeSegment}} deletes segmentFile on any
{{Files.move}} failure but does not decrement the counter for whatever
contribution that file previously made.
Triggering workloads Any workload that produces multiple writes for the same
segment id over time: concurrent cache misses on the same segment (e.g.
compaction, online GC, indexing, mass traversal, standby replication, warm-up
after restart). The probability per workload determines the rate at which the
counter diverges — instances that run weeks/months will drift by tens of GiB
regardless of how the workload looks at any given moment.
> Drifts in PersistentDiskCache.cacheSize counter
> -----------------------------------------------
>
> Key: OAK-12212
> URL: https://issues.apache.org/jira/browse/OAK-12212
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: segment-azure
> Affects Versions: 2.0.0
> Reporter: Joerg Hoh
> Assignee: Joerg Hoh
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)