[ 
https://issues.apache.org/jira/browse/OAK-12212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Hoh updated OAK-12212:
----------------------------
    Description: 

h3. Problem
Page rendering on an AEM instance became very slow. The metrics showed that the 
element count in the SegmentDiskCache was consistently close to 0 (instead of 
the usual tens of thousands), and there was a very high rate of evictions. 
Thread dumps show that these requests were constantly reaching out to the 
blobstore.

h3. Observation
A heap dump of a long-running instance shows:

* {{PersistentDiskCache.maxCacheSizeBytes}} ≈ 20 GiB (matches the configured 
value)
* {{AbstractPersistentCache.cacheSize}} (an {{AtomicLong}}, inherited) ≈ 80 GiB, 
roughly 4× the configured maximum

The actual cache directory on disk stays at or below the configured limit; only 
the in-memory counter has run away.


h3. Root cause
{{PersistentDiskCache.writeSegment(...)}} adds {{fileSize}} to the in-memory 
{{cacheSize}} on every invocation that reaches the write body, but the 
corresponding file on disk is replaced — not added — when the same segment id 
is written more than once. The writesPending guard inside {{writeSegment}} only 
prevents concurrently running tasks for the same id; it does not prevent 
sequentially submitted tasks. On POSIX file systems, {{Files.move(..., 
ATOMIC_MOVE)}} maps to rename(2) and silently replaces the destination, so the 
second (and subsequent) writes leave the directory unchanged in size while 
still incrementing the counter.
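
The effect described above can be reproduced outside Oak with a minimal sketch. It is not the Oak code: the class name, the segment file name and the {{run}} helper are made up for illustration, and the counter is a plain {{AtomicLong}} standing in for {{cacheSize}}. It assumes a file system where {{ATOMIC_MOVE}} replaces an existing target, as described for POSIX.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

// Illustrative only: not the Oak implementation.
final class RenameDriftDemo {

    // Writes the same (hypothetical) segment id `writes` times and returns
    // { counter value, bytes actually present in the directory }.
    static long[] run(int writes, int segmentBytes) {
        try {
            Path dir = Files.createTempDirectory("drift-demo");
            AtomicLong cacheSize = new AtomicLong(); // stand-in for the inherited counter
            Path target = dir.resolve("segment-0001");
            for (int i = 0; i < writes; i++) {
                Path tmp = Files.createTempFile(dir, "segment-0001", ".part");
                Files.write(tmp, new byte[segmentBytes]);
                // Assumption: as on POSIX, an existing target is silently replaced.
                Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
                cacheSize.addAndGet(segmentBytes); // incremented on every write
            }
            try (Stream<Path> files = Files.list(dir)) {
                long onDisk = files.mapToLong(p -> p.toFile().length()).sum();
                return new long[] { cacheSize.get(), onDisk };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = run(3, 1024);
        System.out.println("counter=" + r[0] + " onDisk=" + r[1]);
    }
}
```

With three writes of 1024 bytes to the same id, the directory should hold a single 1024-byte file while the counter has advanced by three times that amount.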

The eviction loop ({{cleanUpInternal}}) walks the directory and subtracts the 
actual length of each deleted file once. The "phantom" bytes contributed by 
redundant writes are therefore never repaid and accumulate monotonically over 
the lifetime of the JVM.
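
The accounting can also be modelled purely in memory. The sketch below is a simplified model, not {{cleanUpInternal}} itself: the class and method names are hypothetical, and the eviction policy (delete the oldest entries until the counter is at or below the limit, or the directory is empty) is an assumption made for illustration.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of the counter accounting; not the Oak implementation.
final class PhantomBytesModel {

    // segment id -> length of the file currently on disk
    private final Map<String, Long> directory = new LinkedHashMap<>();
    private long counter; // models cacheSize

    void write(String id, long size) {
        directory.put(id, size); // same id: the file is replaced, not added
        counter += size;         // but the counter grows on every write
    }

    // Assumed policy: evict oldest entries while the counter exceeds the limit.
    void evict(long maxBytes) {
        Iterator<Map.Entry<String, Long>> it = directory.entrySet().iterator();
        while (counter > maxBytes && it.hasNext()) {
            counter -= it.next().getValue(); // actual file length, subtracted once
            it.remove();
        }
    }

    long counter() { return counter; }

    int entries() { return directory.size(); }

    long bytesOnDisk() {
        return directory.values().stream().mapToLong(Long::longValue).sum();
    }
}
```

In this model, four writes of 10 bytes to one id plus one write to a second id leave 20 bytes on disk and a counter of 50. Evicting with a limit of 15 deletes both files and still leaves the counter at 30. Once the residue alone exceeds the limit, every pass empties the directory, which would be consistent with the near-zero element count and the high eviction rate reported above.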

In addition, two smaller contributing factors keep the drift unidirectional 
(upward):

* cacheSize is initialized to 0 and is never reconciled against the existing 
cache directory at startup; it relies entirely on incremental accounting being 
correct.
* The error branch of {{writeSegment}} deletes segmentFile on any 
{{Files.move}} failure but does not decrement the counter for whatever 
contribution that file previously made.
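
For the first point, reconciling the counter against the directory could look roughly like the sketch below. This is one possible approach and not a proposed patch: {{CacheSizeReconciler}}, {{measure}}, {{reconcile}} and {{demo}} are hypothetical names, and the sketch assumes a flat cache directory containing only segment files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

// Hypothetical helper; not part of Oak.
final class CacheSizeReconciler {

    // Sums the length of all regular files directly inside cacheDir.
    static long measure(Path cacheDir) throws IOException {
        try (Stream<Path> files = Files.list(cacheDir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    // Overwrites the counter with what is actually on disk.
    static void reconcile(AtomicLong counter, Path cacheDir) throws IOException {
        counter.set(measure(cacheDir));
    }

    // Small self-contained demonstration: two files, a wildly wrong counter.
    static long demo() {
        try {
            Path dir = Files.createTempDirectory("reconcile-demo");
            Files.write(dir.resolve("segment-a"), new byte[100]);
            Files.write(dir.resolve("segment-b"), new byte[50]);
            AtomicLong counter = new AtomicLong(999_999L); // drifted value
            reconcile(counter, dir);
            return counter.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Running such a reconciliation at startup, or periodically, would bound the drift to whatever accumulates between two runs; it does not remove the double counting in the write path.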

h3. Triggering workloads
Any workload that produces multiple writes for the same segment id over time, 
for example concurrent cache misses on the same segment during compaction, 
online GC, indexing, mass traversal, standby replication, or warm-up after a 
restart. How often a workload produces such repeated writes determines the rate 
at which the counter diverges. Instances that run for weeks or months will 
drift by tens of GiB regardless of how the workload looks at any given moment.



  was:
h2. Observation
A heap dump of a long-running instance shows:

* PersistentDiskCache.maxCacheSizeBytes ≈ 20 GiB (matches the configured value)
* AbstractPersistentCache.cacheSize (an AtomicLong, inherited) ≈ 80 GiB — 
roughly 4× the configured maximum
The actual cache directory on disk stays at or below the configured limit; only 
the in-memory counter has run away.


h2. Root cause
{{PersistentDiskCache.writeSegment(...)}} adds {{fileSize}} to the in-memory 
{{cacheSize}} on every invocation that reaches the write body, but the 
corresponding file on disk is replaced — not added — when the same segment id 
is written more than once. The writesPending guard inside {{writeSegment}} only 
prevents concurrently running tasks for the same id; it does not prevent 
sequentially submitted tasks. On POSIX file systems, {{Files.move(..., 
ATOMIC_MOVE)}} maps to rename(2) and silently replaces the destination, so the 
second (and subsequent) writes leave the directory unchanged in size while 
still incrementing the counter.

The eviction loop ({{cleanUpInternal}}) walks the directory and subtracts the 
actual length of each deleted file once. The "phantom" bytes contributed by 
redundant writes are therefore never repaid and accumulate monotonically over 
the lifetime of the JVM.

In addition, two smaller contributing factors keep the drift unidirectional 
(upward):

* cacheSize is initialized to 0 and is never reconciled against the existing 
cache directory at startup; it relies entirely on incremental accounting being 
correct.
* The error branch of {{writeSegment}} deletes segmentFile on any 
{{Files.move}} failure but does not decrement the counter for whatever 
contribution that file previously made.
Triggering workloads Any workload that produces multiple writes for the same 
segment id over time: concurrent cache misses on the same segment (e.g. 
compaction, online GC, indexing, mass traversal, standby replication, warm-up 
after restart). The probability per workload determines the rate at which the 
counter diverges — instances that run weeks/months will drift by tens of GiB 
regardless of how the workload looks at any given moment.




> Drifts in PersistentDiskCache.cacheSize counter
> -----------------------------------------------
>
>                 Key: OAK-12212
>                 URL: https://issues.apache.org/jira/browse/OAK-12212
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: segment-azure
>    Affects Versions: 2.0.0
>            Reporter: Joerg Hoh
>            Assignee: Joerg Hoh
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
