Scott Yuan created OAK-11817:
--------------------------------
Summary: Add configurable strict blob verification to
RemoteBlobProcessor to prevent missing blob files in Cold Standby
Key: OAK-11817
URL: https://issues.apache.org/jira/browse/OAK-11817
Project: Jackrabbit Oak
Issue Type: Bug
Components: segment-tar
Affects Versions: 1.22.22
Environment: RHEL 9 + JDK 11 + Apache Sling 1.22
Reporter: Scott Yuan
In a properly configured TarMK cold standby setup (e.g., Adobe Experience Manager
6.5.23) that uses Apache Jackrabbit Oak segment-tar with an external BlobStore,
the standby occasionally ends up with {*}missing blobs under heavy load{*},
seemingly at random. After investigating the Apache Jackrabbit Oak source
code, it appears that the current logic in _RemoteBlobProcessor.java_ assumes
that if a _SegmentBlob_ has a _blobId_ and _blob.getReference()_ is not null,
then the blob is physically present and readable.
However, in real-world scenarios, especially with eventual or out-of-band blob
synchronization (e.g., via rsync), this assumption can be incorrect. The blob
may be:
* Not yet copied
* Deleted by GC
* Corrupted or unreadable
This leads to runtime errors when the standby node tries to read missing blobs
that were assumed present.
+*Proposal:*+
Introduce a new *OSGi configuration property* in _StandbyStoreService_ called
_strictBlobVerify_. When enabled:
* {{RemoteBlobProcessor}} will *attempt to open and read a few bytes* from the
blob to verify it is {*}physically present and readable{*}.
* If the check fails, the blob is {*}re-fetched from the primary node{*}.
This adds a safeguard against false positives from a check that relies only on
the reference being present, and makes Cold Standby more robust in environments
with non-instantaneous blob synchronization. It also lets administrators toggle
strict blob verification depending on their setup (e.g. dev vs
production).
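A minimal sketch of what "open and read a few bytes" could look like, kept independent of Oak so it compiles on its own. The class _BlobReadProbe_, the interface _StreamOpener_ and the _probeBytes_ parameter are hypothetical names introduced here for illustration; in the real patch the stream would presumably come from the blob itself.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper, not part of Oak: checks that blob content can actually be read. */
final class BlobReadProbe {

    /** Hypothetical stand-in for whatever opens the blob's content stream. */
    interface StreamOpener {
        InputStream open() throws IOException;
    }

    private BlobReadProbe() {
    }

    /**
     * Returns true if the stream opens and up to probeBytes bytes can be read
     * without an exception. A clean end-of-stream (short or empty blob) still
     * counts as readable; a null stream or any exception does not.
     */
    static boolean isReadable(StreamOpener opener, int probeBytes) {
        try (InputStream in = opener.open()) {
            if (in == null) {
                return false;
            }
            byte[] buf = new byte[Math.max(1, probeBytes)];
            int total = 0;
            while (total < buf.length) {
                int n = in.read(buf, total, buf.length - total);
                if (n < 0) {
                    break; // clean end-of-stream before the probe size was reached
                }
                total += n;
            }
            return true;
        } catch (IOException | RuntimeException e) {
            // Missing, deleted or corrupted content is expected to surface here.
            return false;
        }
    }
}
```

Reading only a small prefix keeps the cost of the check bounded, at the price of not detecting corruption further into the blob.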
+*Implementation Plan:*+
# Add _strictBlobVerify_ to _StandbyStoreServiceConfiguration_
# Read this flag in _StandbyStoreService_
# Pass it to _RemoteBlobProcessor_
# In {_}RemoteBlobProcessor.shouldFetchBinary(){_}, verify that the reference is
readable when _strictBlobVerify_ is enabled.
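The decision in the last step could be sketched roughly as follows. This is not the actual _RemoteBlobProcessor_ code: the class _FetchDecision_, its method signature and the _readableProbe_ callback are hypothetical, and treating a null _blobId_ as "nothing to fetch" is an assumption based on the description above.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the fetch decision; names are not taken from the Oak code base. */
final class FetchDecision {

    private FetchDecision() {
    }

    /**
     * @param blobId           blob identifier, or null if there is no external blob (assumption)
     * @param reference        blob reference, or null if the data store does not know the blob
     * @param strictBlobVerify the proposed configuration flag
     * @param readableProbe    called only in strict mode; true if the content could be read
     * @return true if the blob should be fetched from the primary
     */
    static boolean shouldFetch(String blobId, String reference,
                               boolean strictBlobVerify, BooleanSupplier readableProbe) {
        if (blobId == null) {
            return false; // assumed: nothing to fetch without a blobId
        }
        if (reference == null) {
            return true; // reference missing, so the blob has to be fetched
        }
        if (!strictBlobVerify) {
            return false; // existing behaviour: a non-null reference is trusted
        }
        // Strict mode: fetch again if the content cannot actually be read.
        return !readableProbe.getAsBoolean();
    }
}
```

With the flag disabled the probe is never invoked, so the behaviour described for existing installations stays unchanged.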
+*Benefits:*+
* Improves reliability of Cold Standby in environments with delayed or
out-of-band blob sync
* Prevents silent corruption or missing blobs
* Configurable to preserve existing behavior for users who don’t need it
Example Configuration:
{noformat}
# org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.cfg
strictBlobVerify=true{noformat}