[
https://issues.apache.org/jira/browse/OAK-11817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Scott Yuan updated OAK-11817:
-----------------------------
Description:
In a Jackrabbit Oak based solution using a properly configured segment-tar
(TarMK) cold standby with an external BlobStore, the cold standby occasionally
ends up with {*}randomly missing blobs under heavy load{*}. After investigating
the Apache Jackrabbit Oak source code, it appears that the current logic in
_RemoteBlobProcessor.java_ assumes that if a _SegmentBlob_ has a _blobId_ and
_blob.getReference()_ is not null, then the blob is physically present and
readable.
However, in real-world scenarios, especially with eventual or out-of-band blob
synchronization (e.g., via rsync), this assumption can be incorrect. The blob
may be:
* Not yet copied
* Deleted by GC
* Corrupted or unreadable
This leads to runtime errors when the standby node tries to read missing blobs
that were assumed present.
+*Proposal:*+
Introduce a new *OSGi configuration property* in _StandbyStoreService_ called
_verifyBlobFileOnSync_. When enabled:
* {{RemoteBlobProcessor}} will *attempt to open and read a few bytes* from the
blob to verify it is {*}physically present and readable{*}.
* If the check fails, the blob is {*}re-fetched from the primary node{*}.
This adds a safeguard against false positives from a check that only tests
whether a reference exists, and makes Cold Standby more robust in environments
with non-instantaneous blob synchronization. It also lets administrators toggle
strict blob verification depending on their setup (e.g. dev vs production).
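A minimal sketch of the "open and read a few bytes" check described above. The class _BlobReadCheck_ and its method are hypothetical names used only for illustration and are not part of Oak; the stream opener is passed in as a _Callable_ so the example stays self-contained. In Oak the opener would presumably be something like _blob::getNewStream_, but that wiring is an assumption here.

```java
import java.io.InputStream;
import java.util.concurrent.Callable;

/** Hypothetical helper (not an Oak class): probes whether blob content can actually be read. */
final class BlobReadCheck {

    private BlobReadCheck() {
    }

    /**
     * Opens the stream and attempts to read a single byte.
     * Returns true if the stream could be opened and the read completed
     * (including a clean end-of-stream for an empty blob), false if
     * opening or reading failed for any reason.
     */
    static boolean isReadable(Callable<InputStream> opener) {
        try (InputStream in = opener.call()) {
            if (in == null) {
                return false; // nothing to read from
            }
            in.read(); // forces real I/O; result is -1 for an empty stream
            return true;
        } catch (Exception e) {
            // Missing file, permission problem, corrupted store, etc.
            return false;
        }
    }
}
```

Reading one byte is enough to force the underlying store to locate and open the content; a stricter variant could read more bytes or compare lengths, at a higher cost per blob.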
+*Implementation Plan:*+
# Add _verifyBlobFileOnSync_ to _StandbyStoreServiceConfiguration_
# Read this flag in _StandbyStoreService_
# Pass it to _RemoteBlobProcessor_
# In {_}RemoteBlobProcessor.shouldFetchBinary(){_}, verify that the referenced
blob is readable when _verifyBlobFileOnSync_ is enabled.
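The decision in step 4 could look roughly like the sketch below. It is not the actual _RemoteBlobProcessor_ code: _FetchDecision_, its method signature, and the _readable_ probe are hypothetical, and the "fetch when the reference is null" branch reflects the behavior described in this issue rather than a reading of the current source.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical illustration of the proposed decision logic (not Oak code). */
final class FetchDecision {

    private FetchDecision() {
    }

    /**
     * @param reference            the blob reference, or null if none is available
     * @param verifyBlobFileOnSync the proposed configuration flag
     * @param readable             probe that tries to read the blob locally
     * @return true if the blob should be fetched from the primary
     */
    static boolean shouldFetch(String reference,
                               boolean verifyBlobFileOnSync,
                               BooleanSupplier readable) {
        if (reference == null) {
            return true; // no reference: blob is not assumed to be present
        }
        if (!verifyBlobFileOnSync) {
            return false; // flag off: keep the existing reference-only check
        }
        // Flag on: only trust the reference if the content can be read.
        return !readable.getAsBoolean();
    }
}
```

With the flag disabled, the probe is never invoked, so the existing behavior and its cost are unchanged.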
+*Benefits:*+
* Improves reliability of Cold Standby in environments with delayed or
out-of-band blob sync
* Prevents silent corruption or missing blobs
* Configurable to preserve existing behavior for users who don’t need it
+*Example Configuration:*+
{noformat}
# org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.cfg
verifyBlobFileOnSync=true
{noformat}
Please feel free to assign this to me, as I will be providing a PR shortly.
was:
With a properly configured TarMK cold standby with Jackrabbit Oak based
solution utilizing Apache Jackrabbit Oak segment-tar cold standby with an
external BlobStore, the cold standby occasionally creates {*}random missing
blobs under heavy load{*}. After investigating the Apache Jackrabbit Oak source
code, it appears that the current logic in _RemoteBlobProcessor.java_ assumes
that if a _SegmentBlob_ has a _blobId_ and _blob.getReference()_ is not null,
then the blob is physically present and readable.
However, in real-world scenarios, especially with eventual or out-of-band blob
synchronization (e.g., via rsync), this assumption can be incorrect. The blob
may be:
* Not yet copied
* Deleted by GC
* Corrupted or unreadable
This leads to runtime errors when the standby node tries to read missing blobs
that were assumed present.
+*Proposal:*+
Introduce a new *OSGi configuration property* in _StandbyStoreService_ called
_strictBlobVerify_ When enabled:
* {{RemoteBlobProcessor}} will *attempt to open and read a few bytes* from the
blob to verify it is {*}physically present and readable{*}.
* If the check fails, the blob is {*}re-fetched from the primary node{*}.
This adds a safeguard against false positives from the reference existing check
only approach and ensures Cold Standby is more robust in environments with
non-instantaneous blob synchronization. It allows administrators to toggle
strict blob verification behavior depending on their setup (e.g. dev vs
production).
+*Implementation Plan:*+
# Add _strictBlobVerify_ to _StandbyStoreServiceConfiguration_
# Read this flag in _StandbyStoreService_
# Pass it to _RemoteBlobProcessor_
# In {_}RemoteBlobProcessor.shouldFetchBinary(){_}, verify if reference
readable if _strictBlobVerify_ has been specified.
+*Benefits:*+
* Improves reliability of Cold Standby in environments with delayed or
out-of-band blob sync
* Prevents silent corruption or missing blobs
* Configurable to preserve existing behavior for users who don’t need it
Example Configuration:
{noformat}
#
org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.cfg
strictBlobVerify=true{noformat}
Please feel free to assign this to me as I would be providing a PR shortly.
> Add configurable strict blob verification to RemoteBlobProcessor to prevent
> missing blob files in Cold Standby
> --------------------------------------------------------------------------------------------------------------
>
> Key: OAK-11817
> URL: https://issues.apache.org/jira/browse/OAK-11817
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: segment-tar
> Affects Versions: 1.22.22
> Environment: RHEL 9 + JDK 11+ Apache Sling 1.22
> Reporter: Scott Yuan
> Priority: Major