[
https://issues.apache.org/jira/browse/OAK-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180374#comment-16180374
]
Francesco Mari commented on OAK-6678:
-------------------------------------
After looking at the patch, I found myself wondering if we still need the retry
logic at all, both when reading the segment and when reading the persisted head
state.
About the head state, if a persisted head state doesn't yet exist on the
primary, the standby instance should not interpret this as an error condition.
Instead, the standby should handle the situation by gracefully aborting the
current synchronization. The standby will try again at the next
synchronization. This implies that a "null" head state response should be sent
from the primary to the standby in case a persisted head state doesn't exist on
the primary.
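To make the idea concrete, here is a rough sketch of how a missing persisted head could be treated as "nothing to sync yet" rather than as a failure. All names here (HeadSyncSketch, readPersistedHead, runSyncCycle, SyncResult) are made up for illustration and are not the actual Oak classes; the head is modelled as a plain String.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names belong to the real Oak API.
public class HeadSyncSketch {

    /** Outcome of one synchronization cycle on the standby. */
    enum SyncResult { SYNCED, SKIPPED_NO_PERSISTED_HEAD }

    /** Primary side: wrap a possibly missing persisted head instead of dereferencing it. */
    static Optional<String> readPersistedHead(Supplier<String> persistedHeadSource) {
        return Optional.ofNullable(persistedHeadSource.get());
    }

    /** Standby side: an empty head is not an error; abort this cycle and wait for the next one. */
    static SyncResult runSyncCycle(Optional<String> headFromPrimary) {
        if (!headFromPrimary.isPresent()) {
            return SyncResult.SKIPPED_NO_PERSISTED_HEAD;
        }
        // A real standby would now fetch the segments reachable from this head.
        return SyncResult.SYNCED;
    }

    public static void main(String[] args) {
        // No persisted head yet on the primary: the cycle is skipped.
        System.out.println(runSyncCycle(readPersistedHead(() -> null)));
        // A persisted head exists: the cycle proceeds.
        System.out.println(runSyncCycle(readPersistedHead(() -> "head-record-id")));
    }
}
```

With something along these lines the primary never dereferences a missing head, and the standby simply retries at its next scheduled synchronization.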
About reading segments, the retry logic should not be necessary if we guarantee
that the head state is always a persisted one. If the head state is persisted,
then it lives in a persisted segment, which can only reference other persisted
segments. Of course the standby should still handle connection timeouts, but
this is a different concern. The primary should not need to read a segment with
a retry if the synchronization starts from a persisted head state.
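The argument above relies on an invariant: everything reachable from a persisted head is itself persisted. As a toy illustration only, the check below walks a reference graph and verifies that property. Segment ids are plain Strings, and the class name, the references map and the persisted set are all assumptions for the example, not Oak data structures.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a toy model of segments and their references, not Oak code.
public class PersistedClosureSketch {

    /** Returns true if every segment reachable from the head is in the persisted set. */
    static boolean reachableOnlyPersisted(String head,
                                          Map<String, List<String>> references,
                                          Set<String> persisted) {
        Deque<String> pending = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        pending.push(head);
        while (!pending.isEmpty()) {
            String id = pending.pop();
            if (!seen.add(id)) {
                continue; // already visited
            }
            if (!persisted.contains(id)) {
                return false; // reading this segment could fail and would need a retry
            }
            for (String ref : references.getOrDefault(id, Collections.<String>emptyList())) {
                pending.push(ref);
            }
        }
        return true;
    }
}
```

If this property holds for every head the primary hands out, a segment read triggered by the standby should always find its segment, which is why the retry looks unnecessary to me.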
[~dulceanu], what do you think about this? Am I missing something?
> Syncing big blobs fails since StandbyServer sends persisted head
> ----------------------------------------------------------------
>
> Key: OAK-6678
> URL: https://issues.apache.org/jira/browse/OAK-6678
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: segment-tar, tarmk-standby
> Reporter: Andrei Dulceanu
> Assignee: Andrei Dulceanu
> Labels: cold-standby, resilience
> Fix For: 1.8, 1.7.9
>
> Attachments: OAK-6678.patch
>
>
> With changes for OAK-6653 in place,
> {{ExternalPrivateStoreIT#testSyncBigBlob}} and sometimes
> {{ExternalSharedStoreIT#testSyncBigBlob}} are failing on CI:
> {noformat}
> org.apache.jackrabbit.oak.segment.standby.ExternalSharedStoreIT
> testSyncBigBlob(org.apache.jackrabbit.oak.segment.standby.ExternalSharedStoreIT)
> Time elapsed: 96.82 sec <<< FAILURE!
> java.lang.AssertionError: expected:<{ root = { ... } }> but was:<{ root : { }
> }>
> ...
> testSyncBigBlob(org.apache.jackrabbit.oak.segment.standby.ExternalPrivateStoreIT)
> Time elapsed: 95.254 sec <<< FAILURE!
> java.lang.AssertionError: expected:<{ root = { ... } }> but was:<{ root : { }
> }>
> {noformat}
> Partial stacktrace:
> {noformat}
> 14:09:08.355 DEBUG [main] StandbyServer.java:242 Binding was
> successful
> 14:09:08.358 DEBUG [standby-1] GetHeadRequestEncoder.java:33 Sending request
> from client Bar for current head
> 14:09:08.359 DEBUG [primary-1] ClientFilterHandler.java:53 Client
> /127.0.0.1:52988 is allowed
> 14:09:08.360 DEBUG [primary-1] RequestDecoder.java:42 Parsed 'get head'
> message
> 14:09:08.360 DEBUG [primary-1] CommunicationObserver.java:79 Message 'get
> head' received from client Bar
> 14:09:08.362 DEBUG [primary-1] GetHeadRequestHandler.java:43 Reading head for
> client Bar
> 14:09:08.363 WARN [primary-1] ExceptionHandler.java:31 Exception caught
> on the server
> java.lang.NullPointerException: null
> at
> org.apache.jackrabbit.oak.segment.standby.server.DefaultStandbyHeadReader.readHeadRecordId(DefaultStandbyHeadReader.java:32)
> ~[oak-segment-tar-1.8-SNAPSHOT.jar:1.8-SNAPSHOT]
> at
> org.apache.jackrabbit.oak.segment.standby.server.GetHeadRequestHandler.channelRead0(GetHeadRequestHandler.java:45)
> ~[oak-segment-tar-1.8-SNAPSHOT.jar:1.8-SNAPSHOT]
> {noformat}