[ https://issues.apache.org/jira/browse/OAK-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrei Dulceanu updated OAK-6659:
---------------------------------
    Attachment: OAK-6659.patch

I changed the code in {{StandbyDiff}} a bit, hopefully for the better :-), so that all binary properties are loaded from the primary before the {{#propertyXXX}} methods are called. This seems to work quite nicely (no unit/integration test failures).

I also added a new test case in {{ExternalPrivateStoreIT}} to verify the "fail loudly" improvement. Running the test with the old version of {{StandbyDiff}} clearly shows the two 'sync blob' cycles mentioned in the description. The blob sync even succeeds, because the two consecutive 'get blob' requests effectively double the timeout. This behaviour is wrong and is corrected by the changes in {{StandbyDiff}}. Running the test after the full patch is applied correctly shows that only one 'sync blob' cycle happens, which, as expected, fails due to the short timeout used.

As a next step, I was thinking of refactoring {{StandbyDiff}} further to remove the (now) unneeded {{logOnly}} property.

[~frm], could you take a look at the patch and share your opinion on the proposal above?

> Cold standby should fail loudly when a big blob can't be timely transferred
> ---------------------------------------------------------------------------
>
>                 Key: OAK-6659
>                 URL: https://issues.apache.org/jira/browse/OAK-6659
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: segment-tar, tarmk-standby
>    Affects Versions: 1.7.6
>            Reporter: Andrei Dulceanu
>            Assignee: Andrei Dulceanu
>            Priority: Critical
>              Labels: cold-standby
>             Fix For: 1.7.8
>
>         Attachments: OAK-6659.patch
>
>
> Due to changes made in OAK-4969, currently there are two 'sync blob' cycles
> triggered by {{StandbyDiff#childNodeChanged}}.
> The test scenario is the same
> as the one in {{DataStoreTestBase#testSyncBigBlob}}: on the primary file
> store, a new big blob (1GB) is added and then a standby sync is triggered to
> sync this content to the secondary file store.
> The first 'sync blob' cycle happens as a result of {{#process}} being called
> in {{StandbyDiff#childNodeChanged}}. As a result, a new 'get blob' request is
> created on the client and the server starts sending chunks from the big blob.
> Now, if the time needed for transferring the entire blob from server to
> client exceeds {{readTimeoutMs}}, an {{IllegalStateException}} will be
> correctly thrown by {{StandbyDiff#readBlob}}, but will be swallowed by the
> catch clause in {{StandbyDiff#childNodeChanged}}. A second 'sync blob'
> cycle will be triggered, and this might sometimes succeed with the same
> {{readTimeoutMs}} for which it was failing before.
> The consequence of these two 'sync blob' cycles is that deleting the
> temporary file to which chunks are spooled on the client sometimes fails (for
> example on Windows; see OAK-6641 specifically). This way, instead of deleting
> the previous incomplete transfer, new chunks from the second 'sync blob'
> cycle are added. The blob persisted in the blob store on the client won't
> have the same size and id as the initial blob sent by the server.
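
For illustration only, the difference between swallowing the timeout and failing loudly can be sketched in a few lines of Java. This is not the actual {{StandbyDiff}} code: {{SyncBlobSketch}}, {{BlobFetcher}}, {{timesOutOnce}}, {{syncSwallowing}} and {{syncFailingLoudly}} are hypothetical names, and the "transfer" is simulated by a counter whose first request always times out.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SyncBlobSketch {

    /** Hypothetical stand-in for one 'get blob' request/response cycle. */
    interface BlobFetcher {
        void fetch(); // throws IllegalStateException on a (simulated) read timeout
    }

    /** Simulated transfer (assumption): the first request times out, later ones succeed. */
    static BlobFetcher timesOutOnce(AtomicInteger requests) {
        return () -> {
            if (requests.incrementAndGet() == 1) {
                throw new IllegalStateException("read timeout");
            }
        };
    }

    /** Simplified old behaviour: the timeout is swallowed and a second cycle starts. */
    static void syncSwallowing(BlobFetcher fetcher) {
        try {
            fetcher.fetch(); // first 'sync blob' cycle
            return;
        } catch (IllegalStateException e) {
            // Swallowed: nothing is reported to the caller.
        }
        fetcher.fetch(); // second 'sync blob' cycle, may succeed
    }

    /** Fail-loudly behaviour: a single cycle, the timeout propagates to the caller. */
    static void syncFailingLoudly(BlobFetcher fetcher) {
        fetcher.fetch();
    }

    public static void main(String[] args) {
        AtomicInteger oldRequests = new AtomicInteger();
        syncSwallowing(timesOutOnce(oldRequests));
        System.out.println("swallowing: requests=" + oldRequests.get());

        AtomicInteger newRequests = new AtomicInteger();
        try {
            syncFailingLoudly(timesOutOnce(newRequests));
        } catch (IllegalStateException e) {
            System.out.println("fail loudly: requests=" + newRequests.get()
                    + ", error=" + e.getMessage());
        }
    }
}
```

In this sketch the swallowing variant issues two requests and reports success, while the fail-loudly variant issues one request and surfaces the timeout to its caller.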