Hi, Is there anyone who can provide guidance on a cold standby issue?

It appears that my cold standby has made progress since my original email (below), but now I'm seeing a behavior that I didn't expect. The tarmk.log (below) for my cold standby shows that the head is being updated every 5-10 hours, with the longer intervals corresponding to 3pm to midnight local time. The primary's JMX standby metrics show that the number of transferred segments is not changing, while the number of transferred binaries continues to increase.

I did some other checks on the read caches and the shared S3 bucket, and it looks like read cache contents are continually being transferred from the primary to the secondary, over and over again. Is this what I should see? If not, how do I troubleshoot this?

Thanks!

John Logan

tarmk.log:

2016-11-05 23:30:05,045 sending head request
2016-11-05 23:30:05,045 did send head request
2016-11-05 23:30:05,081 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-06 10:09:25,044 sending head request
2016-11-06 10:09:25,044 did send head request
2016-11-06 10:09:25,092 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-06 15:07:50,045 sending head request
2016-11-06 15:07:50,045 did send head request
2016-11-06 15:07:50,082 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-06 20:10:20,044 sending head request
2016-11-06 20:10:20,045 did send head request
2016-11-06 20:10:20,120 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-07 06:26:50,044 sending head request
2016-11-07 06:26:50,044 did send head request
2016-11-07 06:26:50,085 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-07 11:23:15,045 sending head request
2016-11-07 11:23:15,045 did send head request
2016-11-07 11:23:15,140 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-07 17:07:00,045 sending head request
2016-11-07 17:07:00,045 did send head request
2016-11-07 17:07:00,141 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-07 22:01:40,052 sending head request
2016-11-07 22:01:40,052 did send head request
2016-11-07 22:01:40,099 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-08 08:26:35,044 sending head request
2016-11-08 08:26:35,044 did send head request
2016-11-08 08:26:35,183 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-08 14:05:50,046 sending head request
2016-11-08 14:05:50,046 did send head request
2016-11-08 14:05:50,159 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09

On Wed, 2016-11-02 at 17:28 +0000, John Logan wrote:
> Hi,
>
> I'm setting up a TarMK cold standby for a repository for the first time, and
> have a couple of questions regarding synchronization and administration.
> I've included the configuration and current dump of the primary and standby
> MBeans below. The primary and standby are in peered VPCs in AWS, using a
> shared S3 bucket for blob storage.
>
> 1.) I'm curious as to how long I should expect to wait for the standby
> to establish synchronization. How much data gets moved over the wire?
> I'm seeing a steady stream of read cache invalidations on the standby -
> does this mean that all of the blob data must be transferred, even
> though the two repositories use shared storage?
>
> 2.) I see in the logs a period where there are read cache invalidations,
> and then there is a 12 hour period where nothing is logged, followed
> by a
> "org.apache.jackrabbit.oak.plugins.segment.standby.client.SegmentLoaderHandler
> timeout"
> message. The quiet period is consistent with my setting
> standby.readtimeout=I"43200000". Would it make sense to choose a
> shorter timeout to lessen the impact of occasional network issues?
> At what point might the timeout value be "too short"?
>
> 3.) Is there a definitive way to know that the standby is synced?
> The SyncEndTimestamp value below corresponds to 2016-11-02T09:26:18+00:00,
> which corresponds exactly to the timestamp of the
> "SegmentLoaderHandler timeout" message. This suggests that this
> value doesn't really tell me that the standby is synchronized.
> When I tried with small repositories, it appears that synchronization
> was done when the tarmk.log file started outputting the same repository
> head every 5 seconds ("interval" setting).
>
> 4.) Assuming that the standby eventually becomes synchronized,
> is there a documented procedure by which I could "split the mirror";
> that is, convert the standby into a new, independent primary
> containing a replica of the original? If the current primary
> and standby are referring to S3 bucket "P", could I shut down
> both instances, copy the contents of bucket "P" to a new bucket
> "S", update the standby Oak S3 configuration to refer to the new
> bucket "S", and restart what was the standby as a new primary?
> Are there other steps I would need to take?
>
> Thanks!
> John
>
>
> CONFIG VALUES FOR BOTH INSTANCES
>
>
> STANDBY CONFIG:
>
>
> /var/lib/sling/install/install.standby/org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.config:
> org.apache.sling.installer.configuration.persist=B"false"
> port=I"8023"
> secure=B"true"
> mode="standby"
> primary.host="john-proto.dev"
> interval=I"5"
> standby.readtimeout=I"43200000"
>
>
> PRIMARY CONFIG:
>
>
> /var/lib/sling/install/install.primary/org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.config
> org.apache.sling.installer.configuration.persist=B"false"
> port=I"8023"
> secure=B"true"
> mode="primary"
> primary.allowed-client-ip-ranges=["0.0.0.0-255.255.255.255"]
>
>
> OAK S3 CONFIG:
>
>
> /var/lib/sling/install/oak_s3/org.apache.jackrabbit.oak.plugins.blob.datastore.SharedS3DataStore.config:
> accessKey=""
> secretKey=""
> s3Bucket="my-primary-bucket"
> s3Region="us-west-2"
> s3EndPoint="s3-us-west-2.amazonaws.com"
> connectionTimeout="120000"
> socketTimeout="120000"
> maxConnections="40"
> writeThreads="30"
> maxErrorRetry="10"
>
>
> JMX MBEANS
>
>
> STANDBY:
>
>
> #mbean =
> org.apache.jackrabbit.oak:id="fa2b9a7c-fc69-4a0c-aa7e-b0cfc61bd1c6",name=Status,type="Standby":
> FailedRequests = 0;
>
> SecondsSinceLastSuccess = 24269;
>
> SyncStartTimestamp = 1478021232280;
>
> SyncEndTimestamp = 1478078778813;
>
> Status = running;
>
> Running = true;
>
> Mode = client: fa2b9a7c-fc69-4a0c-aa7e-b0cfc61bd1c6;
>
>
> PRIMARY:
>
> #mbean = org.apache.jackrabbit.oak:id=8023,name=Status,type="Standby":
> Status = got message;
>
> Running = true;
>
> Mode = primary;
>
> #mbean = org.apache.jackrabbit.oak:id="Client
> fa2b9a7c-fc69-4a0c-aa7e-b0cfc61bd1c6",name=Status,type="Standby":
> RemotePort = 44322;
>
> RemoteAddress = 10.16.12.44;
>
> LastSeenTimestamp = Wed Nov 02 13:48:59 UTC 2016;
>
> TransferredSegments = 186780;
>
> TransferredSegmentBytes = 1198693232;
>
> TransferredBinaries = 5579;
>
> TransferredBinariesBytes = 170312256398;
>
> LastRequest = b.678851bb77bec68db82c6bda37aca8e763d8a32e#655084301;
>
> Name = fa2b9a7c-fc69-4a0c-aa7e-b0cfc61bd1c6;
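
For anyone comparing the MBean values against log timestamps: the SyncStartTimestamp and SyncEndTimestamp values in the dump above appear to be milliseconds since the Unix epoch. A minimal Python sketch of the conversion follows. The helper name millis_to_utc is only illustrative (it is not part of Oak), and the two constants are copied from the standby MBean dump.

```python
from datetime import datetime, timezone

# Values copied from the standby MBean dump in the quoted message.
SYNC_START_MS = 1478021232280
SYNC_END_MS = 1478078778813


def millis_to_utc(ms: int) -> str:
    """Convert epoch milliseconds to an ISO 8601 UTC string (second precision).

    Hypothetical helper for illustration; assumes the MBean reports
    milliseconds since 1970-01-01T00:00:00Z.
    """
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    print("SyncStartTimestamp:", millis_to_utc(SYNC_START_MS))
    print("SyncEndTimestamp:  ", millis_to_utc(SYNC_END_MS))
    # Whole seconds between the two MBean timestamps.
    print("elapsed seconds:   ", (SYNC_END_MS - SYNC_START_MS) // 1000)
```

The SyncEndTimestamp conversion agrees with the 2016-11-02T09:26:18+00:00 value mentioned in question 3.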