Hello Tyler,

What was the sequence of events that led to this situation?

From the symptoms described here, my guess would be:
- You had two nodes, A and B
- A was primary
- Someone ran a bad DB update on A, so you decided to restore a backup
- You shut down A (possibly also B), replaced A's data with what's in the
backup, and started A up again
- A switched to the Primary role again, then attempted to create a DRBD
connection with B
- B still has the most recent data (with the bad DB update)

If this is what happened, then obviously from B's point of view, A's
DRBD looks like a volume that has some different version of the data,
which neither of the nodes has tracking information about (they don't
know how to update each other incrementally, so the only possible
solution is a full sync).

Since A is Primary, the data on A may be in use (e.g., a filesystem on
the device may be mounted), and it is therefore not a good idea to let a
resync change the data on A's disk, because then you could end up with
inconsistencies on A between what's on the disk and what's in the page
cache, thereby potentially breaking the filesystem's structure or
otherwise resulting in data loss. This is why DRBD refuses to start a
resync to a resource that is in the Primary role.
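
As a minimal sketch of how one might check for this condition before
reconnecting, the snippet below parses a device line in the format that
'cat /proc/drbd' prints on 8.3 and reports the local role. The function
name parse_drbd_status is made up for illustration and is not part of
DRBD; the sample line is copied from the primary's output in the
original post.

```python
import re


def parse_drbd_status(line):
    """Split a /proc/drbd device line (8.3 format) into its cs/ro/ds fields.

    Hypothetical helper for illustration only; it is not part of DRBD.
    """
    fields = dict(re.findall(r"\b(cs|ro|ds):(\S+)", line))
    local_role, _, peer_role = fields["ro"].partition("/")
    local_disk, _, peer_disk = fields["ds"].partition("/")
    return {
        "connection": fields["cs"],
        "local_role": local_role,
        "peer_role": peer_role,
        "local_disk": local_disk,
        "peer_disk": peer_disk,
    }


# Sample line taken from the primary's /proc/drbd output in the original post.
sample = " 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----"
status = parse_drbd_status(sample)

if status["local_role"] == "Primary":
    # Matches the "I shall become SyncTarget, but I am primary!" log message.
    print("local node is Primary; DRBD will refuse to make it a sync target")
```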

If A has the correct data now and is supposed to overwrite whatever is
on B, you would at least have to invalidate the data on B.

As you have potentially changed the disk state of A or B to
'Inconsistent' or 'Outdated' in the meantime, this might not work
anymore (DRBD does not resync from an Inconsistent or Outdated source).

Assuming that A has the correct data (the backup) and is supposed to
overwrite all data on B, your best course of action to resolve this
situation now would be:
1. Shut down the DRBD resource on both nodes
    ('drbdadm down <resource_name>')
    The resource must not be in use to be able to do this (e.g., not
    mounted)
2. Create new metadata for this resource on both nodes
    ('drbdadm create-md <resource_name>')
3. Start the DRBD resource on both nodes
    ('drbdadm up <resource_name>')
4. Make sure the resource comes up 'Connected' and 'Inconsistent' on
    both nodes
5. Force A into 'UpToDate' and 'Primary', which will start a full sync to B
    (on A, enter 'drbdadm primary --force <resource_name>')
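
The order of the steps above can be sketched as a small dry-run helper
that only builds the command strings and executes nothing. The function
name recovery_plan, the node labels "A" and "B", and the resource name
"r0" are assumptions for illustration; substitute your own resource
name, and treat the output as a checklist rather than a script to run
blindly.

```python
import shlex


def recovery_plan(resource, source_node="A", target_node="B"):
    """Return (node, command) pairs for the five steps, in order.

    Hypothetical helper: it only builds command strings and runs nothing.
    """
    res = shlex.quote(resource)
    both = (source_node, target_node)
    plan = []
    for action in ("down", "create-md", "up"):
        # Steps 1-3: each action is done on both nodes before the next one.
        plan += [(node, f"drbdadm {action} {res}") for node in both]
    # Step 4: manual check of the connection and disk states on both nodes.
    plan += [(node, "cat /proc/drbd") for node in both]
    # Step 5: only on the node that holds the data to keep.
    plan.append((source_node, f"drbdadm primary --force {res}"))
    return plan


# "r0" is an assumed resource name used only for this example.
for node, command in recovery_plan("r0"):
    print(f"[{node}] {command}")
```

Printing the plan first makes it easier to confirm that the forced
promotion happens only on the node whose data should survive.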

I hope that helps. In any case, do not discard the snapshot before you
have verified that the result is what you were expecting, in case
anything goes wrong (or had already gone wrong before you even
attempted this).

Best regards,
Robert

-- 
Robert Altnoeder
DRBD - Corosync - Pacemaker
+43 (1) 817 82 92 - 0
[email protected]

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

On 12/05/2016 05:09 PM, Tyler Hains wrote:
> Hello,
>
> I had to restore a VM snapshot of a DRBD machine today to revert a 
> catastrophic DB update. However, after restoring the snapshot, I can't seem 
> to get the two nodes to connect. I have already followed the documented 
> recommendations for split-brain recovery, and the results are shown below.
>
> This is my primary:
> [root@MCM5-DB4 ~]# cat /proc/drbd
> version: 8.3.16 (api:88/proto:86-97)
> GIT-hash: a798fa7e274428a357657fb52f0ecf40192c1985 build by phil@Build64R6, 
> 2014-11-24 14:51:37
>  0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
>     ns:0 nr:0 dw:3136608 dr:9182265 al:31009 bm:15031 lo:0 pe:0 ua:0 ap:0 
> ep:1 wo:f oos:1191752
>
> And from /var/log/messages on the primary:
> Dec  5 10:20:07 localhost kernel: block drbd0: conn( StandAlone -> 
> Unconnected )
> Dec  5 10:20:07 localhost kernel: block drbd0: Starting receiver thread (from 
> drbd0_worker [1629])
> Dec  5 10:20:07 localhost kernel: block drbd0: receiver (re)started
> Dec  5 10:20:07 localhost kernel: block drbd0: conn( Unconnected -> 
> WFConnection )
> Dec  5 10:20:07 localhost kernel: block drbd0: Handshake successful: Agreed 
> network protocol version 97
> Dec  5 10:20:07 localhost kernel: block drbd0: Peer authenticated using 20 
> bytes of 'sha1' HMAC
> Dec  5 10:20:07 localhost kernel: block drbd0: conn( WFConnection -> 
> WFReportParams )
> Dec  5 10:20:07 localhost kernel: block drbd0: Starting asender thread (from 
> drbd0_receiver [6450])
> Dec  5 10:20:07 localhost kernel: block drbd0: data-integrity-alg: <not-used>
> Dec  5 10:20:07 localhost kernel: block drbd0: drbd_sync_handshake:
> Dec  5 10:20:07 localhost kernel: block drbd0: self 
> 5E31C2DC5B55B225:D72E026811DB74F3:20B3ACD5A61DC8E5:20B2ACD5A61DC8E5 
> bits:297938 flags:0
> Dec  5 10:20:07 localhost kernel: block drbd0: peer 
> A98F08F32D2FCB34:0000000000000000:5E32C2DC5B55B224:5E31C2DC5B55B225 
> bits:183494167 flags:1
> Dec  5 10:20:07 localhost kernel: block drbd0: uuid_compare()=-2 by rule 60
> Dec  5 10:20:07 localhost kernel: block drbd0: I shall become SyncTarget, but 
> I am primary!
> Dec  5 10:20:07 localhost kernel: block drbd0: conn( WFReportParams -> 
> Disconnecting )
> Dec  5 10:20:07 localhost kernel: block drbd0: error receiving ReportState, 
> l: 4!
> Dec  5 10:20:07 localhost kernel: block drbd0: asender terminated
> Dec  5 10:20:07 localhost kernel: block drbd0: Terminating drbd0_asender
> Dec  5 10:20:07 localhost kernel: block drbd0: Connection closed
> Dec  5 10:20:07 localhost kernel: block drbd0: conn( Disconnecting -> 
> StandAlone )
> Dec  5 10:20:07 localhost kernel: block drbd0: receiver terminated
> Dec  5 10:20:07 localhost kernel: block drbd0: Terminating drbd0_receiver
>
> And this is my secondary:
> [root@MCM5-DB5 log]# cat /proc/drbd
> version: 8.3.16 (api:88/proto:86-97)
> GIT-hash: a798fa7e274428a357657fb52f0ecf40192c1985 build by phil@Build64R6, 
> 2014-11-24 14:51:37
>  0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----s
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:733976668
>
> With the log:
> Dec  5 10:20:07 localhost kernel: block drbd0: Handshake successful: Agreed 
> network protocol version 97
> Dec  5 10:20:07 localhost kernel: block drbd0: Peer authenticated using 20 
> bytes of 'sha1' HMAC
> Dec  5 10:20:07 localhost kernel: block drbd0: conn( WFConnection -> 
> WFReportParams )
> Dec  5 10:20:07 localhost kernel: block drbd0: Starting asender thread (from 
> drbd0_receiver [2171])
> Dec  5 10:20:07 localhost kernel: block drbd0: data-integrity-alg: <not-used>
> Dec  5 10:20:07 localhost kernel: block drbd0: drbd_sync_handshake:
> Dec  5 10:20:07 localhost kernel: block drbd0: self 
> A98F08F32D2FCB34:0000000000000000:5E32C2DC5B55B224:5E31C2DC5B55B225 
> bits:183494167 flags:0
> Dec  5 10:20:07 localhost kernel: block drbd0: peer 
> 5E31C2DC5B55B225:D72E026811DB74F3:20B3ACD5A61DC8E5:20B2ACD5A61DC8E5 
> bits:297938 flags:2
> Dec  5 10:20:07 localhost kernel: block drbd0: uuid_compare()=2 by rule 80
> Dec  5 10:20:07 localhost kernel: block drbd0: Writing the whole bitmap, full 
> sync required after drbd_sync_handshake.
> Dec  5 10:20:07 localhost kernel: block drbd0: meta connection shut down by 
> peer.
> Dec  5 10:20:07 localhost kernel: block drbd0: conn( WFReportParams -> 
> NetworkFailure )
> Dec  5 10:20:07 localhost kernel: block drbd0: asender terminated
> Dec  5 10:20:07 localhost kernel: block drbd0: Terminating drbd0_asender
> Dec  5 10:20:08 localhost kernel: block drbd0: bitmap WRITE of 5600 pages 
> took 183 jiffies
> Dec  5 10:20:08 localhost kernel: block drbd0: 700 GB (183494167 bits) marked 
> out-of-sync by on disk bit-map.
> Dec  5 10:20:08 localhost kernel: block drbd0: error receiving ReportState, 
> l: 4!
> Dec  5 10:20:08 localhost kernel: block drbd0: Connection closed
> Dec  5 10:20:08 localhost kernel: block drbd0: conn( NetworkFailure -> 
> Unconnected )
> Dec  5 10:20:08 localhost kernel: block drbd0: receiver terminated
> Dec  5 10:20:08 localhost kernel: block drbd0: Restarting drbd0_receiver
> Dec  5 10:20:08 localhost kernel: block drbd0: receiver (re)started
> Dec  5 10:20:08 localhost kernel: block drbd0: conn( Unconnected -> 
> WFConnection )
>
> Can anyone tell me how to get my Primary to connect and push its data over to 
> the secondary?
>
> Thanks!
> Tyler Hains
>
>
>
>
>
> The information contained in this email and any attachments is private and is 
> the confidential property of ROAM Data, Inc. If you are not the intended 
> recipient(s) or have otherwise received this email in error, please delete 
> this email and inform the sender as soon as possible. Neither this email nor 
> the information contained in any attachments may be disclosed, stored, used, 
> published or copied by anyone other than the intended recipient(s). All 
> orders for ROAM Data, Inc. products and services are accepted by ROAM Data, 
> Inc. subject to the terms and conditions of sale set forth on the ROAM Data, 
> Inc. website, as such terms and conditions of sale may be changed from time 
> to time without notice.
> _______________________________________________
> drbd-user mailing list
> [email protected]
> http://lists.linbit.com/mailman/listinfo/drbd-user
