Hi!
Just posted this on the drbd-dev list, but I think it might be better to share
it with others here on drbd-user:
I just started testing the (official) drbd9 modules and tools on three CentOS 7
VMs (running on a Xen hypervisor) in PV mode.
My DRBD ‘cluster’ has 3 nodes:
cluster1-a.storage.as41887.net
cluster1-b.storage.as41887.net
cluster1-c.storage.as41887.net
No actual cluster software (Pacemaker) has been installed just yet. I'm trying
to get the hang of ‘drbdmanage’ and do some testing on the recoverability of
resources after various failure scenarios.
Here is a scenario that I can’t seem to ‘fix’ (or am doing wrong …):
* Create a resource/volume through drbdmanage (puts it on node-b and node-c)
* Remove data from node-c and assign node-a as a secondary data node
* Make node-c a Diskless client node, create a filesystem on the drbd-volume
and mount it
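The scenario above roughly corresponds to the following drbdmanage commands (a
dry-run sketch: each command is prefixed with echo so it only prints, drop the
echo to actually run it; the node and resource names are the ones used below):

```shell
# Dry-run sketch of the test scenario; drop the 'echo' prefixes to execute for real.
RES=testvm_prolocation_net

# 1. Create a 10 GB volume, deployed on 2 nodes (drbdmanage picked node-b and node-c)
echo drbdmanage new-volume "$RES" 10GB --deploy 2

# 2. Drop the replica on node-c and put the data on node-a instead
echo drbdmanage unassign "$RES" cluster1-c.storage.as41887.net
echo drbdmanage assign "$RES" cluster1-a.storage.as41887.net

# 3. Re-add node-c as a Diskless client
echo drbdmanage assign --client "$RES" cluster1-c.storage.as41887.net
```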
Then, to see how we can recover from a hard error, I started a fio
disk benchmark on node-c and a few seconds later killed node-a by destroying
the VM. Everything keeps running as expected, but I can’t seem to get node-a
back into normal shape without having to remove the Primary state from node-c.
I followed the steps below; please advise where I went wrong:
[root@cluster1-a ~]# drbdmanage new-volume testvm_prolocation_net 10GB --deploy 2
[root@cluster1-a ~]# drbdmanage nodes
+--------------------------------+-----------+-----------+-------+
| Name                           | Pool Size | Pool Free | State |
+--------------------------------+-----------+-----------+-------+
| cluster1-a.storage.as41887.net |    204796 |    204792 | ok    |
| cluster1-b.storage.as41887.net |    204796 |    195252 | ok    |
| cluster1-c.storage.as41887.net |    204796 |    195252 | ok    |
+--------------------------------+-----------+-----------+-------+
[root@cluster1-a ~]# drbdmanage unassign testvm_prolocation_net cluster1-c.storage.as41887.net
[root@cluster1-a ~]# drbdmanage assign testvm_prolocation_net cluster1-a.storage.as41887.net
[root@cluster1-a ~]# drbdmanage assign --client testvm_prolocation_net cluster1-c.storage.as41887.net
[root@cluster1-a ~]# drbdmanage nodes
+--------------------------------+-----------+-----------+-------+
| Name                           | Pool Size | Pool Free | State |
+--------------------------------+-----------+-----------+-------+
| cluster1-a.storage.as41887.net |    204796 |    195252 | ok    |
| cluster1-b.storage.as41887.net |    204796 |    195252 | ok    |
| cluster1-c.storage.as41887.net |    204796 |    204792 | ok    |
+--------------------------------+-----------+-----------+-------+
After a few minutes (node-a needs to sync data from node-b):
[root@cluster1-a ~]# drbdadm status | grep 'testvm_prolocation_net' -A 6
testvm_prolocation_net role:Secondary
disk:UpToDate
cluster1-b.storage.as41887.net role:Secondary
peer-disk:UpToDate
cluster1-c.storage.as41887.net role:Secondary
peer-disk:Diskless
[root@cluster1-c ~]# mkfs.ext4 /dev/drbd10
[root@cluster1-c ~]# mount /dev/drbd10 /mnt
[root@cluster1-a ~]# drbdadm status | grep 'testvm_prolocation_net' -A 6
testvm_prolocation_net role:Secondary
disk:UpToDate
cluster1-b.storage.as41887.net role:Secondary
peer-disk:UpToDate
cluster1-c.storage.as41887.net role:Primary
peer-disk:Diskless
[root@cluster1-c mnt]# fio /usr/share/doc/fio-2.2.8/examples/iometer-file-access-server.fio
To fake a nice hard crash of one of the data nodes: xm destroy cluster1-a.storage.as41887.net
Before booting cluster1-a:
[root@cluster1-c mnt]# drbdadm status | grep testvm_prolocation_net -A 6
testvm_prolocation_net role:Primary
disk:Diskless
cluster1-a.storage.as41887.net connection:Connecting
cluster1-b.storage.as41887.net role:Secondary
peer-disk:UpToDate
After booting cluster1-a, now the fun starts ;)
[root@cluster1-c mnt]# drbdadm status | grep testvm_prolocation_net -A 6
testvm_prolocation_net role:Primary
disk:Diskless
cluster1-a.storage.as41887.net connection:StandAlone
cluster1-b.storage.as41887.net role:Secondary
peer-disk:UpToDate
[root@cluster1-a ~]# drbdadm status | grep 'testvm_prolocation_net' -A 6
testvm_prolocation_net role:Secondary
disk:Outdated
cluster1-b.storage.as41887.net connection:StandAlone
cluster1-c.storage.as41887.net connection:Connecting
To resolve our 'split brain', I discard the data on node-a:
[root@cluster1-a ~]# drbdadm --discard-my-data connect testvm_prolocation_net
[root@cluster1-a ~]# drbdadm status | grep 'testvm_prolocation_net' -A 6
testvm_prolocation_net role:Secondary
disk:Outdated
cluster1-b.storage.as41887.net connection:Connecting
cluster1-c.storage.as41887.net connection:Connecting
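As an aside, when checking states like this over and over it can help to pull a
single field out of the `drbdadm status` output. A minimal sketch (the sed
parsing is my own, fed with the sample output captured above; real output may
differ between drbd versions):

```shell
# Sample 'drbdadm status' output for the resource, as captured above;
# on a live node you would instead use: status=$(drbdadm status testvm_prolocation_net)
status='testvm_prolocation_net role:Secondary
  disk:Outdated
  cluster1-b.storage.as41887.net connection:Connecting
  cluster1-c.storage.as41887.net connection:Connecting'

# Extract the local disk state (the first 'disk:' field)
disk_state=$(printf '%s\n' "$status" | sed -n 's/^[[:space:]]*disk:\([A-Za-z]*\).*/\1/p' | head -n1)
echo "local disk state: $disk_state"   # prints: local disk state: Outdated
```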
[root@cluster1-b ~]# drbdadm connect testvm_prolocation_net
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Handshake successful: Agreed network protocol
version 110
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Agreed to support TRIM on protocol level
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Peer authenticated using 20 bytes HMAC
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Starting ack_recv thread (from drbd_r_testvm_p
[2297])
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Preparing remote state change 2288137045
(primary_nodes=0, weak_nodes=0)
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Committing remote state change 2288137045
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: conn( Connecting -> Connected ) peer( Unknown
-> Secondary )
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: drbd_sync_handshake:
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: self
BF8F65EBAABAF138:0000000000000000:0000000000000000:0000000000000000 bits:0
flags:0
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: peer
374B6E4C8ECEB4F4:2CA8DD9FBE1E313E:0000000000000000:0000000000000000 bits:225445
flags:100
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: uuid_compare()=-100 by rule 100
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: helper command: /sbin/drbdadm
initial-split-brain
Jun 29 13:47:52 cluster1-a drbdadm[2320]: Don't know which config file belongs
to resource testvm_prolocation_net, trying default ones...
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: helper command: /sbin/drbdadm
initial-split-brain exit code 0 (0x0)
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10:
Split-Brain detected, manually solved. Sync from peer node
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: pdsk( DUnknown -> UpToDate ) repl( Off ->
WFBitMapT )
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: receive bitmap stats [Bytes(packets)]: plain
0(0), RLE 82229(21), total 82229; compression: 73.1%
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: send bitmap stats [Bytes(packets)]: plain 0(0),
RLE 82229(21), total 82229; compression: 73.1%
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: helper command: /sbin/drbdadm
before-resync-target
Jun 29 13:47:52 cluster1-a drbdadm[2322]: Don't know which config file belongs
to resource testvm_prolocation_net, trying default ones...
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: helper command: /sbin/drbdadm
before-resync-target exit code 0 (0x0)
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10: disk(
Outdated -> Inconsistent )
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: repl( WFBitMapT -> SyncTarget )
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-c.storage.as41887.net: resync-susp( no -> connection dependency )
Jun 29 13:47:52 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: Began resync as SyncTarget (will sync 901780 KB
[225445 bits set]).
[root@cluster1-a ~]# drbdadm status | grep 'testvm_prolocation_net' -A 6
testvm_prolocation_net role:Secondary
disk:Inconsistent
cluster1-b.storage.as41887.net role:Secondary
replication:SyncTarget peer-disk:UpToDate done:91.61
cluster1-c.storage.as41887.net connection:Connecting
… which ends up in:
Jun 29 13:48:48 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: Resync done (total 56 sec; paused 0 sec; 16100
K/sec)
Jun 29 13:48:48 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: Peer was unstable during resync
Jun 29 13:48:48 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: updated UUIDs
BF8F65EBAABAF138:0000000000000000:0000000000000000:0000000000000000
Jun 29 13:48:48 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: repl( SyncTarget -> Established )
Jun 29 13:48:48 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-c.storage.as41887.net: resync-susp( connection dependency -> no )
Jun 29 13:48:48 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: helper command: /sbin/drbdadm
after-resync-target
Jun 29 13:48:48 cluster1-a drbdadm[2393]: Don't know which config file belongs
to resource testvm_prolocation_net, trying default ones...
Jun 29 13:48:48 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: helper command: /sbin/drbdadm
after-resync-target exit code 0 (0x0)
[root@cluster1-a ~]# drbdadm status | grep 'testvm_prolocation_net' -A 6
testvm_prolocation_net role:Secondary
disk:Inconsistent
cluster1-b.storage.as41887.net role:Secondary
peer-disk:UpToDate
cluster1-c.storage.as41887.net connection:Connecting
To get it going, I need to unmount the filesystem on the Diskless DRBD client
node-c:
[root@cluster1-c ~]# umount /mnt
[root@cluster1-c ~]# drbdadm status | grep testvm_prolocation_net -A 6
testvm_prolocation_net role:Secondary
disk:Diskless
cluster1-a.storage.as41887.net connection:StandAlone
cluster1-b.storage.as41887.net role:Secondary
peer-disk:UpToDate
[root@cluster1-a ~]# drbdadm disconnect testvm_prolocation_net
[root@cluster1-b ~]# drbdadm status | grep testvm_prolocation_net -A 6
testvm_prolocation_net role:Secondary
disk:UpToDate
cluster1-a.storage.as41887.net connection:Connecting
cluster1-c.storage.as41887.net role:Secondary
peer-disk:Diskless
[root@cluster1-a ~]# drbdadm --discard-my-data connect testvm_prolocation_net
Jun 29 13:54:37 cluster1-a kernel: drbd testvm_prolocation_net
tcp:cluster1-c.storage.as41887.net: Closing unexpected connection from
94.228.142.34 to port 7700
Jun 29 13:54:44 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: conn( StandAlone -> Unconnected )
Jun 29 13:54:44 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Starting receiver thread (from drbd_w_testvm_p
[1330])
Jun 29 13:54:44 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: conn( Unconnected -> Connecting )
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Handshake successful: Agreed network protocol
version 110
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Agreed to support TRIM on protocol level
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Peer authenticated using 20 bytes HMAC
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Starting ack_recv thread (from drbd_r_testvm_p
[2481])
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Preparing remote state change 439073605
(primary_nodes=0, weak_nodes=0)
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: Committing remote state change 439073605
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net
cluster1-b.storage.as41887.net: conn( Connecting -> Connected ) peer( Unknown
-> Secondary )
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: drbd_sync_handshake:
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: self
BF8F65EBAABAF138:0000000000000000:0000000000000000:0000000000000000 bits:0
flags:0
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: peer
374B6E4C8ECEB4F4:2CA8DD9FBE1E313E:0000000000000000:0000000000000000 bits:2
flags:120
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: uuid_compare()=-100 by rule 100
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: helper command: /sbin/drbdadm
initial-split-brain
Jun 29 13:54:45 cluster1-a drbdadm[2485]: Don't know which config file belongs
to resource testvm_prolocation_net, trying default ones...
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: helper command: /sbin/drbdadm
initial-split-brain exit code 0 (0x0)
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10:
Split-Brain detected, manually solved. Sync from peer node
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: pdsk( DUnknown -> UpToDate ) repl( Off ->
WFBitMapT )
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: receive bitmap stats [Bytes(packets)]: plain
0(0), RLE 25(1), total 25; compression: 100.0%
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: send bitmap stats [Bytes(packets)]: plain 0(0),
RLE 25(1), total 25; compression: 100.0%
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: helper command: /sbin/drbdadm
before-resync-target
Jun 29 13:54:45 cluster1-a drbdadm[2487]: Don't know which config file belongs
to resource testvm_prolocation_net, trying default ones...
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: helper command: /sbin/drbdadm
before-resync-target exit code 0 (0x0)
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: repl( WFBitMapT -> SyncTarget )
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-c.storage.as41887.net: resync-susp( no -> connection dependency )
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: Began resync as SyncTarget (will sync 8 KB [2
bits set]).
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: Resync done (total 1 sec; paused 0 sec; 8 K/sec)
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: updated UUIDs
374B6E4C8ECEB4F4:0000000000000000:BF8F65EBAABAF138:406F8167C11BEB40
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10: disk(
Inconsistent -> UpToDate )
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: repl( SyncTarget -> Established )
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-c.storage.as41887.net: pdsk( DUnknown -> Outdated ) resync-susp(
connection dependency -> no )
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: helper command: /sbin/drbdadm
after-resync-target
Jun 29 13:54:45 cluster1-a drbdadm[2489]: Don't know which config file belongs
to resource testvm_prolocation_net, trying default ones...
Jun 29 13:54:45 cluster1-a kernel: drbd testvm_prolocation_net/0 drbd10
cluster1-b.storage.as41887.net: helper command: /sbin/drbdadm
after-resync-target exit code 0 (0x0)
[root@cluster1-a ~]# drbdadm status | grep testvm_prolocation_net -A 6
testvm_prolocation_net role:Secondary
disk:UpToDate
cluster1-b.storage.as41887.net role:Secondary
peer-disk:UpToDate
cluster1-c.storage.as41887.net connection:Connecting
Now I still need to re-connect node-c to the recovered node-a:
[root@cluster1-c ~]# drbdadm status | grep testvm_prolocation_net -A 6
testvm_prolocation_net role:Secondary
disk:Diskless
cluster1-a.storage.as41887.net connection:StandAlone
cluster1-b.storage.as41887.net role:Secondary
peer-disk:UpToDate
[root@cluster1-c ~]# drbdadm connect testvm_prolocation_net
testvm_prolocation_net: Failure: (125) Device has a net-config (use disconnect
first)
Command 'drbdsetup connect testvm_prolocation_net 0' terminated with exit code
10
[root@cluster1-c ~]# drbdadm connect testvm_prolocation_net --peer
cluster1-a.storage.as41887.net
testvm_prolocation_net: Failure: (125) Device has a net-config (use disconnect
first)
Command 'drbdsetup connect testvm_prolocation_net 0' terminated with exit code
10
[root@cluster1-c ~]# drbdadm disconnect testvm_prolocation_net
[root@cluster1-c ~]# drbdadm connect testvm_prolocation_net
[root@cluster1-c ~]# drbdadm status | grep testvm_prolocation_net -A 6
testvm_prolocation_net role:Secondary
disk:Diskless
cluster1-a.storage.as41887.net role:Secondary
peer-disk:UpToDate
cluster1-b.storage.as41887.net role:Secondary
peer-disk:UpToDate
Finally, all is synced and well connected again …
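For reference, the manual recovery sequence that finally worked for me boils
down to the following (again a dry-run sketch with echo prefixes; the comments
note on which node each command was run):

```shell
RES=testvm_prolocation_net

# node-c: unmount so the Primary role can be released
echo umount /mnt

# node-a: drop the StandAlone half-connection, then rejoin discarding local data
echo drbdadm disconnect "$RES"
echo drbdadm --discard-my-data connect "$RES"

# node-c: a plain 'connect' fails with "Device has a net-config",
# so disconnect first, then reconnect
echo drbdadm disconnect "$RES"
echo drbdadm connect "$RES"
```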
The steps above are a replay I did today, after having seen this behaviour last
night, so it seems to be reproducible. I'm also wondering whether the ‘recovery’
steps should be done through drbdmanage as well? I can’t seem to find any
complete documentation on drbdmanage besides the man page.
Yours,
Chris
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user