Hi,

drbd90 kernel module version: 9.0.22-2
drbd90-utils: 9.12.2-1
kernel: 3.10.0-1127.18.2.el7.x86_64
pacemaker: 1.1.21-4
corosync: 2.4.5-4
OS: CentOS 7.6

I have a 4-node test system (only ever one active primary) that is going
split-brain unexpectedly.
n1 is the primary; n2, n3 and n4 are secondaries.
The system is shut down every night, and sometimes on restart (particularly
after a weekend shutdown) some of the nodes are split-brain and require a
full resync to fix.
The logs seem to indicate a problem with uuid_compare.

From the system log on n1:-
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: drbd_sync_handshake:
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: self 
30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 
flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: peer 
554921683EF7CC82:0000000000000000:272E3DE9D9C74A66:04B370F60768109E bits:0 
flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: 
uuid_compare()=split-brain-disconnect by rule 100
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: helper command: 
/sbin/drbdadm initial-split-brain

Then for n2:-
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: drbd_sync_handshake:
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: self 
30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 
flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: peer 
BC13E2E36CA8B2C6:CE02E3A41E743EDA:272E3DE9D9C74A66:001E2864952E2E96 bits:416 
flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: 
uuid_compare()=split-brain-auto-recover by rule 90
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: 
/sbin/drbdadm initial-split-brain
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: meta connection shut down by 
peer.
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: conn( Connected -> 
NetworkFailure ) peer( Secondary -> Unknown )
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: ack_receiver terminated
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Terminating ack_recv thread
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: 
/sbin/drbdadm initial-split-brain exit code 0
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0: Split-Brain detected but 
unresolved, dropping connection!
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: 
/sbin/drbdadm split-brain
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: 
/sbin/drbdadm split-brain exit code 0
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: conn( NetworkFailure -> 
Disconnecting )
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: error receiving P_STATE, e: -5 
l: 0!
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Restarting sender thread
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Connection closed
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: conn( Disconnecting -> 
StandAlone )
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Terminating receiver thread
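
As an aid to reading the handshake lines above, here is a minimal Python
sketch that lists which UUID values appear on both the "self" and "peer"
sides. The function names are made up for illustration, and the script only
compares the printed strings; it does not reproduce the kernel's
uuid_compare() rules.

```python
import re

# Matches one 16-hex-digit UUID as printed in the handshake lines.
UUID_RE = re.compile(r"\b[0-9A-F]{16}\b")


def parse_uuids(text):
    """Return the UUIDs found in a 'self'/'peer' handshake line, in order."""
    return UUID_RE.findall(text)


def shared_uuids(self_text, peer_text):
    """Return (uuid, self_index, peer_index) for each non-zero UUID on both sides."""
    zero = "0" * 16
    peer = parse_uuids(peer_text)
    return [
        (u, i, peer.index(u))
        for i, u in enumerate(parse_uuids(self_text))
        if u != zero and u in peer
    ]


# UUID strings copied from the log excerpts above (wrapped lines rejoined).
SELF = "30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 flags:20"
PEER_N3 = "554921683EF7CC82:0000000000000000:272E3DE9D9C74A66:04B370F60768109E bits:0 flags:20"
PEER_N2 = "BC13E2E36CA8B2C6:CE02E3A41E743EDA:272E3DE9D9C74A66:001E2864952E2E96 bits:416 flags:20"

if __name__ == "__main__":
    print("n1 vs n3:", shared_uuids(SELF, PEER_N3))
    print("n1 vs n2:", shared_uuids(SELF, PEER_N2))
    print("n2 vs n3:", shared_uuids(PEER_N2, PEER_N3))
```

For the lines above, n3 shares no UUID with n1, n2 shares only
CE02E3A41E743EDA with n1 (second field on both sides), and n2 and n3 share
272E3DE9D9C74A66 (third field on both sides).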

The logs also have FIXME messages (which may be unrelated), e.g.:-
Oct 13 12:50:42 cdc0-n1 kernel: drbd r0/0 drbd0: FIXME drbd_a_r0[97260] op 
clear, bitmap locked for 'send_bitmap (WFBitMapS)' by drbd_w_r0[1659]

Oct 13 12:50:42 cdc0-n1 kernel: drbd r0/0 drbd0: FIXME drbd_a_r0[97260] op 
clear, bitmap locked for 'receive bitmap' by drbd_r_r0[95684]

Sep 23 12:41:34 cdc0-n1 kernel: drbd r0/0 drbd0: FIXME drbd_a_r0[19978] op 
clear, bitmap locked for 'set_n_write from sync_handshake' by drbd_r_r0[17628]

Regards,
Jeremy Faith
_______________________________________________
drbd-user mailing list
https://lists.linbit.com/mailman/listinfo/drbd-user