Hi,

drbd90 kernel module version: 9.0.22-2
drbd90-utils: 9.12.2-1
kernel: 3.10.0-1127.18.2.el7.x86_64
pacemaker: 1.1.21-4
corosync: 2.4.5-4
system: CentOS 7.6

I have a 4-node test system (only ever 1 active primary) which is going split-brain unexpectedly. n1 is the primary; n2, n3 and n4 are secondary. The system is shut down every night, and sometimes on restart (particularly after a weekend shutdown) some of the nodes are split-brain and require a full resync to fix.

The logs seem to indicate a problem with uuid_compare. From the system log on n1, first for n3:

Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: drbd_sync_handshake:
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: self 30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: peer 554921683EF7CC82:0000000000000000:272E3DE9D9C74A66:04B370F60768109E bits:0 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: uuid_compare()=split-brain-disconnect by rule 100
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: helper command: /sbin/drbdadm initial-split-brain

Then for n2:

Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: drbd_sync_handshake:
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: self 30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: peer BC13E2E36CA8B2C6:CE02E3A41E743EDA:272E3DE9D9C74A66:001E2864952E2E96 bits:416 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: uuid_compare()=split-brain-auto-recover by rule 90
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: /sbin/drbdadm initial-split-brain
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: meta connection shut down by peer.
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: ack_receiver terminated
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Terminating ack_recv thread
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: /sbin/drbdadm initial-split-brain exit code 0
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0: Split-Brain detected but unresolved, dropping connection!
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: /sbin/drbdadm split-brain
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: /sbin/drbdadm split-brain exit code 0
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: conn( NetworkFailure -> Disconnecting )
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: error receiving P_STATE, e: -5 l: 0!
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Restarting sender thread
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Connection closed
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: conn( Disconnecting -> StandAlone )
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Terminating receiver thread

The logs also contain FIXME messages (which may be unrelated), e.g.:

Oct 13 12:50:42 cdc0-n1 kernel: drbd r0/0 drbd0: FIXME drbd_a_r0[97260] op clear, bitmap locked for 'send_bitmap (WFBitMapS)' by drbd_w_r0[1659]
Oct 13 12:50:42 cdc0-n1 kernel: drbd r0/0 drbd0: FIXME drbd_a_r0[97260] op clear, bitmap locked for 'receive bitmap' by drbd_r_r0[95684]
Sep 23 12:41:34 cdc0-n1 kernel: drbd r0/0 drbd0: FIXME drbd_a_r0[19978] op clear, bitmap locked for 'set_n_write from sync_handshake' by drbd_r_r0[17628]

Regards,
Jeremy Faith
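
P.S. In case it helps with comparing handshakes across peers, here is a minimal sketch (not part of DRBD; the names summarize_handshakes, UUID_LINE, RESULT_LINE and SAMPLE_LOG are made up for illustration) that pulls the self/peer UUID lists and the uuid_compare() result out of kernel log lines. It assumes the log format shown above and treats the colon-separated values as opaque 16-digit hex strings, without interpreting what each position means.

```python
import re

# Matches e.g. "drbd r0/0 drbd0 cdc0-n3: self <uuid>:<uuid>:... bits:786 flags:20"
UUID_LINE = re.compile(
    r"drbd \S+ \S+ (?P<peer>\S+): (?P<side>self|peer) "
    r"(?P<uuids>[0-9A-F]{16}(?::[0-9A-F]{16})*) bits:(?P<bits>\d+)"
)
# Matches e.g. "drbd r0/0 drbd0 cdc0-n3: uuid_compare()=split-brain-disconnect by rule 100"
RESULT_LINE = re.compile(
    r"drbd \S+ \S+ (?P<peer>\S+): "
    r"uuid_compare\(\)=(?P<result>\S+) by rule (?P<rule>\d+)"
)
ZERO_UUID = "0" * 16


def summarize_handshakes(lines):
    """Return {peer_name: {"self", "peer", "result", "rule", "shared"}}."""
    summary = {}
    for line in lines:
        m = UUID_LINE.search(line)
        if m:
            entry = summary.setdefault(m["peer"], {})
            entry[m["side"]] = m["uuids"].split(":")
            continue
        m = RESULT_LINE.search(line)
        if m:
            entry = summary.setdefault(m["peer"], {})
            entry["result"] = m["result"]
            entry["rule"] = int(m["rule"])
    for entry in summary.values():
        # UUIDs present on both sides, ignoring the all-zero placeholder.
        shared = set(entry.get("self", [])) & set(entry.get("peer", []))
        shared.discard(ZERO_UUID)
        entry["shared"] = sorted(shared)
    return summary


# Lines copied from the n1 system log quoted in this message.
SAMPLE_LOG = """\
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: self 30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: peer 554921683EF7CC82:0000000000000000:272E3DE9D9C74A66:04B370F60768109E bits:0 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: uuid_compare()=split-brain-disconnect by rule 100
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: self 30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: peer BC13E2E36CA8B2C6:CE02E3A41E743EDA:272E3DE9D9C74A66:001E2864952E2E96 bits:416 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: uuid_compare()=split-brain-auto-recover by rule 90
"""

summary = summarize_handshakes(SAMPLE_LOG.splitlines())
for peer_name, entry in summary.items():
    print(peer_name, entry["result"], "rule", entry["rule"], "shared:", entry["shared"])
```

Run against the excerpt above, this reports that n2 still has one UUID in common with n1 (CE02E3A41E743EDA), while n3 has none in common with n1. The same function could be fed the output of something like "grep drbd /var/log/messages" to compare several restarts.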
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
[email protected]
https://lists.linbit.com/mailman/listinfo/drbd-user
