Hello, I have 2 node cluster (pacemaker) with master/slave DRBD resource.
The resource is configured with handlers: fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh"; unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh"; If I shutdown the master node it gets fenced by handler and promotes the live node, see in log: Jul 3 07:41:45 val-test1-copy kernel: [ 308.983600] drbd odoo val-test2-copy: peer( Primary -> Secondary ) Jul 3 07:41:46 val-test1-copy kernel: [ 309.113294] drbd odoo val-test2-copy: Preparing remote state change 4162941012 Jul 3 07:41:46 val-test1-copy kernel: [ 309.113464] drbd odoo val-test2-copy: Committing remote state change 4162941012 (primary_nodes=0) Jul 3 07:41:46 val-test1-copy kernel: [ 309.113467] drbd odoo val-test2-copy: conn( Connected -> TearDown ) peer( Secondary ->Unknown ) Jul 3 07:41:46 val-test1-copy kernel: [ 309.113468] drbd odoo/0 drbd0 val-test2-copy: pdsk( UpToDate -> DUnknown ) repl( Established -> Off ) Jul 3 07:41:46 val-test1-copy kernel: [ 309.113479] drbd odoo val-test2-copy: ack_receiver terminated Jul 3 07:41:46 val-test1-copy kernel: [ 309.113480] drbd odoo val-test2-copy: Terminating ack_recv thread Jul 3 07:41:46 val-test1-copy kernel: [ 309.134608] drbd odoo val-test2-copy: Restarting sender thread Jul 3 07:41:46 val-test1-copy kernel: [ 309.145903] drbd odoo val-test2-copy: Connection closed Jul 3 07:41:46 val-test1-copy kernel: [ 309.145912] drbd odoo val-test2-copy: helper command: /sbin/drbdadm disconnected Jul 3 07:41:46 val-test1-copy kernel: [ 309.146759] drbd odoo val-test2-copy: helper command: /sbin/drbdadm disconnected exit code 0 Jul 3 07:41:46 val-test1-copy kernel: [ 309.146765] drbd odoo val-test2-copy: conn( TearDown -> Unconnected ) Jul 3 07:41:46 val-test1-copy kernel: [ 309.146770] drbd odoo val-test2-copy: Restarting receiver thread Jul 3 07:41:46 val-test1-copy kernel: [ 309.146771] drbd odoo val-test2-copy: conn( Unconnected -> Connecting ) Jul 3 07:41:47 val-test1-copy kernel: [ 310.357823] drbd odoo val-test2-copy: helper command: /sbin/drbdadm fence-peer Jul 3 07:41:27 val-test1-copy qemu-ga: message repeated 13 times: [ info: guest-ping called] Jul 3 07:41:47 val-test1-copy crm-fence-peer.9.sh[2907]: DRBD_BACKING_DEV_0=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1 DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=Connecting DRBD_LL_DISK=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1 DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=192.168.2.111 DRBD_MY_AF=ipv4 DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=val-test1-copy DRBD_NODE_ID_1=val-test2-copy DRBD_PEER_ADDRESS=192.168.2.112 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1 DRBD_RESOURCE=odoo DRBD_VOLUME=0 UP_TO_DATE_NODES=0x00000001 /usr/lib/drbd/crm-fence-peer.9.sh Jul 3 07:41:47 val-test1-copy crm-fence-peer.9.sh[2907]: / Jul 3 07:41:47 val-test1-copy crm-fence-peer.9.sh[2907]: INFO peers are reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-odoo-ms_drbd_odoo' Jul 3 07:41:47 val-test1-copy kernel: [ 310.639973] drbd odoo val-test2-copy: helper command: /sbin/drbdadm fence-peer exit code 4 (0x400) Jul 3 07:41:47 val-test1-copy kernel: [ 310.639977] drbd odoo val-test2-copy: fence-peer helper returned 4 (peer was fenced) Jul 3 07:41:47 val-test1-copy kernel: [ 310.639980] drbd odoo/0 drbd0 val-test2-copy: pdsk( DUnknown -> Outdated ) Jul 3 07:41:47 val-test1-copy kernel: [ 310.639991] drbd odoo: Preparing cluster-wide state change 1652257927 (0->-1 3/1) Jul 3 07:41:47 val-test1-copy kernel: [ 310.639992] drbd odoo: Committing cluster-wide state change 1652257927 (0ms) Jul 3 07:41:47 val-test1-copy kernel: [ 310.639993] drbd odoo: role( Secondary -> Primary ) Jul 3 07:41:49 val-test1-copy kernel: [ 313.035751] drbd odoo/0 drbd0: new current UUID: E67F75EE576FC547 weak: FFFFFFFFFFFFFFFE Jul 3 07:41:50 val-test1-copy kernel: [ 313.058994] EXT4-fs (drbd0): mounted filesystem with ordered data mode. Opts: (null) So far perfect. Later i start the node and start pacemaker. The node joins as slave and starts replication, then removes the fence constraint from pacemaker (also perfect): Jul 3 07:51:13 val-test1-copy kernel: [ 876.831326] drbd odoo val-test2-copy: Handshake to peer 1 successful: Agreed network protocol version 118 Jul 3 07:51:13 val-test1-copy kernel: [ 876.831327] drbd odoo val-test2-copy: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES. Jul 3 07:51:13 val-test1-copy kernel: [ 876.831494] drbd odoo val-test2-copy: Peer authenticated using 20 bytes HMAC Jul 3 07:51:13 val-test1-copy kernel: [ 876.831502] drbd odoo val-test2-copy: Starting ack_recv thread (from drbd_r_odoo [2018]) Jul 3 07:51:13 val-test1-copy kernel: [ 876.872236] drbd odoo: Preparing cluster-wide state change 2689784605 (0->1 499/145) Jul 3 07:51:13 val-test1-copy kernel: [ 876.888237] drbd odoo/0 drbd0 val-test2-copy: drbd_sync_handshake: Jul 3 07:51:13 val-test1-copy kernel: [ 876.888238] drbd odoo/0 drbd0 val-test2-copy: self E67F75EE576FC547:FA440AF0812F8348:2F6D8A307AFA432E:492AA1512F283D34 bits:44 flags:120 Jul 3 07:51:13 val-test1-copy kernel: [ 876.888241] drbd odoo/0 drbd0 val-test2-copy: peer FA440AF0812F8348:0000000000000000:492AA1512F283D34:03FA3B8A22BE10CC bits:0 flags:1020 Jul 3 07:51:13 val-test1-copy kernel: [ 876.888242] drbd odoo/0 drbd0 val-test2-copy: uuid_compare()=source-use-bitmap by rule70 Jul 3 07:51:15 val-test1-copy kernel: [ 878.341061] drbd odoo: State change 2689784605: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC Jul 3 07:51:15 val-test1-copy kernel: [ 878.341063] drbd odoo: Committing cluster-wide state change 2689784605 (1468ms) Jul 3 07:51:15 val-test1-copy kernel: [ 878.341077] drbd odoo val-test2-copy: conn( Connecting -> Connected ) peer( Unknown ->Secondary ) Jul 3 07:51:15 val-test1-copy kernel: [ 878.341078] drbd odoo/0 drbd0 val-test2-copy: pdsk( Outdated -> Consistent ) repl( Off-> WFBitMapS ) Jul 3 07:51:15 val-test1-copy kernel: [ 878.341331] drbd odoo/0 drbd0 val-test2-copy: pdsk( Consistent -> Outdated ) Jul 3 07:51:15 val-test1-copy kernel: [ 878.348293] drbd odoo/0 drbd0 val-test2-copy: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 47(1), total 47; compression: 100.0% Jul 3 07:51:15 val-test1-copy kernel: [ 878.367674] drbd odoo/0 drbd0 val-test2-copy: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 47(1), total 47; compression: 100.0% Jul 3 07:51:15 val-test1-copy kernel: [ 878.367686] drbd odoo/0 drbd0 val-test2-copy: helper command: /sbin/drbdadm before-resync-source Jul 3 07:51:15 val-test1-copy kernel: [ 878.368492] drbd odoo/0 drbd0 val-test2-copy: helper command: /sbin/drbdadm before-resync-source exit code 0 Jul 3 07:51:15 val-test1-copy kernel: [ 878.368499] drbd odoo/0 drbd0 val-test2-copy: pdsk( Outdated -> Inconsistent ) repl( WFBitMapS -> SyncSource ) Jul 3 07:51:15 val-test1-copy kernel: [ 878.368523] drbd odoo/0 drbd0 val-test2-copy: Began resync as SyncSource (will sync 176 KB [44 bits set]). Jul 3 07:51:15 val-test1-copy kernel: [ 878.497928] drbd odoo/0 drbd0 val-test2-copy: updated UUIDs E67F75EE576FC547:0000000000000000:2F6D8A307AFA432E:492AA1512F283D34 Jul 3 07:51:15 val-test1-copy kernel: [ 878.515002] drbd odoo/0 drbd0 val-test2-copy: Resync done (total 1 sec; paused 0 sec; 176 K/sec) Jul 3 07:51:15 val-test1-copy kernel: [ 878.515006] drbd odoo/0 drbd0 val-test2-copy: pdsk( Inconsistent -> UpToDate ) repl( SyncSource -> Established ) Jul 3 07:51:15 val-test1-copy kernel: [ 878.533682] drbd odoo val-test2-copy: helper command: /sbin/drbdadm unfence-peer Jul 3 07:51:15 val-test1-copy crm-unfence-peer.9.sh[11907]: DRBD_BACKING_DEV_0=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1 DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=Connected DRBD_LL_DISK=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1 DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=192.168.2.111 DRBD_MY_AF=ipv4 DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=val-test1-copy DRBD_NODE_ID_1=val-test2-copy DRBD_PEER_ADDRESS=192.168.2.112 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1 DRBD_RESOURCE=odoo DRBD_VOLUME=0 UP_TO_DATE_NODES=0x00000003 /usr/lib/drbd/crm-unfence-peer.9.sh Jul 3 07:51:15 val-test1-copy crm-unfence-peer.9.sh[11907]: INFO Removed constraint 'drbd-fence-by-handler-odoo-ms_drbd_odoo' Jul 3 07:51:15 val-test1-copy kernel: [ 878.582275] drbd odoo val-test2-copy: helper command: /sbin/drbdadm unfence-peer exit code 0 Now the problem - i got message "PingAck did not arrive in time" in kernel i guess a network issue. The slave node was fenced as it should be. Log from master node: Jul 3 22:05:35 val-test1-copy kernel: [50518.748265] drbd odoo val-test2-copy: PingAck did not arrive in time. Jul 3 22:05:35 val-test1-copy kernel: [50518.748319] drbd odoo val-test2-copy: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown ) Jul 3 22:05:35 val-test1-copy kernel: [50518.748320] drbd odoo/0 drbd0: disk( UpToDate -> Consistent ) Jul 3 22:05:35 val-test1-copy kernel: [50518.748322] drbd odoo/0 drbd0 val-test2-copy: pdsk( UpToDate -> DUnknown ) repl( Established -> Off ) Jul 3 22:05:35 val-test1-copy kernel: [50518.748339] drbd odoo val-test2-copy: ack_receiver terminated Jul 3 22:05:35 val-test1-copy kernel: [50518.748342] drbd odoo val-test2-copy: Terminating ack_recv thread Jul 3 22:05:35 val-test1-copy kernel: [50518.785064] drbd odoo val-test2-copy: Restarting sender thread Jul 3 22:05:36 val-test1-copy kernel: [50519.533938] drbd odoo: Preparing cluster-wide state change 1241314843 (0->-1 0/0) Jul 3 22:05:36 val-test1-copy kernel: [50519.533961] drbd odoo: Committing cluster-wide state change 1241314843 (0ms) Jul 3 22:05:36 val-test1-copy kernel: [50519.533964] drbd odoo/0 drbd0: disk( Consistent -> UpToDate ) Jul 3 22:05:36 val-test1-copy kernel: [50519.611815] drbd odoo val-test2-copy: Connection closed Jul 3 22:05:36 val-test1-copy kernel: [50519.611823] drbd odoo val-test2-copy: helper command: /sbin/drbdadm disconnected Jul 3 22:05:36 val-test1-copy kernel: [50519.612622] drbd odoo val-test2-copy: helper command: /sbin/drbdadm disconnected exit code 0 Jul 3 22:05:36 val-test1-copy kernel: [50519.612627] drbd odoo val-test2-copy: conn( NetworkFailure -> Unconnected ) Jul 3 22:05:36 val-test1-copy kernel: [50519.612631] drbd odoo val-test2-copy: Restarting receiver thread Jul 3 22:05:36 val-test1-copy kernel: [50519.612633] drbd odoo val-test2-copy: conn( Unconnected -> Connecting ) Jul 3 22:05:36 val-test1-copy kernel: [50520.169410] drbd odoo val-test2-copy: Handshake to peer 1 successful: Agreed network protocol version 118 Jul 3 22:05:36 val-test1-copy kernel: [50520.169411] drbd odoo val-test2-copy: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES. Jul 3 22:05:36 val-test1-copy kernel: [50520.169582] drbd odoo val-test2-copy: Peer authenticated using 20 bytes HMAC Jul 3 22:05:36 val-test1-copy kernel: [50520.169588] drbd odoo val-test2-copy: Starting ack_recv thread (from drbd_r_odoo [2037]) Jul 3 22:05:36 val-test1-copy kernel: [50520.229035] drbd odoo: Preparing cluster-wide state change 3809390519 (0->1 499/146) Jul 3 22:05:36 val-test1-copy kernel: [50520.245020] drbd odoo/0 drbd0 val-test2-copy: drbd_sync_handshake: Jul 3 22:05:36 val-test1-copy kernel: [50520.245021] drbd odoo/0 drbd0 val-test2-copy: self FAD559D0085B1D10:0000000000000000:2F6D8A307AFA432E:492AA1512F283D34 bits:0 flags:120 Jul 3 22:05:36 val-test1-copy kernel: [50520.245022] drbd odoo/0 drbd0 val-test2-copy: peer FAD559D0085B1D11:0000000000000000:492AA1512F283D34:03FA3B8A22BE10CC bits:0 flags:1120 Jul 3 22:05:36 val-test1-copy kernel: [50520.245024] drbd odoo/0 drbd0 val-test2-copy: uuid_compare()=no-sync by rule 38 Jul 3 22:05:36 val-test1-copy kernel: [50520.248276] drbd odoo: State change 3809390519: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC Jul 3 22:05:36 val-test1-copy kernel: [50520.248278] drbd odoo: Committing cluster-wide state change 3809390519 (16ms) Jul 3 22:05:36 val-test1-copy kernel: [50520.248290] drbd odoo val-test2-copy: conn( Connecting -> Connected ) peer( Unknown ->Primary ) Jul 3 22:05:36 val-test1-copy kernel: [50520.248291] drbd odoo/0 drbd0 val-test2-copy: pdsk( DUnknown -> UpToDate ) repl( Off -> Established ) Log from slave node (fence-peer handler executed and constraint set in pacemaker): Jul 3 22:05:35 val-test2-copy kernel: [50697.130833] drbd odoo val-test1-copy: sock was shut down by peer Jul 3 22:05:35 val-test2-copy kernel: [50697.130846] drbd odoo val-test1-copy: conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown ) Jul 3 22:05:35 val-test2-copy kernel: [50697.130848] drbd odoo/0 drbd0 val-test1-copy: pdsk( UpToDate -> DUnknown ) repl( Established -> Off ) Jul 3 22:05:35 val-test2-copy kernel: [50697.130874] drbd odoo val-test1-copy: ack_receiver terminated Jul 3 22:05:35 val-test2-copy kernel: [50697.130877] drbd odoo val-test1-copy: Terminating ack_recv thread Jul 3 22:05:35 val-test2-copy kernel: [50697.156086] drbd odoo val-test1-copy: Restarting sender thread Jul 3 22:05:36 val-test2-copy kernel: [50697.867155] drbd odoo val-test1-copy: Connection closed Jul 3 22:05:36 val-test2-copy kernel: [50697.867218] drbd odoo val-test1-copy: helper command: /sbin/drbdadm disconnected Jul 3 22:05:36 val-test2-copy kernel: [50697.867221] drbd odoo val-test1-copy: helper command: /sbin/drbdadm fence-peer Jul 3 22:05:36 val-test2-copy kernel: [50697.867998] drbd odoo val-test1-copy: helper command: /sbin/drbdadm disconnected exit code 0 Jul 3 22:05:36 val-test2-copy kernel: [50697.868005] drbd odoo val-test1-copy: conn( BrokenPipe -> Unconnected ) Jul 3 22:05:36 val-test2-copy kernel: [50697.868010] drbd odoo val-test1-copy: Restarting receiver thread Jul 3 22:05:36 val-test2-copy kernel: [50697.868012] drbd odoo val-test1-copy: conn( Unconnected -> Connecting ) Jul 3 22:05:36 val-test2-copy crm-fence-peer.9.sh[14362]: DRBD_BACKING_DEV_0=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1 DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=BrokenPipe DRBD_LL_DISK=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1 DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=192.168.2.112 DRBD_MY_AF=ipv4 DRBD_MY_NODE_ID=1 DRBD_NODE_ID_0=val-test1-copy DRBD_NODE_ID_1=val-test2-copy DRBD_PEER_ADDRESS=192.168.2.111 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=0 DRBD_RESOURCE=odoo DRBD_VOLUME=0 UP_TO_DATE_NODES =0x00000002 /usr/lib/drbd/crm-fence-peer.9.sh Jul 3 22:05:36 val-test2-copy crm-fence-peer.9.sh[14362]: / Jul 3 22:05:36 val-test2-copy crm-fence-peer.9.sh[14362]: INFO peers are reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-odoo-ms_drbd_odoo' Jul 3 22:05:36 val-test2-copy kernel: [50698.484443] drbd odoo val-test1-copy: helper command: /sbin/drbdadm fence-peer exit code 4 (0x400) Jul 3 22:05:36 val-test2-copy kernel: [50698.484448] drbd odoo val-test1-copy: fence-peer helper returned 4 (peer was fenced) Jul 3 22:05:36 val-test2-copy kernel: [50698.484450] drbd odoo/0 drbd0 val-test1-copy: pdsk( DUnknown -> Outdated ) Jul 3 22:05:36 val-test2-copy kernel: [50698.539142] drbd odoo val-test1-copy: Handshake to peer 0 successful: Agreed network protocol version 118 Jul 3 22:05:36 val-test2-copy kernel: [50698.539144] drbd odoo val-test1-copy: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES. Jul 3 22:05:36 val-test2-copy kernel: [50698.539297] drbd odoo val-test1-copy: Peer authenticated using 20 bytes HMAC Jul 3 22:05:36 val-test2-copy kernel: [50698.539305] drbd odoo val-test1-copy: Starting ack_recv thread (from drbd_r_odoo [1980]) Jul 3 22:05:36 val-test2-copy kernel: [50698.598830] drbd odoo val-test1-copy: Preparing remote state change 3809390519 Jul 3 22:05:36 val-test2-copy kernel: [50698.617380] drbd odoo/0 drbd0 val-test1-copy: drbd_sync_handshake: Jul 3 22:05:36 val-test2-copy kernel: [50698.617383] drbd odoo/0 drbd0 val-test1-copy: self FAD559D0085B1D11:0000000000000000:492AA1512F283D34:03FA3B8A22BE10CC bits:0 flags:120 Jul 3 22:05:36 val-test2-copy kernel: [50698.617384] drbd odoo/0 drbd0 val-test1-copy: peer FAD559D0085B1D10:0000000000000000:2F6D8A307AFA432E:492AA1512F283D34 bits:0 flags:1120 Jul 3 22:05:36 val-test2-copy kernel: [50698.617386] drbd odoo/0 drbd0 val-test1-copy: uuid_compare()=no-sync by rule 38 Jul 3 22:05:36 val-test2-copy kernel: [50698.618018] drbd odoo val-test1-copy: Committing remote state change 3809390519 (primary_nodes=2) Jul 3 22:05:36 val-test2-copy kernel: [50698.618023] drbd odoo val-test1-copy: conn( Connecting -> Connected ) peer( Unknown ->Secondary ) Jul 3 22:05:36 val-test2-copy kernel: [50698.618024] drbd odoo/0 drbd0 val-test1-copy: pdsk( Outdated -> UpToDate ) repl( Off -> Established ) My question is: Why it did not trigger unfence handler after connection was recovered and replication established? The fence constraint stays in pacemaker but 2 nodes are up to date.
_______________________________________________ Star us on GITHUB: https://github.com/LINBIT drbd-user mailing list drbd-user@lists.linbit.com https://lists.linbit.com/mailman/listinfo/drbd-user