I have configured some network resources to put their node into standby automatically if the system detects a failure on them. However, the DRBD slave I have configured does not automatically restart once the node comes back out of standby when the failure-timeout expires. Is there any way to make the "stopped" DRBDSlave resource start again automatically once the node has recovered?
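For context, this is the kind of configuration I mean. The commands below are illustrative only (pcs syntax; `DRBDSlave` is the resource name from my setup, and the timeout values are examples, not what I necessarily run):

```shell
# failure-timeout lets a resource's failcount expire automatically,
# so Pacemaker will consider starting it again after the window passes:
pcs resource meta DRBDSlave failure-timeout=60s

# The timeout is only evaluated when the policy engine runs, so a
# periodic recheck is needed for the expiry to take effect without
# waiting for some other cluster event:
pcs property set cluster-recheck-interval=2min

# Manual workaround after the node leaves standby: clear the failure
# history so Pacemaker schedules a start for the slave again.
pcs resource cleanup DRBDSlave
```

Running the cleanup by hand does work, but I am looking for a way to have this happen without operator intervention.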
See the progression of events below.

Running cluster:

Wed Mar 30 16:04:20 UTC 2016

Cluster name:
Last updated: Wed Mar 30 16:04:20 2016
Last change: Wed Mar 30 16:03:24 2016
Stack: classic openais (with plugin)
Current DC: ha-d1.tw.com - partition with quorum
Version: 1.1.12-561c4cf
2 Nodes configured, 2 expected votes
7 Resources configured

Online: [ ha-d1.tw.com ha-d2.tw.com ]

Full list of resources:

 Resource Group: network
     inif	(ocf::custom:ip.sh):	Started ha-d1.tw.com
     outif	(ocf::custom:ip.sh):	Started ha-d1.tw.com
     dmz1	(ocf::custom:ip.sh):	Started ha-d1.tw.com
 Master/Slave Set: DRBDMaster [DRBDSlave]
     Masters: [ ha-d1.tw.com ]
     Slaves: [ ha-d2.tw.com ]
 Resource Group: filesystem
     DRBDFS	(ocf::heartbeat:Filesystem):	Started ha-d1.tw.com
 Resource Group: application
     service_failover	(ocf::custom:service_failover):	Started ha-d1.tw.com

version: 8.4.5 (api:1/proto:86-101)
srcversion: 315FB2BBD4B521D13C20BF4
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:4 nr:0 dw:4 dr:757 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

[153766.565352] block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
[153766.568303] block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
[153766.568316] block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
[153766.568356] block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 255 (0xfffffffe)
[153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
[153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB [1 bits set]).
[153766.568444] block drbd1: updated sync UUID B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
[153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
[153766.577700] block drbd1: updated UUIDs B0DA745C79C56591:0000000000000000:36E0631B6F022952:36DF631B6F022952
[153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )

Failure detected:

Wed Mar 30 16:08:22 UTC 2016

Cluster name:
Last updated: Wed Mar 30 16:08:22 2016
Last change: Wed Mar 30 16:03:24 2016
Stack: classic openais (with plugin)
Current DC: ha-d1.tw.com - partition with quorum
Version: 1.1.12-561c4cf
2 Nodes configured, 2 expected votes
7 Resources configured

Node ha-d1.tw.com: standby (on-fail)
Online: [ ha-d2.tw.com ]

Full list of resources:

 Resource Group: network
     inif	(ocf::custom:ip.sh):	Started ha-d1.tw.com
     outif	(ocf::custom:ip.sh):	Started ha-d1.tw.com
     dmz1	(ocf::custom:ip.sh):	FAILED ha-d1.tw.com
 Master/Slave Set: DRBDMaster [DRBDSlave]
     Masters: [ ha-d1.tw.com ]
     Slaves: [ ha-d2.tw.com ]
 Resource Group: filesystem
     DRBDFS	(ocf::heartbeat:Filesystem):	Started ha-d1.tw.com
 Resource Group: application
     service_failover	(ocf::custom:service_failover):	Started ha-d1.tw.com

Failed actions:
    dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7): call=156, status=complete, last-rc-change='Wed Mar 30 16:08:19 2016', queued=0ms, exec=0ms

version: 8.4.5 (api:1/proto:86-101)
srcversion: 315FB2BBD4B521D13C20BF4
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:4 nr:0 dw:4 dr:765 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

[153766.568356] block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 255 (0xfffffffe)
[153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
[153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB [1 bits set]).
[153766.568444] block drbd1: updated sync UUID B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
[153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
[153766.577700] block drbd1: updated UUIDs B0DA745C79C56591:0000000000000000:36E0631B6F022952:36DF631B6F022952
[153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
[154057.455270] e1000: eth2 NIC Link is Down
[154057.455451] e1000 0000:02:02.0 eth2: Reset adapter

Failover complete:

Wed Mar 30 16:09:02 UTC 2016

Cluster name:
Last updated: Wed Mar 30 16:09:02 2016
Last change: Wed Mar 30 16:03:24 2016
Stack: classic openais (with plugin)
Current DC: ha-d1.tw.com - partition with quorum
Version: 1.1.12-561c4cf
2 Nodes configured, 2 expected votes
7 Resources configured

Node ha-d1.tw.com: standby (on-fail)
Online: [ ha-d2.tw.com ]

Full list of resources:

 Resource Group: network
     inif	(ocf::custom:ip.sh):	Started ha-d2.tw.com
     outif	(ocf::custom:ip.sh):	Started ha-d2.tw.com
     dmz1	(ocf::custom:ip.sh):	Started ha-d2.tw.com
 Master/Slave Set: DRBDMaster [DRBDSlave]
     Masters: [ ha-d2.tw.com ]
     Stopped: [ ha-d1.tw.com ]
 Resource Group: filesystem
     DRBDFS	(ocf::heartbeat:Filesystem):	Started ha-d2.tw.com
 Resource Group: application
     service_failover	(ocf::custom:service_failover):	Started ha-d2.tw.com

Failed actions:
    dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7): call=156, status=complete, last-rc-change='Wed Mar 30 16:08:19 2016', queued=0ms, exec=0ms

version: 8.4.5 (api:1/proto:86-101)
srcversion: 315FB2BBD4B521D13C20BF4

[154094.894524] drbd wwwdata: conn( Disconnecting -> StandAlone )
[154094.894525] drbd wwwdata: receiver terminated
[154094.894527] drbd wwwdata: Terminating drbd_r_wwwdata
[154094.894559] block drbd1: disk( UpToDate -> Failed )
[154094.894569] block drbd1: bitmap WRITE of 0 pages took 0 jiffies
[154094.894571] block drbd1: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
[154094.894574] block drbd1: disk( Failed -> Diskless )
[154094.894647] block drbd1: drbd_bm_resize called with capacity == 0
[154094.894652] drbd wwwdata: Terminating drbd_w_wwwdata

Standby node recovered, with DRBDSlave stopped (I want DRBDSlave started here):

Wed Mar 30 16:13:01 UTC 2016

Cluster name:
Last updated: Wed Mar 30 16:13:01 2016
Last change: Wed Mar 30 16:03:24 2016
Stack: classic openais (with plugin)
Current DC: ha-d1.tw.com - partition with quorum
Version: 1.1.12-561c4cf
2 Nodes configured, 2 expected votes
7 Resources configured

Online: [ ha-d1.tw.com ha-d2.tw.com ]

Full list of resources:

 Resource Group: network
     inif	(ocf::custom:ip.sh):	Started ha-d2.tw.com
     outif	(ocf::custom:ip.sh):	Started ha-d2.tw.com
     dmz1	(ocf::custom:ip.sh):	Started ha-d2.tw.com
 Master/Slave Set: DRBDMaster [DRBDSlave]
     Masters: [ ha-d2.tw.com ]
     Stopped: [ ha-d1.tw.com ]
 Resource Group: filesystem
     DRBDFS	(ocf::heartbeat:Filesystem):	Started ha-d2.tw.com
 Resource Group: application
     service_failover	(ocf::custom:service_failover):	Started ha-d2.tw.com

version: 8.4.5 (api:1/proto:86-101)
srcversion: 315FB2BBD4B521D13C20BF4

[154094.894574] block drbd1: disk( Failed -> Diskless )
[154094.894647] block drbd1: drbd_bm_resize called with capacity == 0
[154094.894652] drbd wwwdata: Terminating drbd_w_wwwdata

--
Sam Gardner
Trustwave | SMART SECURITY ON DEMAND
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org