Re: [ClusterLabs] Pacemaker on-fail standby recovery does not start DRBD slave resource

2016-03-30 Thread Sam Gardner
One other note: manually standby-ing and unstandby-ing a node gives the
behavior I want (e.g., after the node is unstandby-ed, the DRBDSlave
resource works).
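
For reference, the manual workaround is just the standard standby toggle (a sketch; the node name is taken from the cluster output below, and whether you use pcs, crmsh, or crm_standby depends on which tools are installed — exact flags vary by version):

```sh
# Put the node into standby: all resources migrate away
pcs cluster standby ha-d1.tw.com

# Bring it back online; in our case the DRBDSlave instance
# then starts again on the recovered node
pcs cluster unstandby ha-d1.tw.com
```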
--
Sam Gardner
Trustwave | SMART SECURITY ON DEMAND


On 3/30/16, 11:46 AM, "Ken Gaillot"  wrote:

>On 03/30/2016 11:20 AM, Sam Gardner wrote:
>> I have configured some network resources to automatically standby their
>>node if the system detects a failure on them. However, the DRBD slave
>>that I have configured does not automatically restart after the node is
>>"unstandby-ed" after the failure-timeout expires.
>> Is there any way to make the "stopped" DRBDSlave resource automatically
>>start again once the node is recovered?
>>
>> See the progression of events below:
>>
>> Running cluster:
>> Wed Mar 30 16:04:20 UTC 2016
>> Cluster name:
>> Last updated: Wed Mar 30 16:04:20 2016
>> Last change: Wed Mar 30 16:03:24 2016
>> Stack: classic openais (with plugin)
>> Current DC: ha-d1.tw.com - partition with quorum
>> Version: 1.1.12-561c4cf
>> 2 Nodes configured, 2 expected votes
>> 7 Resources configured
>>
>>
>> Online: [ ha-d1.tw.com ha-d2.tw.com ]
>>
>> Full list of resources:
>>
>>  Resource Group: network
>>  inif   (ocf::custom:ip.sh):   Started ha-d1.tw.com
>>  outif  (ocf::custom:ip.sh):   Started ha-d1.tw.com
>>  dmz1   (ocf::custom:ip.sh):   Started ha-d1.tw.com
>>  Master/Slave Set: DRBDMaster [DRBDSlave]
>>  Masters: [ ha-d1.tw.com ]
>>  Slaves: [ ha-d2.tw.com ]
>>  Resource Group: filesystem
>>  DRBDFS (ocf::heartbeat:Filesystem):Started ha-d1.tw.com
>>  Resource Group: application
>>  service_failover   (ocf::custom:service_failover):Started ha-d1.tw.com
>>
>>
>> version: 8.4.5 (api:1/proto:86-101)
>> srcversion: 315FB2BBD4B521D13C20BF4
>>
>>  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-
>> ns:4 nr:0 dw:4 dr:757 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>> [153766.565352] block drbd1: send bitmap stats [Bytes(packets)]: plain
>>0(0), RLE 21(1), total 21; compression: 100.0%
>> [153766.568303] block drbd1: receive bitmap stats [Bytes(packets)]:
>>plain 0(0), RLE 21(1), total 21; compression: 100.0%
>> [153766.568316] block drbd1: helper command: /sbin/drbdadm
>>before-resync-source minor-1
>> [153766.568356] block drbd1: helper command: /sbin/drbdadm
>>before-resync-source minor-1 exit code 255 (0xfffe)
>> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk(
>>Consistent -> Inconsistent )
>> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB
>>[1 bits set]).
>> [153766.568444] block drbd1: updated sync UUID
>>B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
>> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4
>>K/sec)
>> [153766.577700] block drbd1: updated UUIDs
>>B0DA745C79C56591::36E0631B6F022952:36DF631B6F022952
>> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk(
>>Inconsistent -> UpToDate )
>>
>> Failure detected:
>> Wed Mar 30 16:08:22 UTC 2016
>> Cluster name:
>> Last updated: Wed Mar 30 16:08:22 2016
>> Last change: Wed Mar 30 16:03:24 2016
>> Stack: classic openais (with plugin)
>> Current DC: ha-d1.tw.com - partition with quorum
>> Version: 1.1.12-561c4cf
>> 2 Nodes configured, 2 expected votes
>> 7 Resources configured
>>
>>
>> Node ha-d1.tw.com: standby (on-fail)
>> Online: [ ha-d2.tw.com ]
>>
>> Full list of resources:
>>
>>  Resource Group: network
>>  inif   (ocf::custom:ip.sh):   Started ha-d1.tw.com
>>  outif  (ocf::custom:ip.sh):   Started ha-d1.tw.com

Re: [ClusterLabs] Pacemaker on-fail standby recovery does not start DRBD slave resource

2016-03-30 Thread Ken Gaillot
On 03/30/2016 11:20 AM, Sam Gardner wrote:
> I have configured some network resources to automatically standby their node 
> if the system detects a failure on them. However, the DRBD slave that I have 
> configured does not automatically restart after the node is "unstandby-ed" 
> after the failure-timeout expires.
> Is there any way to make the "stopped" DRBDSlave resource automatically start 
> again once the node is recovered?
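
[The setup being described corresponds roughly to the following configuration; this is an illustrative crm-shell sketch, not the poster's actual config — the resource definition, interval, and timeout values here are assumptions:]

```sh
# A monitor failure on the network resource puts its whole node in standby
crm configure primitive dmz1 ocf:custom:ip.sh \
    op monitor interval=7s on-fail=standby \
    meta failure-timeout=120s

# The failure only expires on a recheck; cluster-recheck-interval
# bounds how long after failure-timeout the node leaves standby
crm configure property cluster-recheck-interval=60s
```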
> 
> See the progression of events below:
> 
> Running cluster:
> Wed Mar 30 16:04:20 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:04:20 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
> 
> 
> Online: [ ha-d1.tw.com ha-d2.tw.com ]
> 
> Full list of resources:
> 
>  Resource Group: network
>  inif   (ocf::custom:ip.sh):   Started ha-d1.tw.com
>  outif  (ocf::custom:ip.sh):   Started ha-d1.tw.com
>  dmz1   (ocf::custom:ip.sh):   Started ha-d1.tw.com
>  Master/Slave Set: DRBDMaster [DRBDSlave]
>  Masters: [ ha-d1.tw.com ]
>  Slaves: [ ha-d2.tw.com ]
>  Resource Group: filesystem
>  DRBDFS (ocf::heartbeat:Filesystem):Started ha-d1.tw.com
>  Resource Group: application
>  service_failover   (ocf::custom:service_failover):Started 
> ha-d1.tw.com
> 
> 
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
> 
>  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-
> ns:4 nr:0 dw:4 dr:757 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> [153766.565352] block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), 
> RLE 21(1), total 21; compression: 100.0%
> [153766.568303] block drbd1: receive bitmap stats [Bytes(packets)]: plain 
> 0(0), RLE 21(1), total 21; compression: 100.0%
> [153766.568316] block drbd1: helper command: /sbin/drbdadm 
> before-resync-source minor-1
> [153766.568356] block drbd1: helper command: /sbin/drbdadm 
> before-resync-source minor-1 exit code 255 (0xfffe)
> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent 
> -> Inconsistent )
> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB [1 
> bits set]).
> [153766.568444] block drbd1: updated sync UUID 
> B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> [153766.577700] block drbd1: updated UUIDs 
> B0DA745C79C56591::36E0631B6F022952:36DF631B6F022952
> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk( 
> Inconsistent -> UpToDate )
> 
> Failure detected:
> Wed Mar 30 16:08:22 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:08:22 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
> 
> 
> Node ha-d1.tw.com: standby (on-fail)
> Online: [ ha-d2.tw.com ]
> 
> Full list of resources:
> 
>  Resource Group: network
>  inif   (ocf::custom:ip.sh):   Started ha-d1.tw.com
>  outif  (ocf::custom:ip.sh):   Started ha-d1.tw.com
>  dmz1   (ocf::custom:ip.sh):   FAILED ha-d1.tw.com
>  Master/Slave Set: DRBDMaster [DRBDSlave]
>  Masters: [ ha-d1.tw.com ]
>  Slaves: [ ha-d2.tw.com ]
>  Resource Group: filesystem
>  DRBDFS (ocf::heartbeat:Filesystem):Started ha-d1.tw.com
>  Resource Group: application
>  service_failover   (ocf::custom:service_failover):Started 
> ha-d1.tw.com
> 
> Failed actions:
> dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7): call=156, 
> status=complete, last-rc-change='Wed Mar 30 16:08:19 2016', queued=0ms, 
> exec=0ms
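
[A failed action record like the one above can also be cleared by hand instead of waiting for failure-timeout to expire; a sketch using the resource and node names from this output — exact flags vary between Pacemaker versions:]

```sh
# Clear the failure record so the node can leave standby immediately
crm_resource --cleanup --resource dmz1 --node ha-d1.tw.com

# Inspect the remaining fail count for the resource, if any
crm_failcount -G -r dmz1 -N ha-d1.tw.com
```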
> 
> 
> 
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
> 
>  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-
> ns:4 nr:0 dw:4 dr:765 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> [153766.568356] block drbd1: helper command: /sbin/drbdadm 
> before-resync-source minor-1 exit code 255 (0xfffe)
> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent 
> -> Inconsistent )
> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB [1 
> bits set]).
> [153766.568444] block drbd1: updated sync UUID 
> B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> [153766.577700] block drbd1: updated UUIDs 
> B0DA745C79C56591::36E0631B6F022952:36DF631B6F022952
> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk( 
> Inconsistent -> UpToDate )
> [154057.455270] e1000: eth2 NIC Link is Down
> [154057.455451] e1000 :02:02.0 eth2: Reset adapter
> 
> Failover complete:
> Wed Mar