Hi Felix, 

Thanks for the suggestion. I have removed all of the target-* parameters; 
however, this behavior is still present when failing over the DRBD device. 
I did notice the following lines in the log when attempting to fail over: 

Jan 30 15:11:15 node2 lrmd: [1440]: info: RA output: 
(p_drbd_mount2:1:notify:stdout) drbdsetup 1 syncer --set-defaults 
--create-device --rate=1000M 
Jan 30 15:11:15 node2 crmd: [1443]: info: send_direct_ack: ACK'ing resource op 
p_drbd_mount2:1_notify_0 from 120:428:0:f84ff0aa-9a17-4b66-954d-8c3011a3441e: 
lrm_invoke-lrmd-1327957875-58 
Jan 30 15:11:15 node2 crmd: [1443]: info: process_lrm_event: LRM operation 
p_drbd_mount2:1_notify_0 (call=79, rc=0, cib-update=0, confirmed=true) ok 
Jan 30 15:11:15 node2 kernel: [ 1075.519415] block drbd0: conn( WFBitMapS -> 
SyncSource ) pdsk( Outdated -> Inconsistent ) 
Jan 30 15:11:15 node2 kernel: [ 1075.519441] block drbd0: Began resync as 
SyncSource (will sync 0 KB [0 bits set]). 
Jan 30 15:11:15 node2 kernel: [ 1075.792404] block drbd1: conn( WFBitMapT -> 
WFSyncUUID ) 
Jan 30 15:11:15 node2 kernel: [ 1075.806756] block drbd1: helper command: 
/sbin/drbdadm before-resync-target minor-1 
Jan 30 15:11:15 node2 kernel: [ 1075.809927] block drbd1: helper command: 
/sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0) 
Jan 30 15:11:15 node2 kernel: [ 1075.809934] block drbd1: conn( WFSyncUUID -> 
SyncTarget ) 
Jan 30 15:11:15 node2 kernel: [ 1075.809942] block drbd1: Began resync as 
SyncTarget (will sync 838514728 KB [209628682 bits set]). 
Jan 30 15:11:16 node2 kernel: [ 1076.060191] block drbd0: Resync done (total 1 
sec; paused 0 sec; 0 K/sec) 
Jan 30 15:11:16 node2 kernel: [ 1076.060200] block drbd0: conn( SyncSource -> 
Connected ) pdsk( Inconsistent -> UpToDate ) 
Jan 30 15:11:16 node2 crmd: [1443]: info: do_lrm_rsc_op: Performing 
key=135:429:0:f84ff0aa-9a17-4b66-954d-8c3011a3441e op=p_drbd_mount2:1_notify_0 
) 
Jan 30 15:11:16 node2 lrmd: [1440]: info: rsc:p_drbd_mount2:1 notify[81] (pid 
6052) 
Jan 30 15:11:16 node2 lrmd: [1440]: info: operation notify[81] on 
p_drbd_mount2:1 for client 1443: pid 6052 exited with return code 0 
Jan 30 15:11:16 node2 crmd: [1443]: info: send_direct_ack: ACK'ing resource op 
p_drbd_mount2:1_notify_0 from 135:429:0:f84ff0aa-9a17-4b66-954d-8c3011a3441e: 
lrm_invoke-lrmd-1327957876-59 
Jan 30 15:11:16 node2 crmd: [1443]: info: process_lrm_event: LRM operation 
p_drbd_mount2:1_notify_0 (call=81, rc=0, cib-update=0, confirmed=true) ok 
Jan 30 15:11:24 node2 attrd: [1442]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: master-p_drbd_mount2:1 (10) 
Jan 30 15:11:24 node2 attrd: [1442]: notice: attrd_perform_update: Sent update 
217: master-p_drbd_mount2:1=10 
Jan 30 15:11:38 node2 kernel: [ 1097.979353] block drbd0: role( Secondary -> 
Primary ) 


Jan 30 15:11:55 node2 lrmd: [1440]: info: RA output: 
(p_drbd_mount1:0:monitor:stderr) lock on /var/lock/drbd-147-0 currently held by 
pid:6164 


Jan 30 15:11:55 node2 mount2_attribute: [6214]: info: Invoked: mount2_attribute 
-N node2 -n master-p_drbd_mount1:0 -l reboot -D 
Jan 30 15:11:55 node2 attrd: [1442]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: master-p_drbd_mount1:0 (<null>) 
Jan 30 15:11:55 node2 attrd: [1442]: notice: attrd_perform_update: Sent delete 
221: node=039e53da-dce8-4fd7-84bc-7261682529e8, attr=master-p_drbd_mount1:0, 
id=<n/a>, set=(null), section=status 
Jan 30 15:11:55 node2 crmd: [1443]: info: process_lrm_event: LRM operation 
p_drbd_mount1:0_monitor_30000 (call=74, rc=7, cib-update=114, confirmed=false) 
not running 
Jan 30 15:11:55 node2 attrd: [1442]: notice: attrd_perform_update: Sent delete 
223: node=039e53da-dce8-4fd7-84bc-7261682529e8, attr=master-p_drbd_mount1:0, 
id=<n/a>, set=(null), section=status 
Jan 30 15:11:55 node2 kernel: [ 1115.843923] block drbd1: role( Secondary -> 
Primary ) 
Jan 30 15:11:55 node2 crmd: [1443]: info: process_lrm_event: LRM operation 
p_drbd_mount2:1_monitor_30000 (call=24, rc=8, cib-update=115, confirmed=false) 
master 
Jan 30 15:11:56 node2 attrd: [1442]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: fail-count-p_drbd_mount1:0 (1) 
Jan 30 15:11:56 node2 crmd: [1443]: info: do_lrm_rsc_op: Performing 
key=108:434:0:f84ff0aa-9a17-4b66-954d-8c3011a3441e op=p_drbd_mount1:0_notify_0 
) 
Jan 30 15:11:56 node2 lrmd: [1440]: info: rsc:p_drbd_mount1:0 notify[82] (pid 
6228) 
Jan 30 15:11:56 node2 attrd: [1442]: notice: attrd_perform_update: Sent update 
226: fail-count-p_drbd_mount1:0=1 
Jan 30 15:11:56 node2 attrd: [1442]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-p_drbd_mount1:0 (1327957916) 
Jan 30 15:11:56 node2 attrd: [1442]: notice: attrd_perform_update: Sent update 
228: last-failure-p_drbd_mount1:0=1327957916 
Jan 30 15:11:56 node2 lrmd: [1440]: info: operation notify[82] on 
p_drbd_mount1:0 for client 1443: pid 6228 exited with return code 0 
Jan 30 15:11:56 node2 crmd: [1443]: info: send_direct_ack: ACK'ing resource op 
p_drbd_mount1:0_notify_0 from 108:434:0:f84ff0aa-9a17-4b66-954d-8c3011a3441e: 
lrm_invoke-lrmd-1327957916-60 
Jan 30 15:11:56 node2 crmd: [1443]: info: process_lrm_event: LRM operation 
p_drbd_mount1:0_notify_0 (call=82, rc=0, cib-update=0, confirmed=true) ok 
Jan 30 15:11:57 node2 attrd: [1442]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: fail-count-p_drbd_mount2:1 (1) 
Jan 30 15:11:57 node2 attrd: [1442]: notice: attrd_perform_update: Sent update 
231: fail-count-p_drbd_mount2:1=1 
Jan 30 15:11:57 node2 attrd: [1442]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-p_drbd_mount2:1 (1327957916) 
Jan 30 15:11:57 node2 attrd: [1442]: notice: attrd_perform_update: Sent update 
234: last-failure-p_drbd_mount2:1=1327957916 


I was unable to trace process 6164, as it had already terminated by the time I 
looked for it. The node simply stays in the Slave role, even though it is the 
only online node. Running a cleanup on the ms_drbd_mount1 resource does not 
resolve the situation; however, I could set the DRBD device to Primary manually 
with drbdadm primary mount1. After doing so, a cleanup of ms_drbd_mount1 worked 
and the cluster resumed bringing up the rest of the resources successfully. 


I also noticed this in the log: 

Jan 30 15:11:59 node2 kernel: [ 1119.882808] block drbd0: role( Primary -> 
Secondary ) 
Jan 30 15:11:59 node2 lrmd: [1440]: info: RA output: 
(p_drbd_mount1:0:stop:stdout) 
Jan 30 15:11:59 node2 kernel: [ 1119.960543] block drbd0: peer( Secondary -> 
Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
Jan 30 15:11:59 node2 kernel: [ 1119.960589] block drbd0: short read expecting 
header on sock: r=-512 
Jan 30 15:11:59 node2 kernel: [ 1119.968719] block drbd0: asender terminated 
Jan 30 15:11:59 node2 kernel: [ 1119.968768] block drbd0: Terminating asender 
thread 
Jan 30 15:12:00 node2 kernel: [ 1119.969386] block drbd0: Connection closed 
Jan 30 15:12:00 node2 kernel: [ 1119.969421] block drbd0: conn( Disconnecting 
-> StandAlone ) 
Jan 30 15:12:00 node2 kernel: [ 1119.969467] block drbd0: receiver terminated 
Jan 30 15:12:00 node2 kernel: [ 1119.969470] block drbd0: Terminating receiver 
thread 
Jan 30 15:12:00 node2 kernel: [ 1119.969634] block drbd0: disk( UpToDate -> 
Diskless ) 
Jan 30 15:12:00 node2 kernel: [ 1119.970702] block drbd0: drbd_bm_resize called 
with capacity == 0 
Jan 30 15:12:00 node2 kernel: [ 1119.971970] block drbd0: worker terminated 
Jan 30 15:12:00 node2 kernel: [ 1119.971973] block drbd0: Terminating worker 
thread 
Jan 30 15:12:00 node2 lrmd: [1440]: info: RA output: 
(p_drbd_mount1:0:stop:stdout) 
Jan 30 15:12:01 node2 mount2_attribute: [6433]: info: Invoked: mount2_attribute 
-N node2 -n master-p_drbd_mount1:0 -l reboot -D 


I'm not sure if the "short read expecting header on sock" error is related to 
why Pacemaker can't promote the resource, but it doesn't seem to stop the 
drbdadm utility from promoting it manually. Is there a Pacemaker utility that 
gives more verbose information on why a resource fails to start? 


What does the digit appended to the resource name indicate, e.g. the :1 in 
p_drbd_mount2:1 or p_drbd_mount2:1_stop_0 below? 

Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] 
p_drbd_mount2:1 (ocf::linbit:drbd): Master node2 (unmanaged) FAILED 
Slaves: [ node1 ] 

Failed actions: 
p_drbd_mount2:1_stop_0 (node=node2, call=94, rc=-2, status=Timed Out): unknown 
exec error 

Thanks, 


Andrew 
----- Original Message -----

From: "Felix Frank" < [email protected] > 
To: "Andrew Martin" < [email protected] > 
Cc: "drbd-user" < [email protected] > 
Sent: Monday, January 30, 2012 3:07:08 AM 
Subject: Re: [DRBD-user] Removing DRBD Kernel Module Blocks 

Hi, 

On 01/29/2012 03:03 PM, Lars Ellenberg wrote: 
>>> I don't think DRBD should attempt to become primary when 
>>> you issue a stop command. 
> And it does not. It is the *peer* that is "refusing" here, 
> hinting to "self" that I should Outdate myself when voluntarily 
> disconnecting from a peer in Primary role. 

D'oh! 

On 01/27/2012 11:19 PM, Andrew Martin wrote: 
> However, shouldn't it have migrated already when that node went offline? 
> How can I see what is preventing the DRBD resource from being promoted? 
... 
> I've attached my configuration (as outputted by crm configure show). 

Hum, difficult. I can't really tell the problem from what you posted. 

From experience: Your config contains lots of "target-role" etc. 
parameters (probably from issuing crm resource start / promote ...). 
I have known some of these to prevent clean failovers under specific 
circumstances. 

As a first step, you may want to try getting rid of all those target-* 
parameters. 
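For instance, something along these lines should work with the crm shell 
(resource names here are just examples taken from your status output; adjust 
them to whatever your config actually defines): 

```shell
# Delete the lingering target-role meta attribute from each master/slave set
# (crmsh syntax; repeat for any other resource carrying a target-* attribute):
crm resource meta ms_drbd_mount1 delete target-role
crm resource meta ms_drbd_mount2 delete target-role

# Verify nothing target-* is left behind:
crm configure show | grep target-
```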

HTH, 
Felix 


_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
