Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10
Hi Lars, hi all,

we took the time and tested drbd 8.4.4-rc in our problematic scenario. We were able to reproduce the promote error regularly with drbd 8.4.3; after installing 8.4.4-rc we could no longer trigger it. So, as far as the changes made to work around the known race condition are concerned, 8.4.4-rc seems to work. We did not look at other aspects of the new version; if you know of anything else we should test, let us know.

Best regards
Andreas

-Original Message-
From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Lars Ellenberg
Sent: Tuesday, 10 September 2013 14:10
To: linux...@lists.linux-ha.org; pacemaker@oss.clusterlabs.org
Subject: Re: [Linux-HA] [Pacemaker] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10

On Mon, Sep 09, 2013 at 01:41:17PM +0200, Andreas Mock wrote:
> Hi Lars,
>
> here also my official "Thank you very much" for looking at the problem.
> I've been looking forward to the official release of drbd 8.4.4.
>
> Or do you need disoriented rc testers like me? ;-)

Why not? That's what release candidates are intended for. You'd only have to confirm that it works for you now - or that it still does not, in which case you'd better report it now rather than after the release, right?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Linux-HA mailing list
linux...@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10
On Mon, Sep 09, 2013 at 01:41:17PM +0200, Andreas Mock wrote:
> Hi Lars,
>
> here also my official "Thank you very much" for looking at the problem.
> I've been looking forward to the official release of drbd 8.4.4.
>
> Or do you need disoriented rc testers like me? ;-)

Why not? That's what release candidates are intended for. You'd only have to confirm that it works for you now - or that it still does not, in which case you'd better report it now rather than after the release, right?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10
Hi Lars,

here also my official "Thank you very much" for looking at the problem. Thank you also for writing a summary that - coming from your knowledgeable insider standpoint - is much better than anything I could have produced while still trying to understand all the details you presented here and in our off-list communication. Such a post also gains much more value for the list archive when it is sent by a well-known drbd, HA and pacemaker contributor like you.

I've been looking forward to the official release of drbd 8.4.4.

Or do you need disoriented rc testers like me? ;-)

Best regards
Andreas Mock

-Original Message-
From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Lars Ellenberg
Sent: Monday, 9 September 2013 12:21
To: linux...@lists.linux-ha.org; pacemaker@oss.clusterlabs.org
Subject: Re: [Linux-HA] [Pacemaker] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10

On Mon, Sep 09, 2013 at 02:42:45PM +1000, Andrew Beekhof wrote:
> On 06/09/2013, at 5:51 PM, Lars Ellenberg wrote:
> > On Tue, Aug 27, 2013 at 06:51:45AM +0200, Andreas Mock wrote:
> >> Hi Andrew,
> >>
> >> as this is a real showstopper at the moment, I invested some more
> >> hours to be sure (as far as possible) of not having made an error.
> >>
> >> Some additions:
> >> 1) I mirrored the whole mini drbd config to another pacemaker cluster.
> >>    Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not.
> >> 2) When I remove the target role Stopped from the drbd ms resource
> >>    and insert the config snippet related to the drbd device via crm -f
> >>    into a lean running pacemaker config (pacemaker cluster options,
> >>    stonith resources), it seems to work. That means one of the nodes
> >>    gets promoted.
> >>
> >> Then after stopping ('crm resource stop ms_drbd_xxx') and starting
> >> again, I see the same promotion error as described.
> >>
> >> The drbd resource agent is using /usr/sbin/crm_master.
> >> Is there a possibility that feedback given through this client tool
> >> is changing the timing behaviour of pacemaker? Or the way
> >> transitions are scheduled?
> >> Any idea that may be related to a change in pacemaker?
> >
> > I think that recent pacemaker allows for "start" and "promote" in the
> > same transition.
>
> At least in the one case I saw logs of, this wasn't the case.
> The PE computed:
>
> Current cluster status:
> Online: [ db05 db06 ]
>
> r_stonith-db05  (stonith:fence_imm):  Started db06
> r_stonith-db06  (stonith:fence_imm):  Started db05
> Master/Slave Set: ms_drbd_fodb [r_drbd_fodb]
>     Slaves: [ db05 db06 ]
> Master/Slave Set: ms_drbd_fodblog [r_drbd_fodblog]
>     Slaves: [ db05 db06 ]
>
> Transition Summary:
> * Promote r_drbd_fodb:0     (Slave -> Master db05)
> * Promote r_drbd_fodblog:0  (Slave -> Master db05)
>
> and it was the promotion of r_drbd_fodb:0 that failed.

Right.

Off-list communication revealed that DRBD came up as "Consistent" only, which is a normal and expected state when using resource-level fencing. The promotion attempt then raced with the connection handshake: the DRBD fence-peer handler is run (because the data is only Consistent, not UpToDate) and returns successfully, but due to that race its result is ignored, and DRBD stays "only Consistent", which is not good enough to be promoted ("need access to UpToDate data"). Once the handshake is done, that also results in "access to good data", which is why the next promotion attempt succeeds.

Something in the timing of pacemaker actions has changed between the affected and unaffected versions. Apparently there used to be enough time to complete the connection handshake before the promote request was made.

This race is fixed with DRBD 8.3.16 and 8.4.4 (currently rc1).

You can also avoid the race by not allowing Pacemaker to promote while DRBD is only "Consistent". Pacemaker will only attempt promotion if there is a positive master score for the resource. The ocf:linbit:drbd RA hardcodes the master score for "Consistent" to 5, so you may edit the RA and remove the master score for the "only Consistent" case. (The fixed DRBD versions mentioned above also introduce a new "adjust_master_score" parameter, which makes this configurable.)

Or you can add a location constraint like this:

    location no-master-if-only-consistent ms_drbd_XY \
        rule $role="Master" -10: defined #uname

where "defined #uname" is a funny way to express "true": this constraint reduces the resulting master score by 10, always and anywhere. If you have other $role=Master constraints, you may need to play with the scores to achieve the desired outcome.

> > I suspect you would not be able to reproduce by:
> > crm resource stop ms_drbd
> > crm resource demote ms_drbd    (will only make drbd Secondary stuff)
> >    ... meanwhile, DRBD will establish the connection ...
> > crm resource promote ms_drbd   (will then promote one node)

By first allowing DRBD to do the handshake in Secondary/Secondary, and only later allowing it to promote, this sequence also avoids the race.
Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10
On Mon, Sep 09, 2013 at 02:42:45PM +1000, Andrew Beekhof wrote:
> On 06/09/2013, at 5:51 PM, Lars Ellenberg wrote:
> > On Tue, Aug 27, 2013 at 06:51:45AM +0200, Andreas Mock wrote:
> >> Hi Andrew,
> >>
> >> as this is a real showstopper at the moment, I invested some more
> >> hours to be sure (as far as possible) of not having made an error.
> >>
> >> Some additions:
> >> 1) I mirrored the whole mini drbd config to another pacemaker cluster.
> >>    Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not.
> >> 2) When I remove the target role Stopped from the drbd ms resource
> >>    and insert the config snippet related to the drbd device via crm -f
> >>    into a lean running pacemaker config (pacemaker cluster options,
> >>    stonith resources), it seems to work. That means one of the nodes
> >>    gets promoted.
> >>
> >> Then after stopping ('crm resource stop ms_drbd_xxx') and starting
> >> again, I see the same promotion error as described.
> >>
> >> The drbd resource agent is using /usr/sbin/crm_master.
> >> Is there a possibility that feedback given through this client tool
> >> is changing the timing behaviour of pacemaker? Or the way
> >> transitions are scheduled?
> >> Any idea that may be related to a change in pacemaker?
> >
> > I think that recent pacemaker allows for "start" and "promote" in the
> > same transition.
>
> At least in the one case I saw logs of, this wasn't the case.
> The PE computed:
>
> Current cluster status:
> Online: [ db05 db06 ]
>
> r_stonith-db05  (stonith:fence_imm):  Started db06
> r_stonith-db06  (stonith:fence_imm):  Started db05
> Master/Slave Set: ms_drbd_fodb [r_drbd_fodb]
>     Slaves: [ db05 db06 ]
> Master/Slave Set: ms_drbd_fodblog [r_drbd_fodblog]
>     Slaves: [ db05 db06 ]
>
> Transition Summary:
> * Promote r_drbd_fodb:0     (Slave -> Master db05)
> * Promote r_drbd_fodblog:0  (Slave -> Master db05)
>
> and it was the promotion of r_drbd_fodb:0 that failed.

Right.

Off-list communication revealed that DRBD came up as "Consistent" only, which is a normal and expected state when using resource-level fencing. The promotion attempt then raced with the connection handshake: the DRBD fence-peer handler is run (because the data is only Consistent, not UpToDate) and returns successfully, but due to that race its result is ignored, and DRBD stays "only Consistent", which is not good enough to be promoted ("need access to UpToDate data"). Once the handshake is done, that also results in "access to good data", which is why the next promotion attempt succeeds.

Something in the timing of pacemaker actions has changed between the affected and unaffected versions. Apparently there used to be enough time to complete the connection handshake before the promote request was made.

This race is fixed with DRBD 8.3.16 and 8.4.4 (currently rc1).

You can also avoid the race by not allowing Pacemaker to promote while DRBD is only "Consistent". Pacemaker will only attempt promotion if there is a positive master score for the resource. The ocf:linbit:drbd RA hardcodes the master score for "Consistent" to 5, so you may edit the RA and remove the master score for the "only Consistent" case. (The fixed DRBD versions mentioned above also introduce a new "adjust_master_score" parameter, which makes this configurable.)

Or you can add a location constraint like this:

    location no-master-if-only-consistent ms_drbd_XY \
        rule $role="Master" -10: defined #uname

where "defined #uname" is a funny way to express "true": this constraint reduces the resulting master score by 10, always and anywhere. If you have other $role=Master constraints, you may need to play with the scores to achieve the desired outcome.

> > I suspect you would not be able to reproduce by:
> > crm resource stop ms_drbd
> > crm resource demote ms_drbd    (will only make drbd Secondary stuff)
> >    ... meanwhile, DRBD will establish the connection ...
> > crm resource promote ms_drbd   (will then promote one node)

By first allowing DRBD to do the handshake in Secondary/Secondary, and only later allowing it to promote, this sequence also avoids the race.

Cheers,
Lars

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
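The distinction Lars draws between "Consistent" and "UpToDate" can be sketched as a small shell helper that inspects a /proc/drbd style status line. The helper name and the exact checks are illustrative assumptions for this sketch, not part of the ocf:linbit:drbd agent (which communicates readiness via crm_master scores instead):

```shell
# Illustrative sketch only: decide whether promotion would be safe,
# given a /proc/drbd status line (drbd 8.x format assumed).
drbd_safe_to_promote() {
    # Promotion needs an established connection (cs:Connected) and
    # locally up-to-date data (ds:UpToDate...), not merely Consistent.
    echo "$1" | grep -q 'cs:Connected' && echo "$1" | grep -q 'ds:UpToDate'
}

# A connected, up-to-date device would be safe to promote:
drbd_safe_to_promote \
  " 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----" \
  && echo "safe to promote"

# A device still waiting for the handshake is not:
drbd_safe_to_promote \
  " 0: cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown C r-----" \
  || echo "not safe yet"
```

This mirrors the race described above: during the handshake the line shows only Consistent data, so a promote issued at that moment fails.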
Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10
On 06/09/2013, at 5:51 PM, Lars Ellenberg wrote:
> On Tue, Aug 27, 2013 at 06:51:45AM +0200, Andreas Mock wrote:
>> Hi Andrew,
>>
>> as this is a real showstopper at the moment, I invested some more
>> hours to be sure (as far as possible) of not having made an error.
>>
>> Some additions:
>> 1) I mirrored the whole mini drbd config to another pacemaker cluster.
>>    Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not.
>> 2) When I remove the target role Stopped from the drbd ms resource
>>    and insert the config snippet related to the drbd device via crm -f
>>    into a lean running pacemaker config (pacemaker cluster options,
>>    stonith resources), it seems to work. That means one of the nodes
>>    gets promoted.
>>
>> Then after stopping ('crm resource stop ms_drbd_xxx') and starting
>> again, I see the same promotion error as described.
>>
>> The drbd resource agent is using /usr/sbin/crm_master.
>> Is there a possibility that feedback given through this client tool
>> is changing the timing behaviour of pacemaker? Or the way
>> transitions are scheduled?
>> Any idea that may be related to a change in pacemaker?
>
> I think that recent pacemaker allows for "start" and "promote" in the
> same transition.

At least in the one case I saw logs of, this wasn't the case. The PE computed:

Current cluster status:
Online: [ db05 db06 ]

r_stonith-db05  (stonith:fence_imm):  Started db06
r_stonith-db06  (stonith:fence_imm):  Started db05
Master/Slave Set: ms_drbd_fodb [r_drbd_fodb]
    Slaves: [ db05 db06 ]
Master/Slave Set: ms_drbd_fodblog [r_drbd_fodblog]
    Slaves: [ db05 db06 ]

Transition Summary:
* Promote r_drbd_fodb:0     (Slave -> Master db05)
* Promote r_drbd_fodblog:0  (Slave -> Master db05)

and it was the promotion of r_drbd_fodb:0 that failed.

> Or at least has promote follow start much faster than before.
>
> To be able to really tell why DRBD has problems to promote,
> I'd need the drbd (kernel!) logs; the agent logs are not good enough.
>
> Maybe you hit some interesting race between DRBD establishing the
> connection and pacemaker trying to promote one of the nodes.
> Correlating DRBD kernel logs with pacemaker logs should tell.
>
> Do you have DRBD resource level fencing enabled?
>
> I guess right now you can reproduce it easily by:
> crm resource stop ms_drbd
> crm resource start ms_drbd
>
> I suspect you would not be able to reproduce by:
> crm resource stop ms_drbd
> crm resource demote ms_drbd    (will only make drbd Secondary stuff)
>    ... meanwhile, DRBD will establish the connection ...
> crm resource promote ms_drbd   (will then promote one node)
>
> Hth,
>
> Lars
>
>> -Original Message-
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Tuesday, 27 August 2013 05:02
>> To: General Linux-HA mailing list
>> Cc: pacemaker@oss.clusterlabs.org
>> Subject: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit
>> drbd agent between pacemaker 1.1.8 and 1.1.10
>>
>>
>> On 27/08/2013, at 3:31 AM, Andreas Mock wrote:
>>
>>> Hi all,
>>>
>>> while the linbit drbd resource agent seems to work perfectly on
>>> pacemaker 1.1.8 (standard software repository), we have problems with
>>> the latest release 1.1.10 and also with the newest head 1.1.11.xxx.
>>>
>>> As using drbd is not so uncommon, I really hope to find interested
>>> people to help me out. I can provide as much debug information as you
>>> want.
>>>
>>> Environment:
>>> RHEL 6.4 clone (Scientific Linux 6.4), cman based cluster.
>>> DRBD 8.4.3 compiled from sources.
>>> 64bit
>>>
>>> - A drbd resource configured following the linbit documentation.
>>> - Manual start and stop (up/down) and setting primary of the drbd
>>>   resource work smoothly.
>>> - 2 nodes dis03-test/dis04-test
>>>
>>> - The following simple config on pacemaker 1.1.8:
>>> configure
>>>     property no-quorum-policy=stop
>>>     property stonith-enabled=true
>>>     rsc_defaults resource-stickiness=2
>>>     primitive r_stonith-dis03-test stonith:fence_mock \
>>>         meta resource-stickiness="INFINITY" target-role="Started" \
>>>         op monitor interval="180" timeout="300" requires="nothing" \
>>>         op start interval="0" timeout="300" \
>>>         op stop interval="0" timeout="300" \
Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10
On Tue, Aug 27, 2013 at 06:51:45AM +0200, Andreas Mock wrote:
> Hi Andrew,
>
> as this is a real showstopper at the moment, I invested some more
> hours to be sure (as far as possible) of not having made an error.
>
> Some additions:
> 1) I mirrored the whole mini drbd config to another pacemaker cluster.
>    Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not.
> 2) When I remove the target role Stopped from the drbd ms resource
>    and insert the config snippet related to the drbd device via crm -f
>    into a lean running pacemaker config (pacemaker cluster options,
>    stonith resources), it seems to work. That means one of the nodes
>    gets promoted.
>
> Then after stopping ('crm resource stop ms_drbd_xxx') and starting
> again, I see the same promotion error as described.
>
> The drbd resource agent is using /usr/sbin/crm_master.
> Is there a possibility that feedback given through this client tool
> is changing the timing behaviour of pacemaker? Or the way
> transitions are scheduled?
> Any idea that may be related to a change in pacemaker?

I think that recent pacemaker allows for "start" and "promote" in the same transition. Or at least has promote follow start much faster than before.

To be able to really tell why DRBD has problems to promote, I'd need the drbd (kernel!) logs; the agent logs are not good enough.

Maybe you hit some interesting race between DRBD establishing the connection and pacemaker trying to promote one of the nodes. Correlating DRBD kernel logs with pacemaker logs should tell.

Do you have DRBD resource level fencing enabled?

I guess right now you can reproduce it easily by:
    crm resource stop ms_drbd
    crm resource start ms_drbd

I suspect you would not be able to reproduce by:
    crm resource stop ms_drbd
    crm resource demote ms_drbd    (will only make drbd Secondary stuff)
       ... meanwhile, DRBD will establish the connection ...
    crm resource promote ms_drbd   (will then promote one node)

Hth,

Lars

> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, 27 August 2013 05:02
> To: General Linux-HA mailing list
> Cc: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit
> drbd agent between pacemaker 1.1.8 and 1.1.10
>
>
> On 27/08/2013, at 3:31 AM, Andreas Mock wrote:
>
> > Hi all,
> >
> > while the linbit drbd resource agent seems to work perfectly on
> > pacemaker 1.1.8 (standard software repository), we have problems with
> > the latest release 1.1.10 and also with the newest head 1.1.11.xxx.
> >
> > As using drbd is not so uncommon, I really hope to find interested
> > people to help me out. I can provide as much debug information as you
> > want.
> >
> > Environment:
> > RHEL 6.4 clone (Scientific Linux 6.4), cman based cluster.
> > DRBD 8.4.3 compiled from sources.
> > 64bit
> >
> > - A drbd resource configured following the linbit documentation.
> > - Manual start and stop (up/down) and setting primary of the drbd
> >   resource work smoothly.
> > - 2 nodes dis03-test/dis04-test
> >
> > - The following simple config on pacemaker 1.1.8:
> > configure
> >     property no-quorum-policy=stop
> >     property stonith-enabled=true
> >     rsc_defaults resource-stickiness=2
> >     primitive r_stonith-dis03-test stonith:fence_mock \
> >         meta resource-stickiness="INFINITY" target-role="Started" \
> >         op monitor interval="180" timeout="300" requires="nothing" \
> >         op start interval="0" timeout="300" \
> >         op stop interval="0" timeout="300" \
> >         params vmname=dis03-test pcmk_host_list="dis03-test"
> >     primitive r_stonith-dis04-test stonith:fence_mock \
> >         meta resource-stickiness="INFINITY" target-role="Started" \
> >         op monitor interval="180" timeout="300" requires="nothing" \
> >         op start interval="0" timeout="300" \
> >         op stop interval="0" timeout="300" \
> >         params vmname=dis04-test pcmk_host_list="dis04-test"
> >     location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
> >         rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
> >     location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
> >         rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
> >     primitive r_drbd_postfix
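Lars's "would not be able to reproduce" sequence from the message above can be sketched as a small script. This is an operational sketch under assumptions: the resource name ms_drbd comes from the mail, but the guard and the wait loop on /proc/drbd are illustrative additions of this write-up, not something the mail prescribes.

```shell
# Sketch of the race-avoiding sequence from the mail above.
# Assumes the crmsh 'crm' tool and the drbd 8.x /proc/drbd interface.
command -v crm >/dev/null 2>&1 || exit 0   # no cluster tools, nothing to do

crm resource stop ms_drbd
crm resource demote ms_drbd            # both nodes end up Secondary
# Wait for the DRBD handshake to finish (illustrative addition):
until grep -q 'cs:Connected' /proc/drbd; do sleep 1; done
crm resource promote ms_drbd           # promotion now sees UpToDate data
```

Letting the connection reach cs:Connected while both sides are still Secondary is exactly what gives the promote a chance to see UpToDate data instead of "only Consistent".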
Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10
Hi Andrew,

thank you for still having an eye on this issue. I'll do my best to provide the requested reports.

Best regards
Andreas Mock

-Original Message-
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Wednesday, 28 August 2013 00:12
To: General Linux-HA mailing list
Cc: 'The Pacemaker cluster resource manager'
Subject: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10

On 27/08/2013, at 2:51 PM, Andreas Mock wrote:
> Hi Andrew,
>
> as this is a real showstopper at the moment, I invested some more
> hours to be sure (as far as possible) of not having made an error.
>
> Some additions:
> 1) I mirrored the whole mini drbd config to another pacemaker cluster.
>    Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not.

The version of drbd is the same too?

> 2) When I remove the target role Stopped from the drbd ms resource
>    and insert the config snippet related to the drbd device via crm -f
>    into a lean running pacemaker config (pacemaker cluster options,
>    stonith resources), it seems to work. That means one of the nodes
>    gets promoted.
>
> Then after stopping ('crm resource stop ms_drbd_xxx') and starting
> again, I see the same promotion error as described.
>
> The drbd resource agent is using /usr/sbin/crm_master.
> Is there a possibility that feedback given through this client tool
> is changing the timing behaviour of pacemaker? Or the way
> transitions are scheduled?
> Any idea that may be related to a change in pacemaker?

# git diff --stat Pacemaker-1.1.8..Pacemaker-1.1.10 | tail -n 1
 1610 files changed, 109697 insertions(+), 62940 deletions(-)

Needle, meet haystack. Particularly since I have no idea what that drbd error means. If you want me to have a look, you'll need to create a crm_report archive of "works" and "not works". Logs aren't enough.

> Best regards
> Andreas Mock
>
> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, 27 August 2013 05:02
> To: General Linux-HA mailing list
> Cc: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit
> drbd agent between pacemaker 1.1.8 and 1.1.10
>
> On 27/08/2013, at 3:31 AM, Andreas Mock wrote:
>
>> Hi all,
>>
>> while the linbit drbd resource agent seems to work perfectly on
>> pacemaker 1.1.8 (standard software repository), we have problems with
>> the latest release 1.1.10 and also with the newest head 1.1.11.xxx.
>>
>> As using drbd is not so uncommon, I really hope to find interested
>> people to help me out. I can provide as much debug information as you
>> want.
>>
>> Environment:
>> RHEL 6.4 clone (Scientific Linux 6.4), cman based cluster.
>> DRBD 8.4.3 compiled from sources.
>> 64bit
>>
>> - A drbd resource configured following the linbit documentation.
>> - Manual start and stop (up/down) and setting primary of the drbd
>>   resource work smoothly.
>> - 2 nodes dis03-test/dis04-test
>>
>> - The following simple config on pacemaker 1.1.8:
>> configure
>>     property no-quorum-policy=stop
>>     property stonith-enabled=true
>>     rsc_defaults resource-stickiness=2
>>     primitive r_stonith-dis03-test stonith:fence_mock \
>>         meta resource-stickiness="INFINITY" target-role="Started" \
>>         op monitor interval="180" timeout="300" requires="nothing" \
>>         op start interval="0" timeout="300" \
>>         op stop interval="0" timeout="300" \
>>         params vmname=dis03-test pcmk_host_list="dis03-test"
>>     primitive r_stonith-dis04-test stonith:fence_mock \
>>         meta resource-stickiness="INFINITY" target-role="Started" \
>>         op monitor interval="180" timeout="300" requires="nothing" \
>>         op start interval="0" timeout="300" \
>>         op stop interval="0" timeout="300" \
>>         params vmname=dis04-test pcmk_host_list="dis04-test"
>>     location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
>>         rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
>>     location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
>>         rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
>>     primitive r_drbd_postfix ocf:linbit:drbd \
>>         params drbd_resource="postfix" dr
Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10
On 27/08/2013, at 2:51 PM, Andreas Mock wrote:
> Hi Andrew,
>
> as this is a real showstopper at the moment, I invested some more
> hours to be sure (as far as possible) of not having made an error.
>
> Some additions:
> 1) I mirrored the whole mini drbd config to another pacemaker cluster.
>    Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not.

The version of drbd is the same too?

> 2) When I remove the target role Stopped from the drbd ms resource
>    and insert the config snippet related to the drbd device via crm -f
>    into a lean running pacemaker config (pacemaker cluster options,
>    stonith resources), it seems to work. That means one of the nodes
>    gets promoted.
>
> Then after stopping ('crm resource stop ms_drbd_xxx') and starting
> again, I see the same promotion error as described.
>
> The drbd resource agent is using /usr/sbin/crm_master.
> Is there a possibility that feedback given through this client tool
> is changing the timing behaviour of pacemaker? Or the way
> transitions are scheduled?
> Any idea that may be related to a change in pacemaker?

# git diff --stat Pacemaker-1.1.8..Pacemaker-1.1.10 | tail -n 1
 1610 files changed, 109697 insertions(+), 62940 deletions(-)

Needle, meet haystack. Particularly since I have no idea what that drbd error means. If you want me to have a look, you'll need to create a crm_report archive of "works" and "not works". Logs aren't enough.

> Best regards
> Andreas Mock
>
> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, 27 August 2013 05:02
> To: General Linux-HA mailing list
> Cc: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit
> drbd agent between pacemaker 1.1.8 and 1.1.10
>
> On 27/08/2013, at 3:31 AM, Andreas Mock wrote:
>
>> Hi all,
>>
>> while the linbit drbd resource agent seems to work perfectly on
>> pacemaker 1.1.8 (standard software repository), we have problems with
>> the latest release 1.1.10 and also with the newest head 1.1.11.xxx.
>>
>> As using drbd is not so uncommon, I really hope to find interested
>> people to help me out. I can provide as much debug information as you
>> want.
>>
>> Environment:
>> RHEL 6.4 clone (Scientific Linux 6.4), cman based cluster.
>> DRBD 8.4.3 compiled from sources.
>> 64bit
>>
>> - A drbd resource configured following the linbit documentation.
>> - Manual start and stop (up/down) and setting primary of the drbd
>>   resource work smoothly.
>> - 2 nodes dis03-test/dis04-test
>>
>> - The following simple config on pacemaker 1.1.8:
>> configure
>>     property no-quorum-policy=stop
>>     property stonith-enabled=true
>>     rsc_defaults resource-stickiness=2
>>     primitive r_stonith-dis03-test stonith:fence_mock \
>>         meta resource-stickiness="INFINITY" target-role="Started" \
>>         op monitor interval="180" timeout="300" requires="nothing" \
>>         op start interval="0" timeout="300" \
>>         op stop interval="0" timeout="300" \
>>         params vmname=dis03-test pcmk_host_list="dis03-test"
>>     primitive r_stonith-dis04-test stonith:fence_mock \
>>         meta resource-stickiness="INFINITY" target-role="Started" \
>>         op monitor interval="180" timeout="300" requires="nothing" \
>>         op start interval="0" timeout="300" \
>>         op stop interval="0" timeout="300" \
>>         params vmname=dis04-test pcmk_host_list="dis04-test"
>>     location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
>>         rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
>>     location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
>>         rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
>>     primitive r_drbd_postfix ocf:linbit:drbd \
>>         params drbd_resource="postfix" drbdconf="/usr/local/etc/drbd.conf" \
>>         op monitor interval="15s" timeout="60s" role="Master" \
>>         op monitor interval="45s" timeout="60s" role="Slave" \
>>         op start timeout="240" \
>>         op stop timeout="240" \
>>         meta target-role="Stopped" migration-threshold="2"
>>     ms ms_drbd_postfix r_drbd_postfix \
Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10
Hi Andrew, as this is a real showstopper at the moment I invested some other hours to be sure (as far as possible) not having made an error. Some additions: 1) I mirrored the whole mini drbd config to another pacemaker cluster. Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 not 2) When I remove the target role Stopped from the drbd ms resource and insert the config snippet related to the drbd device via crm -f to a lean running pacemaker config (pacemaker cluster options, stonith resources), it seems to work. That means one of the nodes gets promoted. Then after stopping 'crm resource stop ms_drbd_xxx' and starting again I see the same promotion error as described. The drbd resource agent is using /usr/sbin/crm_master. Is there a possibility that feedback given through this client tool is changing the timing behaviour of pacemaker? Or the way transitions are scheduled? Any idea that may be related to a change in pacemaker? Best regards Andreas Mock -Ursprüngliche Nachricht- Von: Andrew Beekhof [mailto:and...@beekhof.net] Gesendet: Dienstag, 27. August 2013 05:02 An: General Linux-HA mailing list Cc: pacemaker@oss.clusterlabs.org Betreff: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10 On 27/08/2013, at 3:31 AM, Andreas Mock wrote: > Hi all, > > while the linbit drbd resource agent seems to work perfectly on > pacemaker 1.1.8 (standard software repository) we have problems with > the last release 1.1.10 and also with the newest head 1.1.11.xxx. > > As using drbd is not so uncommon I really hope to find interested > people helping me out. I can provide as much debug information as you > want. > > > Environment: > RHEL 6.4 clone (Scientific Linux 6.4) cman based cluster. > DRBD 8.4.3 compiled from sources. > 64bit > > - A drbd resource configured following the linbit documentation. > - Manual start and stop (up/down) and setting primary of drbd resource > working smoothly. 
> - 2 nodes dis03-test/dis04-test > > > > - Following simple config on pacemaker 1.1.8 configure >property no-quorum-policy=stop >property stonith-enabled=true >rsc_defaults resource-stickiness=2 >primitive r_stonith-dis03-test stonith:fence_mock \ >meta resource-stickiness="INFINITY" target-role="Started" \ >op monitor interval="180" timeout="300" requires="nothing" \ >op start interval="0" timeout="300" \ >op stop interval="0" timeout="300" \ >params vmname=dis03-test pcmk_host_list="dis03-test" >primitive r_stonith-dis04-test stonith:fence_mock \ >meta resource-stickiness="INFINITY" target-role="Started" \ >op monitor interval="180" timeout="300" requires="nothing" \ >op start interval="0" timeout="300" \ >op stop interval="0" timeout="300" \ >params vmname=dis04-test pcmk_host_list="dis04-test" >location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \ >rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname > eq dis03-test >location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \ >rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname > eq dis04-test >primitive r_drbd_postfix ocf:linbit:drbd \ >params drbd_resource="postfix" drbdconf="/usr/local/etc/drbd.conf" \ >op monitor interval="15s" timeout="60s" role="Master" \ >op monitor interval="45s" timeout="60s" role="Slave" \ >op start timeout="240" \ >op stop timeout="240" \ >meta target-role="Stopped" migration-threshold="2" >ms ms_drbd_postfix r_drbd_postfix \ >meta master-max="1" master-node-max="1" \ >clone-max="2" clone-node-max="1" \ >notify="true" \ >meta target-role="Stopped" > commit > > - Pacemaker is started from scratch > - Config above is applied by crm -f where has the above > config snippet. 
> - After that, crm_mon shows the following status:
> --8<-
> Last updated: Mon Aug 26 18:42:47 2013
> Last change: Mon Aug 26 18:42:42 2013 via cibadmin on dis03-test
> Stack: cman
> Current DC: dis03-test - partition with quorum
> Version: 1.1.10-1.el6-9abe687
> 2 Nodes configured
> 4 Resources configured
>
> Online: [ dis03-test dis04-test ]
>
> Full list of resources:
>
> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>     Stopped: [ dis03-test dis04-test ]
> --8<-
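Andreas's question about /usr/sbin/crm_master can be made concrete. Below is a hedged sketch, not the linbit agent's actual code: the helper function name and the score values are illustrative. It shows the general pattern by which a master/slave OCF agent reports promotion preferences back to pacemaker via crm_master (a thin wrapper around crm_attribute that sets a transient per-node "master score"):

```shell
# Illustrative only: the real ocf:linbit:drbd logic is more involved.
# master_score_cmd echoes the crm_master invocation an agent would run
# for a given local DRBD role, rather than executing it, so the pattern
# can be shown without a running cluster.
master_score_cmd() {
    # $1: local drbd role, e.g. the first field of "drbdadm role <res>"
    case "$1" in
        Primary)   echo "crm_master -Q -l reboot -v 10000" ;;  # strongly prefer promotion here
        Secondary) echo "crm_master -Q -l reboot -v 10"    ;;  # promotable, lower preference
        *)         echo "crm_master -Q -l reboot -D"       ;;  # no usable data: drop the score
    esac
}
```

Each real crm_master call updates a node attribute in the CIB, which can trigger a new transition — which is why, as Andreas suspects, the timing of this feedback can interact with how the policy engine schedules promote actions.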
Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10
On 27/08/2013, at 3:31 AM, Andreas Mock wrote:

> Hi all,
>
> while the linbit drbd resource agent seems to work perfectly on
> pacemaker 1.1.8 (standard software repository), we have problems
> with the latest release 1.1.10 and also with the newest head
> 1.1.11.xxx.
>
> As using drbd is not so uncommon, I really hope to find interested
> people helping me out. I can provide as much debug information as
> you want.
>
> Environment:
> RHEL 6.4 clone (Scientific Linux 6.4), cman-based cluster.
> DRBD 8.4.3 compiled from sources.
> 64-bit.
>
> - A drbd resource configured following the linbit documentation.
> - Manual start and stop (up/down) and setting primary of the drbd
>   resource work smoothly.
> - 2 nodes: dis03-test/dis04-test
>
> - The following simple config on pacemaker 1.1.8:
>
> configure
>     property no-quorum-policy=stop
>     property stonith-enabled=true
>     rsc_defaults resource-stickiness=2
>     primitive r_stonith-dis03-test stonith:fence_mock \
>         meta resource-stickiness="INFINITY" target-role="Started" \
>         op monitor interval="180" timeout="300" requires="nothing" \
>         op start interval="0" timeout="300" \
>         op stop interval="0" timeout="300" \
>         params vmname=dis03-test pcmk_host_list="dis03-test"
>     primitive r_stonith-dis04-test stonith:fence_mock \
>         meta resource-stickiness="INFINITY" target-role="Started" \
>         op monitor interval="180" timeout="300" requires="nothing" \
>         op start interval="0" timeout="300" \
>         op stop interval="0" timeout="300" \
>         params vmname=dis04-test pcmk_host_list="dis04-test"
>     location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
>         rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
>     location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
>         rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
>     primitive r_drbd_postfix ocf:linbit:drbd \
>         params drbd_resource="postfix" drbdconf="/usr/local/etc/drbd.conf" \
>         op monitor interval="15s" timeout="60s" role="Master" \
>         op monitor interval="45s" timeout="60s" role="Slave" \
>         op start timeout="240" \
>         op stop timeout="240" \
>         meta target-role="Stopped" migration-threshold="2"
>     ms ms_drbd_postfix r_drbd_postfix \
>         meta master-max="1" master-node-max="1" \
>         clone-max="2" clone-node-max="1" \
>         notify="true" \
>         meta target-role="Stopped"
> commit
>
> - Pacemaker is started from scratch.
> - The config above is applied by 'crm -f <file>', where <file> has the
>   above config snippet.
>
> - After that, crm_mon shows the following status:
> --8<-
> Last updated: Mon Aug 26 18:42:47 2013
> Last change: Mon Aug 26 18:42:42 2013 via cibadmin on dis03-test
> Stack: cman
> Current DC: dis03-test - partition with quorum
> Version: 1.1.10-1.el6-9abe687
> 2 Nodes configured
> 4 Resources configured
>
> Online: [ dis03-test dis04-test ]
>
> Full list of resources:
>
> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>     Stopped: [ dis03-test dis04-test ]
>
> Migration summary:
> * Node dis04-test:
> * Node dis03-test:
> --8<-
>
> cat /proc/drbd
> version: 8.4.3 (api:1/proto:86-101)
> GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by root@dis03-test,
> 2013-07-24 17:19:24
>
> on both nodes. The drbd resource was shut down previously in a clean
> state, so that either node can be the primary.
> - Now the weird behaviour, when trying to start the drbd resource with
>
>     crm resource start ms_drbd_postfix
>
> Output of crm_mon -1rf:
> --8<-
> Last updated: Mon Aug 26 18:46:33 2013
> Last change: Mon Aug 26 18:46:30 2013 via cibadmin on dis04-test
> Stack: cman
> Current DC: dis03-test - partition with quorum
> Version: 1.1.10-1.el6-9abe687
> 2 Nodes configured
> 4 Resources configured
>
> Online: [ dis03-test dis04-test ]
>
> Full list of resources:
>
> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>     Slaves: [ dis03-test ]
>     Stopped: [ dis04-test ]
>
> Migration summary:
> * Node dis04-test:
>     r_drbd_postfix: migration-threshold=2 fail-count=2 last-failure='Mon Aug 26 18:46:30 2013'
> * Node dis03-test:

It's hard to imagine how pacemaker could cause drbdadm to fail, short of leaving the other side promoted while trying to promote another. Perhaps the drbd folks could comment on what the error means.

> Failed actions:
>     r_drbd_postfix_promote_0 (node=dis04-test, call=34, rc=1,
>         status=complete, last-rc-change=Mon Aug 26 18:46:29 2013,
>         queued=1212ms, exec=0ms
>     ): unknown error
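For readers puzzling over the fail-count lines in the crm_mon output above: the rule pacemaker applies is simple. Here is a hedged sketch of that rule (the function name is ours, not pacemaker's; the real scheduler works on CIB attributes, not shell arguments):

```shell
# Sketch of the migration-threshold rule visible in the crm_mon output:
# once a resource's fail-count on a node reaches migration-threshold,
# pacemaker bans the resource from that node until the failure record
# is cleared.
is_banned() {
    # $1: fail-count  $2: migration-threshold
    if [ "$1" -ge "$2" ]; then
        echo banned      # node gets an implicit -INFINITY location score
    else
        echo allowed
    fi
}
```

So with migration-threshold=2 and fail-count=2, dis04-test can no longer host r_drbd_postfix until the failure is cleaned up (e.g. with 'crm resource cleanup ms_drbd_postfix'), which is why only dis03-test still appears as a slave.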