Re: [Pacemaker] crond on both nodes (active/passive) but some jobs on active only
thanks a lot !

2013/7/5 Lars Ellenberg
> On Fri, Jul 05, 2013 at 04:52:35PM +0200, andreas graeper wrote:
> > when i wrote a script handled by ocf:heartbeat:anything, i.e. one that
> > signals the cron daemon to reload crontabs
> > when the crontab file is enabled by symlink:start and disabled by symlink:stop,
> >
> > how can i achieve that the script runs after symlink :start and :stop ?
> > when i define an order constraint "R1 then R2", does this implicitly mean
> > R1:start, R2:start and R2:stop, R1:stop ?
>
> Not an answer to that specific question,
> rather a "why even bother" suggestion:
>
> You say:
> > > two nodes active/passive and fetchmail as cronjob shall run on active only.
>
> How do you know the node is "active"?
> Maybe some specific file system is mounted?
> Great. You have files and directories
> which are only visible on an "active" node.
>
> Why not prefix your cron job lines with
>   test -e /this/file/only/visible/on/active || exit 0; real cron command follows
> or
>   cd /some/dir/only/on/active || exit 0; real cron command
>
> or a wrapper, if that looks too ugly
>   only-on-active real cron command
>
> /bin/only-on-active:
>   #!/bin/sh
>   same-active-test-as-above || exit 0
>   "$@"   # do the real cron command
>
> Lars
Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
----- Original Message -----
> From: "David Vossel"
> To: "The Pacemaker cluster resource manager"
> Sent: Wednesday, July 3, 2013 4:20:37 PM
> Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
>
> ----- Original Message -----
> > From: "Lindsay Todd"
> > To: "The Pacemaker cluster resource manager"
> > Sent: Wednesday, July 3, 2013 2:12:05 PM
> > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> >
> > Well, I'm not getting failures right now simply with attributes, but I can
> > induce a failure by stopping the vm-db02 (it puts db02 into an unclean
> > state, and attempts to migrate the unrelated vm-compute-test). I've
> > collected the commands from my latest interactions, a crm_report, and a gdb
> > traceback from the core file that crmd dumped, into bug 5164.
>
> Thanks, hopefully I can start investigating this Friday
>
> -- Vossel

Yeah, this is a bad one. Adding the node attributes using crm_attribute for the
remote-node did some unexpected things to the crmd component. Somehow the
remote-node was getting entered into the cluster node cache... which made it
look like we had both a cluster-node and remote-node named the same thing...
not good. I think I got that part worked out. Try this patch.

https://github.com/ClusterLabs/pacemaker/commit/67dfff76d632f1796c9ded8fd367aa49258c8c32

Rather than trying to patch RCs, it might be worth trying out the master branch
on github (which already has this patch). If you aren't already, use rpms to
make your life easier. Running 'make rpm' in the source directory will generate
them for you.

There was another bug fixed recently in pacemaker_remote involving the
directory created for resource agents to store their temporary data (stuff like
pid files). I believe the fix was not introduced until 1.1.10rc6.

-- Vossel
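For anyone wanting to follow the "try master and build rpms" suggestion, a
minimal sketch of what that could look like (exact build prerequisites and
package paths vary by distribution, and you may need ./autogen.sh && ./configure
before the make target works, so treat this as an outline rather than exact
commands):

  # grab the current development branch, which already contains the fix
  git clone https://github.com/ClusterLabs/pacemaker.git
  cd pacemaker

  # build binary rpms from the source tree, as suggested above
  make rpm

  # install the generated packages (where the rpms land depends on the build setup)
  yum localinstall <path-to-generated-rpms>/pacemaker-*.rpm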
Re: [Pacemaker] crond on both nodes (active/passive) but some jobs on active only
On Fri, Jul 05, 2013 at 04:52:35PM +0200, andreas graeper wrote:
> when i wrote a script handled by ocf:heartbeat:anything, i.e. one that
> signals the cron daemon to reload crontabs
> when the crontab file is enabled by symlink:start and disabled by symlink:stop,
>
> how can i achieve that the script runs after symlink :start and :stop ?
> when i define an order constraint "R1 then R2", does this implicitly mean
> R1:start, R2:start and R2:stop, R1:stop ?

Not an answer to that specific question,
rather a "why even bother" suggestion:

You say:
> > two nodes active/passive and fetchmail as cronjob shall run on active only.

How do you know the node is "active"?
Maybe some specific file system is mounted?
Great. You have files and directories
which are only visible on an "active" node.

Why not prefix your cron job lines with
  test -e /this/file/only/visible/on/active || exit 0; real cron command follows
or
  cd /some/dir/only/on/active || exit 0; real cron command

or a wrapper, if that looks too ugly
  only-on-active real cron command

/bin/only-on-active:
  #!/bin/sh
  same-active-test-as-above || exit 0
  "$@"   # do the real cron command

Lars

> 2013/7/5 andreas graeper
>
> > hi,
> > two nodes active/passive and fetchmail as cronjob shall run on active only.
> >
> > i use ocf:heartbeat:symlink to move / rename
> >
> >   /etc/cron.d/jobs <> /etc/cron.d/jobs.disable
> >
> > i read somewhere that crond ignores files with a dot.
> >
> > but new experience: crond needs to be restarted or signalled.
> >
> > how is this done best within pacemaker ?
> > is clone for me ?
> >
> > thanks in advance
> > andreas

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
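To make the wrapper idea concrete, here is a minimal sketch; the marker path
/srv/cluster/.active, the user name and the fetchmail job are made-up examples,
substitute whatever only exists on the active node in your setup:

  /usr/local/bin/only-on-active:

    #!/bin/sh
    # Run the given command only if this node currently holds the clustered
    # data; on the passive node the test fails and we exit quietly.
    test -e /srv/cluster/.active || exit 0
    exec "$@"

  /etc/cron.d/fetchmail (installed identically on both nodes):

    */5 * * * *  fetchuser  /usr/local/bin/only-on-active /usr/bin/fetchmail --silent

With this approach the crontab never has to be enabled/disabled and crond never
needs to be reloaded; the guard simply turns the job into a no-op on the
passive node.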
Re: [Pacemaker] Another question about fencing/stonithing
Thank you for your hint.

There is a German saying which I try to translate:
"You don't see the forest 'cause of all the trees"

So, I'll see.

Best regards
Andreas Mock

-----Original Message-----
From: Digimer [mailto:li...@alteeve.ca]
Sent: Friday, 5 July 2013 17:22
To: Andreas Mock
Cc: 'The Pacemaker cluster resource manager'; 'Marek Grac'
Subject: Re: AW: [Pacemaker] Another question about fencing/stonithing

Andrew might know the trick. In theory, putting your agent into the /usr/sbin
or /sbin directory (wherever the other agents are) should "just work". You're
sure the exit codes are appropriate? I am sure they are, but just thinking out
loud about too-obvious-to-see possible issues.

On 05/07/13 11:17, Andreas Mock wrote:
> Hi Digimer,
>
> sorry, I forgot to mention that I implemented the metadata call
> accordingly. But it may be the "registration" thing which
> is necessary to make it known to the stonith/fencing daemon.
>
> I don't know. I'm wondering a little bit that there is no
> pointer how to do it.
>
> Thank you for your answer!
>
> Best regards
> Andreas Mock
>
> -----Original Message-----
> From: Digimer [mailto:li...@alteeve.ca]
> Sent: Friday, 5 July 2013 16:52
> To: The Pacemaker cluster resource manager
> Cc: Andreas Mock; Marek Grac
> Subject: Re: [Pacemaker] Another question about fencing/stonithing
>
> On 05/07/13 03:34, Andreas Mock wrote:
>> Hi all,
>>
>> I just wrote a stonith agent which IMHO implements the
>> API spec found at https://fedorahosted.org/cluster/wiki/FenceAgentAPI.
>>
>> But it seems it has a problem when used as a pacemaker stonith device.
>>
>> What has to be done to have a stonith/fencing agent which implements
>> both roles? I'm pretty sure something is missing.
>> It's just a guess that it has something to do with listing "registered"
>> agents.
>>
>> What is a registered stonith agent and what is done while registering it?
>>
>> When I configure my own fencing agent as a pacemaker stonith device
>> and try to do a "stonith_admin --list=nodename" I get a "no such device"
>> error.
>>
>> Any pointer appreciated.
>>
>> Best regards
>> Andreas Mock
>
> The API doesn't (yet) cover the metadata action. The agents now have to
> print out XML describing the valid attributes and elements for your
> agent. If you call any existing fence_* agent with just -o metadata, you
> will see the format.
>
> I know rhcs can be forced to see the new agent by putting it in the same
> directory as the other agents and then running 'ccs_update_schema'. If
> pacemaker doesn't immediately see it, then there might be an equivalent
> command you can run.
>
> I will try to get the API updated. I'm not a cardinal source, but
> something is better than nothing. Marek (who I have cc'ed) is, so I can
> run the changes by him when done to ensure they're accurate.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access
to education?
[Pacemaker] Pacemaker 1.1.10 rc 5 & rc 6
Hi,

I'm trying to update pacemaker on centos 6.4 hosts but each release introduces
some new problems %). We have the centos 6.4 corosync and cman packages and the
latest pcs / pacemaker. The cluster is cman based.

Pacemaker 1.1.10 rc5 was almost nice, excluding a repeating message on all of
our nodes:

Jul 5 10:32:34 devpacemaker01 stonith-ng[14501]: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
Jul 5 10:32:34 devpacemaker01 stonith-ng[14501]: warning: cib_process_diff: Diff 0.4.83 -> 0.4.84 from local not applied to 0.4.83: Failed application of an update diff
Jul 5 10:32:34 devpacemaker01 stonith-ng[14501]: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
Jul 5 10:32:34 devpacemaker01 stonith-ng[14501]: warning: cib_process_diff: Diff 0.4.84 -> 0.4.85 from local not applied to 0.4.84: Failed application of an update diff
Jul 5 10:32:34 devpacemaker01 stonith-ng[14501]: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
Jul 5 10:32:34 devpacemaker01 stonith-ng[14501]: warning: cib_process_diff: Diff 0.4.85 -> 0.4.86 from local not applied to 0.4.85: Failed application of an update diff
Jul 5 10:32:34 devpacemaker01 stonith-ng[14501]: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
Jul 5 10:32:34 devpacemaker01 stonith-ng[14501]: warning: cib_process_diff: Diff 0.4.86 -> 0.4.87 from local not applied to 0.4.86: Failed application of an update diff
Jul 5 10:32:34 devpacemaker01 stonith-ng[14501]: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
Jul 5 10:32:34 devpacemaker01 stonith-ng[14501]: warning: cib_process_diff: Diff 0.4.87 -> 0.4.88 from local not applied to 0.4.87: Failed application of an update diff
Jul 5 10:32:34 devpacemaker01 stonith-ng[14501]: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
Jul 5 10:32:35 devpacemaker01 stonith-ng[14501]: warning: cib_process_diff: Diff 0.4.88 -> 0.4.89 from local not applied to 0.4.88: Failed application of an update diff
Jul 5 10:32:35 devpacemaker01 stonith-ng[14501]: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)

Not critical, as the stuff worked, but it looks strange; it doesn't matter what
you do, it keeps complaining. Full CIB resync, fresh configuration importing,
nothing helps.

Cluster Properties:
 cluster-delay: 10s
 cluster-infrastructure: cman
 cluster-recheck-interval: 2min
 last-lrm-refresh: 1373023780
 no-quorum-policy: freeze
 start-failure-is-fatal: true
 stonith-enabled: false

Today we upgraded to 1.1.10 rc6 and it made it worse... Also, it broke
'default' fencing.
Previously, even with stonith-enabled: false, pacemaker was trying to kill
cman / corosync if the connection is lost or a split brain occurs, but now
it's not happening:

Jul 5 09:54:25 devpacemaker01 crmd[20840]: notice: tengine_stonith_notify: Peer devpacemaker03_eth1 was not terminated (reboot) by devpacemaker02_eth1 for devpacemaker02_eth1: No such device (ref=1fc11b87-529d-4f6c-b4e6-ffaa82c06bd8) by client stonith_admin.cman.8832
Jul 5 09:54:28 devpacemaker01 stonith-ng[20838]: notice: remote_op_done: Operation reboot of devpacemaker03_eth1 by devpacemaker02_eth1 for stonith_admin.cman.8855@devpacemaker02_eth1.6e0e0da3: No such device
Jul 5 09:54:28 devpacemaker01 crmd[20840]: notice: tengine_stonith_notify: Peer devpacemaker03_eth1 was not terminated (reboot) by devpacemaker02_eth1 for devpacemaker02_eth1: No such device (ref=6e0e0da3-f9f9-43a0-933e-0ff9ec2cb390) by client stonith_admin.cman.8855
Jul 5 09:54:31 devpacemaker01 stonith-ng[20838]: notice: remote_op_done: Operation reboot of devpacemaker03_eth1 by devpacemaker02_eth1 for stonith_admin.cman.9017@devpacemaker02_eth1.955b859b: No such device
Jul 5 09:54:31 devpacemaker01 crmd[20840]: notice: tengine_stonith_notify: Peer devpacemaker03_eth1 was not terminated (reboot) by devpacemaker02_eth1 for devpacemaker02_eth1: No such device (ref=955b859b-791e-4083-b760-a6f8f05ddc2f) by client stonith_admin.cman.9017
Jul 5 09:54:35 devpacemaker01 stonith-ng[20838]: notice: remote_op_done: Operation reboot of devpacemaker03_eth1 by devpacemaker02_eth1 for stonith_admin.cman.9089@devpacemaker02_eth1.ede9aa4e: No such device
Jul 5 09:54:35 devpacemaker01 crmd[20840]: notice: tengine_stonith_notify: Peer devpacemaker03_eth1 was not terminated (reboot) by devpacemaker02_eth1 for devpacemaker02_eth1: No such device (ref=ede9aa4e-32e0-4f3d-bd3a-f519c1250363) by client stonith_admin.cman.9089
Jul 5 09:54:38 devpacemaker01 stonith-ng[20838]: notice: remote_op_done: Operation reboot of devpacemaker03_eth1 by devpacemaker02_eth1 for stonith_admin.cman.9242@devpacemaker02_eth1.2d92ca8d: No such device
Ju
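For what it's worth, the "No such device" above is what stonith_admin reports
when no stonith resource matching the target node is configured in the CIB. A
rough sketch of configuring one with pcs is below; fence_ipmilan and every
address/credential shown are placeholders that would have to be adapted to the
real hardware:

  # tell pacemaker that fencing is available and should be used
  pcs property set stonith-enabled=true

  # one fence device per node; pcmk_host_list maps the device to the node name
  pcs stonith create fence_dev01 fence_ipmilan \
      ipaddr=10.0.0.1 login=admin passwd=secret lanplus=1 \
      pcmk_host_list=devpacemaker01_eth1

  pcs stonith create fence_dev02 fence_ipmilan \
      ipaddr=10.0.0.2 login=admin passwd=secret lanplus=1 \
      pcmk_host_list=devpacemaker02_eth1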
Re: [Pacemaker] Another question about fencing/stonithing
Andrew might know the trick. In theory, putting your agent into the /usr/sbin
or /sbin directory (wherever the other agents are) should "just work". You're
sure the exit codes are appropriate? I am sure they are, but just thinking out
loud about too-obvious-to-see possible issues.

On 05/07/13 11:17, Andreas Mock wrote:

Hi Digimer,

sorry, I forgot to mention that I implemented the metadata call
accordingly. But it may be the "registration" thing which
is necessary to make it known to the stonith/fencing daemon.

I don't know. I'm wondering a little bit that there is no
pointer how to do it.

Thank you for your answer!

Best regards
Andreas Mock

-----Original Message-----
From: Digimer [mailto:li...@alteeve.ca]
Sent: Friday, 5 July 2013 16:52
To: The Pacemaker cluster resource manager
Cc: Andreas Mock; Marek Grac
Subject: Re: [Pacemaker] Another question about fencing/stonithing

On 05/07/13 03:34, Andreas Mock wrote:

Hi all,

I just wrote a stonith agent which IMHO implements the
API spec found at https://fedorahosted.org/cluster/wiki/FenceAgentAPI.

But it seems it has a problem when used as a pacemaker stonith device.

What has to be done to have a stonith/fencing agent which implements
both roles? I'm pretty sure something is missing.
It's just a guess that it has something to do with listing "registered"
agents.

What is a registered stonith agent and what is done while registering it?

When I configure my own fencing agent as a pacemaker stonith device
and try to do a "stonith_admin --list=nodename" I get a "no such device"
error.

Any pointer appreciated.

Best regards
Andreas Mock

The API doesn't (yet) cover the metadata action. The agents now have to print
out XML describing the valid attributes and elements for your agent. If you
call any existing fence_* agent with just -o metadata, you will see the format.

I know rhcs can be forced to see the new agent by putting it in the same
directory as the other agents and then running 'ccs_update_schema'. If
pacemaker doesn't immediately see it, then there might be an equivalent
command you can run.

I will try to get the API updated. I'm not a cardinal source, but something is
better than nothing. Marek (who I have cc'ed) is, so I can run the changes by
him when done to ensure they're accurate.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access
to education?
Re: [Pacemaker] Another question about fencing/stonithing
Hi Digimer,

sorry, I forgot to mention that I implemented the metadata call
accordingly. But it may be the "registration" thing which
is necessary to make it known to the stonith/fencing daemon.

I don't know. I'm wondering a little bit that there is no
pointer how to do it.

Thank you for your answer!

Best regards
Andreas Mock

-----Original Message-----
From: Digimer [mailto:li...@alteeve.ca]
Sent: Friday, 5 July 2013 16:52
To: The Pacemaker cluster resource manager
Cc: Andreas Mock; Marek Grac
Subject: Re: [Pacemaker] Another question about fencing/stonithing

On 05/07/13 03:34, Andreas Mock wrote:
> Hi all,
>
> I just wrote a stonith agent which IMHO implements the
> API spec found at https://fedorahosted.org/cluster/wiki/FenceAgentAPI.
>
> But it seems it has a problem when used as a pacemaker stonith device.
>
> What has to be done to have a stonith/fencing agent which implements
> both roles? I'm pretty sure something is missing.
> It's just a guess that it has something to do with listing "registered"
> agents.
>
> What is a registered stonith agent and what is done while registering it?
>
> When I configure my own fencing agent as a pacemaker stonith device
> and try to do a "stonith_admin --list=nodename" I get a "no such device"
> error.
>
> Any pointer appreciated.
>
> Best regards
> Andreas Mock

The API doesn't (yet) cover the metadata action. The agents now have to
print out XML describing the valid attributes and elements for your
agent. If you call any existing fence_* agent with just -o metadata, you
will see the format.

I know rhcs can be forced to see the new agent by putting it in the same
directory as the other agents and then running 'ccs_update_schema'. If
pacemaker doesn't immediately see it, then there might be an equivalent
command you can run.

I will try to get the API updated. I'm not a cardinal source, but
something is better than nothing. Marek (who I have cc'ed) is, so I can
run the changes by him when done to ensure they're accurate.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access
to education?
Re: [Pacemaker] Another question about fencing/stonithing
On 05/07/13 03:34, Andreas Mock wrote:

Hi all,

I just wrote a stonith agent which IMHO implements the
API spec found at https://fedorahosted.org/cluster/wiki/FenceAgentAPI.

But it seems it has a problem when used as a pacemaker stonith device.

What has to be done to have a stonith/fencing agent which implements
both roles? I'm pretty sure something is missing.
It's just a guess that it has something to do with listing "registered"
agents.

What is a registered stonith agent and what is done while registering it?

When I configure my own fencing agent as a pacemaker stonith device
and try to do a "stonith_admin --list=nodename" I get a "no such device"
error.

Any pointer appreciated.

Best regards
Andreas Mock

The API doesn't (yet) cover the metadata action. The agents now have to print
out XML describing the valid attributes and elements for your agent. If you
call any existing fence_* agent with just -o metadata, you will see the format.

I know rhcs can be forced to see the new agent by putting it in the same
directory as the other agents and then running 'ccs_update_schema'. If
pacemaker doesn't immediately see it, then there might be an equivalent
command you can run.

I will try to get the API updated. I'm not a cardinal source, but something is
better than nothing. Marek (who I have cc'ed) is, so I can run the changes by
him when done to ensure they're accurate.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access
to education?
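For anyone who has never looked at that output: the -o metadata result of the
stock fence agents is roughly shaped like the XML below. This is a trimmed,
illustrative sketch, not the exact schema, and fence_example with its two
parameters is made up:

  <?xml version="1.0" ?>
  <resource-agent name="fence_example" shortdesc="Example fence agent">
    <longdesc>Fences nodes via an example power switch.</longdesc>
    <parameters>
      <parameter name="ipaddr" unique="0" required="1">
        <getopt mixed="-a, --ip=[address]"/>
        <content type="string"/>
        <shortdesc lang="en">IP address or hostname of the fencing device</shortdesc>
      </parameter>
      <parameter name="port" unique="0" required="1">
        <getopt mixed="-n, --plug=[id]"/>
        <content type="string"/>
        <shortdesc lang="en">Plug number or name of the machine to fence</shortdesc>
      </parameter>
    </parameters>
    <actions>
      <action name="on"/>
      <action name="off"/>
      <action name="reboot"/>
      <action name="status"/>
      <action name="list"/>
      <action name="monitor"/>
      <action name="metadata"/>
    </actions>
  </resource-agent>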
Re: [Pacemaker] crond on both nodes (active/passive) but some jobs on active only
when i wrote a script handled by ocf:heartbeat:anything, i.e. one that
signals the cron daemon to reload crontabs
when the crontab file is enabled by symlink:start and disabled by symlink:stop,

how can i achieve that the script runs after symlink :start and :stop ?
when i define an order constraint "R1 then R2", does this implicitly mean
R1:start, R2:start and R2:stop, R1:stop ?

thanks in advance
andreas

2013/7/5 andreas graeper

> hi,
> two nodes active/passive and fetchmail as cronjob shall run on active only.
>
> i use ocf:heartbeat:symlink to move / rename
>
>   /etc/cron.d/jobs <> /etc/cron.d/jobs.disable
>
> i read somewhere that crond ignores files with a dot.
>
> but new experience: crond needs to be restarted or signalled.
>
> how is this done best within pacemaker ?
> is clone for me ?
>
> thanks in advance
> andreas
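On the order-constraint part of the question: ordering constraints are
symmetrical unless configured otherwise, so "R1 then R2" does imply R1:start
before R2:start and, on the way down, R2:stop before R1:stop. A small
illustrative sketch in crm shell syntax (the resource names are placeholders):

  # symmetrical by default: cron-link is started before cron-reload,
  # and cron-reload is stopped before cron-link on shutdown
  order cron-reload-after-link inf: cron-link cron-reload

  # the same constraint spelled out explicitly
  order cron-reload-after-link inf: cron-link:start cron-reload:start symmetrical=true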
[Pacemaker] crond on both nodes (active/passive) but some jobs on active only
hi,

two nodes active/passive and fetchmail as cronjob shall run on active only.

i use ocf:heartbeat:symlink to move / rename

  /etc/cron.d/jobs <> /etc/cron.d/jobs.disable

i read somewhere that crond ignores files with a dot.

but new experience: crond needs to be restarted or signalled.

how is this done best within pacemaker ?
is clone for me ?

thanks in advance
andreas
[Pacemaker] Fwd: Java application failover problem
Hello,

we are facing a problem with a simple (I hope) cluster configuration with 2
nodes, ims0 and ims1, and 3 primitives (no shared storage or anything like that
where data corruption would be a danger):

- master-slave Java application ims (to be run normally on both nodes as
  master/slave, with our own OCF script) with an embedded web server (to be
  accessed by clients)
- ims-ip and ims-ip-src: shared IP address and outgoing source address to be
  run on the ims master solely

Below are listed the software versions, crm configuration and portions of the
corosync log. The problem is that most of the time the setup works (i.e. if the
master ims application dies, the slave one is promoted and the IP addresses are
remapped), but sometimes when the master ims application stops (fails or is
killed), the failover does not occur - the slave ims application remains the
slave and the shared IP address remains mapped on the node with the dead ims.

I even created a testbed of 2 servers, killing the ims application from cron
every 15 minutes on the supposed MAIN server to simulate the failure, observe
the failover and replicate the problem (sometimes it works properly for
hours/days). For example today (July 4, 23:45 local time) the ims at ims0 was
killed, but remained Master - no failover of IP addresses was performed and
ims on ims1 remained Slave:

Last updated: Fri Jul 5 02:07:18 2013
Last change: Thu Jul 4 23:33:46 2013
Stack: openais
Current DC: ims0 - partition with quorum
Version: 1.1.7-61a079313275f3e9d0e85671f62c721d32ce3563
2 Nodes configured, 2 expected votes
6 Resources configured.

Online: [ ims1 ims0 ]

 Master/Slave Set: ms-ims [ims]
     Masters: [ ims0 ]
     Slaves: [ ims1 ]
 Clone Set: clone-cluster-mon [cluster-mon]
     Started: [ ims0 ims1 ]
 Resource Group: on-ims-master
     ims-ip     (ocf::heartbeat:IPaddr2):     Started ims0
     ims-ip-src (ocf::heartbeat:IPsrcaddr):   Started ims0

The command 'crm node standby' on ims0 did not fix the thing: ims0 remained
master (although standby):

Node ims0: standby
Online: [ ims1 ]

 Master/Slave Set: ms-ims [ims]
     ims:0 (ocf::microstepmis:imsMS): Slave ims0 FAILED
     Slaves: [ ims1 ]
 Clone Set: clone-cluster-mon [cluster-mon]
     Started: [ ims1 ]
     Stopped: [ cluster-mon:0 ]

Failed actions:
    ims:0_demote_0 (node=ims0, call=3179, rc=7, status=complete): not running

Stopping the openais service on ims0 completely did the thing.

Could someone provide me with a hint what to do?
- provide more information (logs, ocf script)?
- change something in the configuration?
- change the environment / versions?
Thanks a lot
Martin Gazak

Software versions:
------------------
libpacemaker3-1.1.7-42.1
pacemaker-1.1.7-42.1
corosync-1.4.3-21.1
libcorosync4-1.4.3-21.1
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 2

Configuration:
--------------
node ims0 \
        attributes standby="off"
node ims1 \
        attributes standby="off"
primitive cluster-mon ocf:pacemaker:ClusterMon \
        params htmlfile="/opt/ims/tomcat/webapps/ims/html/crm_status.html" \
        op monitor interval="10"
primitive ims ocf:microstepmis:imsMS \
        op monitor interval="1" role="Master" timeout="20" \
        op monitor interval="2" role="Slave" timeout="20" \
        op start interval="0" timeout="1800s" \
        op stop interval="0" timeout="120s" \
        op promote interval="0" timeout="180s" \
        meta failure-timeout="360s"
primitive ims-ip ocf:heartbeat:IPaddr2 \
        params ip="192.168.141.13" nic="bond1" iflabel="ims" cidr_netmask="24" \
        op monitor interval="15s" \
        meta failure-timeout="60s"
primitive ims-ip-src ocf:heartbeat:IPsrcaddr \
        params ipaddress="192.168.141.13" cidr_netmask="24" \
        op monitor interval="15s" \
        meta failure-timeout="60s"
group on-ims-master ims-ip ims-ip-src
ms ms-ims ims \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started" migration-threshold="1"
clone clone-cluster-mon cluster-mon
colocation ims_master inf: on-ims-master ms-ims:Master
order ms-ims-before inf: ms-ims:promote on-ims-master:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-61a079313275f3e9d0e85671f62c721d32ce3563" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        cluster-recheck-interval="1m" \
        default-resource-stickiness="1000" \
        last-lrm-refresh="1372951736" \
        maintenance-mode="false"

corosync.log from ims0:
-----------------------
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM operation ims:0_monitor_1000 (call=3046, rc=7, cib-update=6229, confirmed=false) not running
Jul 04 23:45:02 ims0 crmd: [3935]: info: process_graph_event: Detected action ims:0_monitor_1000 from a different transition: 4024 vs. 4035
Jul 04 23:45:02 ims0 crmd: [3935]: i
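Since imsMS is a custom agent, one thing worth double-checking is what its
monitor action returns in each state; Pacemaker bases promotion and recovery
decisions on those OCF return codes, and the failed demote above reporting
rc=7 ("not running") suggests the agent found the application already gone. A
rough sketch of the usual convention for a master/slave monitor (the pgrep
pattern and the ims_is_master test are made up, not the real microstepmis
logic):

  # return codes normally provided by ocf-shellfuncs
  OCF_SUCCESS=0; OCF_NOT_RUNNING=7; OCF_RUNNING_MASTER=8

  ims_is_master() {
      # placeholder for however the application exposes its current role
      test -f /opt/ims/state/master
  }

  ims_monitor() {
      if ! pgrep -f 'ims' >/dev/null 2>&1; then
          return $OCF_NOT_RUNNING        # 7: process gone
      elif ims_is_master; then
          return $OCF_RUNNING_MASTER     # 8: running and promoted
      else
          return $OCF_SUCCESS            # 0: running as slave
      fi
  }

If monitor and demote agree on those codes, the policy engine can tell a dead
master from a demoted one, which tends to make the promotion on the surviving
node more reliable.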
[Pacemaker] Another question about fencing/stonithing
Hi all,

I just wrote a stonith agent which IMHO implements the
API spec found at https://fedorahosted.org/cluster/wiki/FenceAgentAPI.

But it seems it has a problem when used as a pacemaker stonith device.

What has to be done to have a stonith/fencing agent which implements
both roles? I'm pretty sure something is missing.
It's just a guess that it has something to do with listing "registered" agents.

What is a registered stonith agent and what is done while registering it?

When I configure my own fencing agent as a pacemaker stonith device
and try to do a "stonith_admin --list=nodename" I get a "no such device" error.

Any pointer appreciated.

Best regards
Andreas Mock
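As background for the API being referenced: when the cluster invokes a fence
agent, the FenceAgentAPI passes its options on stdin as one name=value pair per
line. Below is a minimal, hypothetical sketch of that calling convention in
shell; the helper bodies are placeholders, and a real agent typically also
accepts command-line options for manual use, prints full metadata, and exits
with the status codes the spec defines:

  #!/bin/sh
  # Stand-ins for the real device-specific logic.
  do_power()     { echo "would power $1 plug '$port' via $ipaddr"; }
  check_device() { echo "would contact $ipaddr to verify the device"; }
  print_metadata() {
      # a real agent prints the full XML description here
      # (see the metadata sketch shown earlier in this digest)
      echo '<resource-agent name="fence_example" shortdesc="Example fence agent"/>'
  }

  action="reboot"               # commonly the default action
  while read line; do
      case "$line" in
          action=*) action="${line#action=}" ;;
          port=*)   port="${line#port=}" ;;
          ipaddr=*) ipaddr="${line#ipaddr=}" ;;
      esac
  done

  case "$action" in
      metadata)        print_metadata ;;
      on|off|reboot)   do_power "$action" ;;
      status|monitor)  check_device ;;
      *)               exit 1 ;;
  esac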