Re: [Pacemaker] pcs equivalent of crm configure erase
Hi all, I'll bring that topic up once again because it's still unresolved for me: a) How can I do the equivalent of 'crm configure erase' in pcs? Is there a way? b) If I can't do it with pcs, is there a reliable and secure way to do it with the Pacemaker low-level tools? Thank you in advance. Best regards, Andreas

-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Monday, 15 April 2013 05:49
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] pcs equivalent of crm configure erase

On 14/04/2013, at 5:52 PM, Andreas Mock andreas.m...@web.de wrote:
> Hi all, can someone tell me what the pcs equivalent to 'crm configure erase' is? Is there a pcs cheat sheet showing the common tasks? Or documentation?

'pcs help' should be reasonably informative, but I don't see anything equivalent. Chris?

Best regards, Andreas

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] cman+pacemaker - way to join/leave cluster?
Hi, Can anyone point me to the correct procedure for adding/removing a node to/from a cluster in the cman+pacemaker stack, preferably without stopping the whole cluster? 'pcs cluster node add|remove' doesn't work, as there's no pcsd daemon in CentOS 6.4; the cman tools (cman_tool, ccs_tool, ccs_sync) don't work either, as they expect a configured PKI infrastructure, as the ricci service does... I can just edit the CIB and cluster.conf and remove/add the node by hand, but that seems a rather crude way - it requires a restart of cman+pacemaker on all nodes.

-- Yuriy Demchenko
Re: [Pacemaker] cman+pacemaker - way to join/leave cluster?
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-single/Pacemaker_Explained/index.html#_options_2 example 2.4

# cibadmin --query > tmp.xml
# vi tmp.xml
# cibadmin --replace --xml-file tmp.xml

----- Original Message -----
> Hi, Can anyone point me to the correct procedure for adding/removing a node to/from a cluster in the cman+pacemaker stack, preferably without stopping the whole cluster? [...]
> -- Yuriy Demchenko

-- Daniel Black, Engineer @ Open Query (http://openquery.com) Remote expertise maintenance for MySQL/MariaDB server environments.
Re: [Pacemaker] cman+pacemaker - way to join/leave cluster?
Yeah, that's what I meant by "edit CIB/cluster.conf by hand". However, there are two problems with it:

1. It requires a restart of cman on each node to forget the expelled node after the cluster.conf edit.
2. After editing the CIB (removing all entries belonging to the expelled node) and replacing the CIB via 'pcs cluster push cib.xml' (or cibadmin --replace; doesn't matter), the cluster DC still automatically reinserts node and node_state entries for the expelled node, which I cannot clean up:

<node_state id="node-2" uname="node-2" in_ccm="false" crmd="offline" join="down" expected="down" crm-debug-origin="do_cib_replaced"/>

Yuriy Demchenko

On 04/16/2013 12:16 PM, Daniel Black wrote:
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-single/Pacemaker_Explained/index.html#_options_2 example 2.4
> # cibadmin --query > tmp.xml
> # vi tmp.xml
> # cibadmin --replace --xml-file tmp.xml
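For the stale node_state entries, there may be a cleaner path than a full restart, depending on the Pacemaker version in use - a sketch, assuming Pacemaker 1.1.8 or later and the node name node-2 from the snippet above:

```
# Sketch, assuming Pacemaker >= 1.1.8: tell the cluster to forget the
# departed node entirely (membership caches plus CIB status section).
crm_node --force --remove node-2

# Alternatively, delete its leftover CIB entries directly:
cibadmin --delete -o nodes  --xml-text '<node uname="node-2"/>'
cibadmin --delete -o status --xml-text '<node_state uname="node-2"/>'
```

Whether the DC re-inserts the entries afterwards depends on whether cman's membership still lists the node, so the cluster.conf edit is still needed first.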
[Pacemaker] resource-stickness-issue
Hi All, I have created a cluster with these versions on Fedora 17: pacemaker-1.1.7-2.fc17.x86_64, corosync-2.0.0-1.fc17.x86_64. Everything is working fine for me except resource stickiness. Any idea on this? Regards, Rauthan
Re: [Pacemaker] pcs equivalent of crm configure erase
On Tue, Apr 16, 2013 at 9:38 AM, Andreas Mock andreas.m...@web.de wrote:
> a) How can I do the equivalent of 'crm configure erase' in pcs? Is there a way? b) If I can't do it with pcs, is there a reliable and secure way to do it with the Pacemaker low-level tools?

I don't think so. cibadmin has a drastic version of erase, but this is probably not what you want. If you don't want to use any higher-level tools, the best way is probably to make a loop and use pcs to remove the resources, since it also removes the constraints; not sure about other objects. Something like:

for r in `crm_resource -l`; do pcs resource delete $r; done

But test it first, I haven't used pcs myself yet.

Rasto

-- Dipl.-Ing. Rastislav Levrinc rasto.levr...@gmail.com Linux Cluster Management Console http://lcmc.sf.net/
[Pacemaker] pcs: Return code handling not clean
Hi all, as I don't really know where to address this issue, I post it here - on the one hand as information for people scripting with the help of 'pcs', and on the other hand in the hope that a maintainer is listening and will have a look at this.

Problem: When the cluster is down, 'pcs resource' shows an error message coming from a subprocess call of 'crm_resource -L', but exits with an error code of 0. That's something which can be improved, especially since the Python code does have error handling in other places, so I guess it is a simple oversight. Look at the following piece of code in pcs/resource.py:

915     if len(argv) == 0:
916         args = ["crm_resource", "-L"]
917         output,retval = utils.run(args)
918         preg = re.compile(r'.*(stonith:.*)')
919         for line in output.split('\n'):
920             if not preg.match(line) and line != "":
921                 print line
922         return

retval is totally ignored, while being handled in other places. That leads to the fact that the script returns with status 0. Interestingly, the error handling of the utils.run call used all over the module is IMHO a little bit inconsistent. If I remember correctly, Andrew made some effort in the past to define a set of return codes coming from the base cibXXX and crm_XXX tools (I really don't know how finely they are differentiated). Why not pass them through?

Best regards, Andreas Mock
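A minimal, self-contained Python 3 sketch of the fix being asked for (this is not the real pcs module: utils.run is stubbed here with subprocess, and the function name resource_list is invented for illustration):

```python
import re
import subprocess

def run(args):
    """Stand-in for pcs's utils.run: return (output, exit status)."""
    try:
        proc = subprocess.run(args, capture_output=True, text=True)
        return proc.stdout + proc.stderr, proc.returncode
    except OSError as e:
        # Binary missing or not executable: report failure, don't crash.
        return str(e), 1

def resource_list():
    # Same shape as the quoted pcs/resource.py snippet, except that the
    # exit status of crm_resource is propagated instead of being dropped.
    output, retval = run(["crm_resource", "-L"])
    preg = re.compile(r'.*(stonith:.*)')
    for line in output.split('\n'):
        if not preg.match(line) and line != "":
            print(line)
    return retval  # non-zero when crm_resource failed (e.g. cluster down)
```

The caller (or sys.exit) can then surface retval so scripts see a non-zero status when the cluster is down.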
[Pacemaker] Pacemaker configuration with different dependencies
Hi guys, I need some help with pacemaker configuration; it is all new to me and I can't find a solution... I have a two-node HA environment with services that I want to be partially independent, in a pacemaker/heartbeat configuration. There is an active/active sip service with two floating IPs; it should just migrate the floating IP when one sip dies. There are also two active/active master/slave services with a java container and an rdbms with replication between them, which should also fail over when one dies. What I can't figure out is how to configure those two to be independent (put an on-fail directive on a group). What I want is, e.g., in case my sip service fails, the java container stays active on that node, but the floating IP is moved to the other node. Another thing is, in case one of the rdbms instances fails, I want to put the whole service group on that node into standby, but leave the sip service intact. The whole node should go to standby (all services down) only when the L3 ping to the gateway dies. All suggestions and configuration examples are welcome. Thanks in advance.

Ivor Prebeg
Re: [Pacemaker] Pacemaker configuration with different dependencies
Hi Ivor, I don't know whether I understand you completely right: if you want independence of resources, don't put them into a group. Look at http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Pacemaker_Explained/ch10.html

A group is made to tie together several resources without declaring all the necessary colocations and orderings to get the desired behaviour. Otherwise, name your resources and how they should be spread across your cluster (show the technical dependency).

Best regards, Andreas

From: Ivor Prebeg [mailto:ivor.pre...@gmail.com]
Sent: Tuesday, 16 April 2013 13:53
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] Pacemaker configuration with different dependencies

> Hi guys, I need some help with pacemaker configuration, it is all new to me and I can't find a solution... [...] All suggestions and configuration examples are welcome. Thanks in advance.
Ivor Prebeg
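As an illustration of the advice above, a hypothetical crm shell fragment (all resource and agent names invented) that ties a floating IP to the sip service with an explicit colocation and ordering instead of a group, so nothing forces the java container to follow:

```
primitive sip_ip ocf:heartbeat:IPaddr2 params ip="192.168.1.10"
primitive sip lsb:sip-service
primitive java lsb:java-container
colocation ip_with_sip inf: sip_ip sip
order sip_before_ip inf: sip sip_ip
# no constraint mentions 'java', so it stays put when sip fails over
```

Only the resources that must move together share constraints; everything else remains independent.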
Re: [Pacemaker] pcs equivalent of crm configure erase
Hi Rastislav, thank you for your hints. In this case, to rely only on pcs, I could probably use the following to get the list of resources:

pcs resource show --all | perl -M5.010 -ane 'say $F[1] if $F[0] eq "Resource:"'

Best regards, Andreas Mock

-----Original Message-----
From: Rasto Levrinc [mailto:rasto.levr...@gmail.com]
Sent: Tuesday, 16 April 2013 10:45
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] pcs equivalent of crm configure erase

> [...] the best way is probably to make a loop and use pcs to remove the resources, since it also removes the constraints [...] something like: for r in `crm_resource -l`; do pcs resource delete $r; done. But test it first, I haven't used pcs myself yet.
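Putting the two pieces together, a pcs-only erase loop might look like this (an untested sketch - as Rasto says, try it on a test cluster first):

```
# Sketch: approximate 'crm configure erase' using only pcs.
# Deleting each resource also removes its constraints.
pcs resource show --all \
  | perl -M5.010 -ane 'say $F[1] if $F[0] eq "Resource:"' \
  | while read -r r; do
      pcs resource delete "$r"
    done
```

Cluster properties and other non-resource objects are not touched by this, so it is only a partial equivalent of 'crm configure erase'.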
Re: [Pacemaker] Pacemaker 1.1.8, Corosync, No CMAN, Promotion issues
On Fri, Apr 12, 2013 at 9:27 AM, pavan tc pavan...@gmail.com wrote:
>> Absolutely none in the syslog. Only the regular monitor logs from my resource agent, which continued to report as secondary.
> This is very strange, because the thing that caused the I_PE_CALC is a timer that goes off every 15 minutes. Which would seem to imply that there was a transition of some kind about when the failure happened - but somehow it didn't go into the logs. Could you post the complete logs from 14:00 to 14:30?

Sure. Here goes. Attached are two logs and corosync.conf:
1. syslog (edited, messages from other modules removed; I have not touched the pacemaker/corosync related messages)
2. corosync.log (unedited)
3. corosync.conf

Wanted to mention a couple of things:
-- 14:06 is when the system was coming back up from a reboot. I have started from the earliest message during boot to the point the I_PE_CALC timer popped and a promote was called.
-- I see the following during boot up. Does that mean pacemaker did not start?

Apr 10 14:06:26 corosync [pcmk ] info: process_ais_conf: Enabling MCP mode: Use the Pacemaker init script to complete Pacemaker startup

Could that contribute to any of this behaviour? I'll be glad to provide any other information.

Did anybody get a chance to look at the information attached in the previous email?

Thanks, Pavan
Re: [Pacemaker] warning: reload: operation not recognized
Hi, On Tue, Apr 09, 2013 at 08:23:43PM +0000, Xavier Lashmar wrote:
> Hi, I am investigating XML parsing errors appearing in my /var/log/cluster/corosync.log files. While doing so, I've been using the crm tool to 'view' my configuration and I get the following interesting errors:
> crm(live)# configure
> WARNING: reload: operation not recognized

It's just not in the list of possible operations. And that check seems to be misplaced a bit. Will fix that.

Thanks, Dejan

> --- pacemaker / corosync versions ---
> # rpm -qa | grep pacemaker
> pacemaker-cluster-libs-1.1.8-7.el6.x86_64
> pacemaker-1.1.8-7.el6.x86_64
> pacemaker-libs-1.1.8-7.el6.x86_64
> pacemaker-cli-1.1.8-7.el6.x86_64
> # rpm -qa | grep corosync
> corosync-1.4.1-15.el6.x86_64
> corosynclib-1.4.1-15.el6.x86_64
> Any ideas what might be causing the 'operation not recognized'? These commands work perfectly well on another cluster running pacemaker 1.1.7-6 and corosync 1.4.1-7. For example:
> # crm
> crm(live)# configure
> crm(live)configure#

Xavier Lashmar, Systems Analyst, Student services, computing and communications services. 1 Nicholas Street (810), Ottawa ON K1N 7B7. Tel. 613-562-5800 (2120)
Re: [Pacemaker] pcs: Return code handling not clean
On 04/16/13 06:46, Andreas Mock wrote:
> Problem: When the cluster is down, 'pcs resource' shows an error message coming from a subprocess call of 'crm_resource -L', but exits with an error code of 0. [...] retval is totally ignored, while being handled in other places. That leads to the fact that the script returns with status 0.

This is an oversight on my part; I've updated the code to check retval and return an error. Currently I'm not passing through the full error code (I'm only returning 0 on success and 1 on failure). However, if you think it would be useful to have this information, I would be happy to look at it and see what I can do. I'm planning on eventually having pcs interpret the crm_resource error code and provide more user-friendly output instead of just a return code.

Thanks, Chris

> Interestingly, the error handling of the utils.run call used all over the module is IMHO a little bit inconsistent. If I remember correctly, Andrew made some effort in the past to define a set of return codes coming from the base cibXXX and crm_XXX tools. Why not pass them through?
Re: [Pacemaker] pcs: Return code handling not clean
On 17/04/2013, at 8:33 AM, Chris Feist cfe...@redhat.com wrote:
> This is an oversight on my part; I've updated the code to check retval and return an error. Currently I'm not passing through the full error code (I'm only returning 0 on success and 1 on failure). [...] I'm planning on eventually having pcs interpret the crm_resource error code and provide more user-friendly output instead of just a return code.

There is a crm_perror binary that might be useful for this.
Re: [Pacemaker] pcs equivalent of crm configure erase
On 04/14/13 02:52, Andreas Mock wrote:
> Hi all, can someone tell me what the pcs equivalent to 'crm configure erase' is?

From my understanding, 'crm configure erase' will remove everything from the configuration file except for the nodes. Are you trying to clear your configuration out and start from scratch? pcs has a destroy command (pcs cluster destroy), which will remove all pacemaker/corosync configuration and allow you to create your cluster from scratch. Is this what you're looking for? Or do you need a specific command to keep the cluster running, but reset the cib to its defaults?

Thanks! Chris

> Is there a pcs cheat sheet showing the common tasks? Or documentation? Best regards, Andreas
Re: [Pacemaker] Cleanup over secondary node
Hi Andrew. On Monday, 15 April 2013 14:36:48 +1000, Andrew Beekhof wrote:
>> I'm testing a Pacemaker+Corosync cluster with KVM virtual machines. When restarting a node, I got the following status:
>>
>> # crm status
>> Last updated: Sun Apr 14 11:50:00 2013
>> Last change: Sun Apr 14 11:49:54 2013
>> Stack: openais
>> Current DC: daedalus - partition with quorum
>> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
>> 2 Nodes configured, 2 expected votes
>> 8 Resources configured.
>>
>> Online: [ atlantis daedalus ]
>>
>> Resource Group: servicios
>>   fs_drbd_servicios (ocf::heartbeat:Filesystem): Started daedalus
>>   clusterIP (ocf::heartbeat:IPaddr2): Started daedalus
>>   Mysql (ocf::heartbeat:mysql): Started daedalus
>>   Apache (ocf::heartbeat:apache): Started daedalus
>>   Pure-FTPd (ocf::heartbeat:Pure-FTPd): Started daedalus
>>   Asterisk (ocf::heartbeat:asterisk): Started daedalus
>> Master/Slave Set: drbd_serviciosClone [drbd_servicios]
>>   Masters: [ daedalus ]
>>   Slaves: [ atlantis ]
>>
>> Failed actions:
>>   Asterisk_monitor_0 (node=atlantis, call=12, rc=5, status=complete): not installed
>>
>> The problem is that if I do a cleanup of the Asterisk resource on the secondary, this has no effect. It seems that Pacemaker needs to have access to the config file of the resource.
> Not Pacemaker, the resource agent. Pacemaker runs a non-recurring monitor operation to see what state the service is in; it seems the asterisk agent needs that config file. I'd suggest changing the agent so that if the asterisk process is not running, the agent returns 7 (not running) before trying to access the config file.
I was reviewing the resource definition, assuming I might have made some reference to the Asterisk configuration file there, but this was not the case:

primitive Asterisk ocf:heartbeat:asterisk \
    params realtime="true" \
    op monitor interval="60s" \
    meta target-role="Started"

This agent is the one that is available in the resource-agents package from the Debian Backports repository:

atlantis:~# aptitude show resource-agents
Package: resource-agents
New: yes
State: installed
Automatically installed: yes
Version: 1:3.9.2-5~bpo60+1
Priority: optional
Section: admin
Maintainer: Debian HA Maintainers debian-ha-maintain...@lists.alioth.debian.org
Uncompressed size: 2,228 k
Depends: libc6 (>= 2.4), libglib2.0-0 (>= 2.12.0), libnet1 (>= 1.1.2.1), libplumb2, libplumbgpl2, cluster-glue, python
Conflicts: cluster-agents (<= 1:1.0.4-1), rgmanager (<= 3.0.12-2+b1)
Replaces: cluster-agents (<= 1:1.0.4-1), rgmanager (<= 3.0.12-2+b1)
Description: Cluster Resource Agents
 The Cluster Resource Agents are a set of scripts to interface with several services to operate in a High Availability environment for both Pacemaker and rgmanager resource managers.
Homepage: https://github.com/ClusterLabs/resource-agents

Do you know if there is any way to get the behavior that you suggested using this agent? Thanks for your reply. Regards, Daniel

-- Ing. Daniel Bareiro - GNU/Linux registered user #188.598. Proudly running Debian GNU/Linux.
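To illustrate Andrew's suggestion, here is a sketch of what the changed monitor logic could look like (this is not the actual resource-agents code: the config path and the pgrep check are assumptions, and 7/5 are the standard OCF codes for "not running" / "not installed"):

```shell
#!/bin/sh
# Sketch of the suggested monitor order: report "not running" before
# touching the config file, so probes on a standby node do not fail.
OCF_NOT_RUNNING=7
OCF_ERR_INSTALLED=5
ASTERISK_CONF=${ASTERISK_CONF:-/etc/asterisk/asterisk.conf}  # assumed path

asterisk_monitor() {
    # If the daemon is not running, say so first -- no config file is
    # needed just to answer a probe.
    if ! pgrep -x asterisk >/dev/null 2>&1; then
        return $OCF_NOT_RUNNING
    fi
    # Only a running instance needs its configuration validated.
    [ -r "$ASTERISK_CONF" ] || return $OCF_ERR_INSTALLED
    return 0
}
```

With this ordering, the probe on atlantis would return 7 (stopped) instead of 5 (not installed), and the cleanup would behave as expected.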
Re: [Pacemaker] Cleanup over secondary node
On 17/04/2013, at 11:28 AM, Daniel Bareiro daniel-lis...@gmx.net wrote:
> This agent is the one that is available in the resource-agents package from the Debian Backports repository (1:3.9.2-5~bpo60+1). [...] Do you know if there is any way to get the behavior that you suggested using this agent?

You'll have to edit it and submit the changes upstream. If whatever it is looking for is not found when a monitor is requested, it should probably return 7 (STOPPED).
Daniel Bareiro - GNU/Linux registered user #188.598
Proudly running Debian GNU/Linux with uptime: 21:54:06 up 52 days, 6:01, 11 users, load average: 0.00, 0.02, 0.00

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Question about the error when fencing failed
This should solve your issue: https://github.com/beekhof/pacemaker/commit/dbbb6a6

On 11/04/2013, at 7:23 PM, Kazunori INOUE inouek...@intellilink.co.jp wrote:

Hi Andrew,

(13.04.08 11:04), Andrew Beekhof wrote:
On 05/04/2013, at 3:21 PM, Kazunori INOUE inouek...@intellilink.co.jp wrote:

Hi,
When fencing failed (*1) under the following conditions, an error occurred in stonith_perform_callback():
- using fencing-topology. (*2)
- fencing the DC node. ($ crm node fence dev2)

Apr 3 17:04:47 dev2 stonith-ng[2278]: notice: handle_request: Client crmd.2282.b9e69280 wants to fence (reboot) 'dev2' with device '(any)'
Apr 3 17:04:47 dev2 stonith-ng[2278]: notice: handle_request: Forwarding complex self fencing request to peer dev1
Apr 3 17:04:47 dev2 stonith-ng[2278]: info: stonith_command: Processed st_fence from crmd.2282: Operation now in progress (-115)
Apr 3 17:04:47 dev2 pengine[2281]: warning: process_pe_message: Calculated Transition 2: /var/lib/pacemaker/pengine/pe-warn-0.bz2
Apr 3 17:04:47 dev2 stonith-ng[2278]: info: stonith_command: Processed st_query from dev1: OK (0)
Apr 3 17:04:47 dev2 stonith-ng[2278]: info: stonith_action_create: Initiating action list for agent fence_legacy (target=(null))
Apr 3 17:04:47 dev2 stonith-ng[2278]: info: stonith_command: Processed st_timeout_update from dev1: OK (0)
Apr 3 17:04:47 dev2 stonith-ng[2278]: info: dynamic_list_search_cb: Refreshing port list for f-dev1
Apr 3 17:04:48 dev2 stonith-ng[2278]: notice: remote_op_done: Operation reboot of dev2 by dev1 for crmd.2282@dev1.4494ed41: Generic Pacemaker error
Apr 3 17:04:48 dev2 stonith-ng[2278]: info: stonith_command: Processed st_notify reply from dev1: OK (0)
Apr 3 17:04:48 dev2 crmd[2282]: error: crm_abort: stonith_perform_callback: Triggered assert at st_client.c:1894 : call_id 0
Apr 3 17:04:48 dev2 crmd[2282]: error: stonith_perform_callback: Bad result st-reply st_origin=stonith_construct_reply t=stonith-ng st_rc=-201 st_op=st_query st_callid=0 st_clientid=b9e69280-e557-478e-aa94-fd7ca6a533b1 st_clientname=crmd.2282 st_remote_op=4494ed41-2306-4707-8406-fa066b7f3ef0 st_callopt=0 st_delegate=dev1
Apr 3 17:04:48 dev2 crmd[2282]: error: stonith_perform_callback: Bad result st_calldata
Apr 3 17:04:48 dev2 crmd[2282]: error: stonith_perform_callback: Bad result st-reply t=st_notify subt=broadcast st_op=reboot count=1 src=dev1 state=4 st_target=dev2
Apr 3 17:04:48 dev2 crmd[2282]: error: stonith_perform_callback: Bad result st_calldata
Apr 3 17:04:48 dev2 crmd[2282]: error: stonith_perform_callback: Bad result st_notify_fence state=4 st_rc=-201 st_target=dev2 st_device_action=reboot st_delegate=dev1 st_remote_op=4494ed41-2306-4707-8406-fa066b7f3ef0 st_origin=dev1 st_clientid=b9e69280-e557-478e-aa94-fd7ca6a533b1 st_clientname=crmd.2282/
Apr 3 17:04:48 dev2 crmd[2282]: error: stonith_perform_callback: Bad result /st_calldata
Apr 3 17:04:48 dev2 crmd[2282]: error: stonith_perform_callback: Bad result /st-reply
Apr 3 17:04:48 dev2 crmd[2282]: error: stonith_perform_callback: Bad result /st_calldata
Apr 3 17:04:48 dev2 crmd[2282]: error: stonith_perform_callback: Bad result /st-reply
Apr 3 17:04:48 dev2 crmd[2282]: warning: stonith_perform_callback: STONITH command failed: Generic Pacemaker error
Apr 3 17:04:48 dev2 crmd[2282]: notice: tengine_stonith_notify: Peer dev2 was not terminated (st_notify_fence) by dev1 for dev1: Generic Pacemaker error (ref=4494ed41-2306-4707-8406-fa066b7f3ef0) by client crmd.2282
Apr 3 17:07:11 dev2 crmd[2282]: error: stonith_async_timeout_handler: Async call 2 timed out after 144000ms

Is this the designed behavior?

Definitely not :-(

Is this the first fencing operation that has been initiated by the cluster?

Yes. I attached crm_report.

Or has the cluster been running for some time?

Best Regards,
Kazunori INOUE

*1: I added "exit 1" to reset() of the stonith plugin in order to make fencing fail.
$ diff -u libvirt.ORG libvirt
--- libvirt.ORG 2012-12-17 09:56:37.0 +0900
+++ libvirt 2013-04-03 16:33:08.118157947 +0900
@@ -240,6 +240,7 @@
 ;;
 reset)
+exit 1
 libvirt_check_config
 libvirt_set_domain_id $2

*2:
node $id=3232261523 dev2
node $id=3232261525 dev1
primitive f-dev1 stonith:external/libvirt \
    params pcmk_reboot_retries=1 hostlist=dev1 \
    hypervisor_uri=qemu+ssh://bl460g1n5/system
primitive f-dev2 stonith:external/libvirt \
    params pcmk_reboot_retries=1 hostlist=dev2 \
    hypervisor_uri=qemu+ssh://bl460g1n6/system
location rsc_location-f-dev1 f-dev1 \
    rule $id=rsc_location-f-dev1-rule -inf: #uname eq dev1
location rsc_location-f-dev2 f-dev2 \
    rule $id=rsc_location-f-dev2-rule -inf: #uname eq dev2
fencing_topology \
Re: [Pacemaker] Question about recovery policy after Too many failures to fence
On 11/04/2013, at 7:23 PM, Kazunori INOUE inouek...@intellilink.co.jp wrote:

Hi Andrew,

(13.04.08 12:01), Andrew Beekhof wrote:
On 27/03/2013, at 7:45 PM, Kazunori INOUE inouek...@intellilink.co.jp wrote:

Hi,
I'm using pacemaker-1.1 (c7910371a5, the latest devel). When fencing failed 10 times, the S_TRANSITION_ENGINE state was kept. (related: https://github.com/ClusterLabs/pacemaker/commit/e29d2f9) How should I recover? What procedure should I use to get back to S_IDLE?

The intention was that the node should proceed to S_IDLE when this occurs, so you shouldn't have to do anything and the cluster would try again once the recheck-interval expired or a config change was made. I assume you're saying this does not occur?

It appears the cluster-recheck-interval timer does not fire while in S_TRANSITION_ENGINE, so even after waiting a long time, it was still S_TRANSITION_ENGINE. (I attached crm_report.)

I think https://github.com/beekhof/pacemaker/commit/ef8068e9 should fix this part of the problem.

What do I have to do in order to make the cluster retry STONITH? For example, do I need to run 'crmadmin -E' or make a config change?

Best Regards,
Kazunori INOUE

Mar 27 15:34:34 dev2 crmd[17937]: notice: tengine_stonith_callback: Stonith operation 12/22:14:0:0927a8a0-8e09-494e-acf8-7fb273ca8c9e: Generic Pacemaker error (-1001)
Mar 27 15:34:34 dev2 crmd[17937]: notice: tengine_stonith_callback: Stonith operation 12 for dev2 failed (Generic Pacemaker error): aborting transition.
Mar 27 15:34:34 dev2 crmd[17937]: info: abort_transition_graph: tengine_stonith_callback:426 - Triggered transition abort (complete=0) : Stonith failed
Mar 27 15:34:34 dev2 crmd[17937]: notice: tengine_stonith_notify: Peer dev2 was not terminated (st_notify_fence) by dev1 for dev2: Generic Pacemaker error (ref=05f75ab8-34ae-4aae-bbc6-aa20dbfdc845) by client crmd.17937
Mar 27 15:34:34 dev2 crmd[17937]: notice: run_graph: Transition 14 (Complete=1, Pending=0, Fired=0, Skipped=8, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Mar 27 15:34:34 dev2 crmd[17937]: notice: too_many_st_failures: Too many failures to fence dev2 (11), giving up

$ crmadmin -S dev2
Status of crmd@dev2: S_TRANSITION_ENGINE (ok)

$ crm_mon
Last updated: Wed Mar 27 15:35:12 2013
Last change: Wed Mar 27 15:33:16 2013 via cibadmin on dev1
Stack: corosync
Current DC: dev2 (3232261523) - partition with quorum
Version: 1.1.10-1.el6-c791037
2 Nodes configured, unknown expected votes
3 Resources configured.
Node dev2 (3232261523): UNCLEAN (online)
Online: [ dev1 ]

prmDummy (ocf::pacemaker:Dummy): Started dev2 FAILED
Resource Group: grpStonith1
    prmStonith1 (stonith:external/stonith-helper): Started dev2
Resource Group: grpStonith2
    prmStonith2 (stonith:external/stonith-helper): Started dev1

Failed actions:
    prmDummy_monitor_1 (node=dev2, call=23, rc=7, status=complete): not running

Best Regards,
Kazunori INOUE

(attachment: too-many-failures-to-fence.tar.bz2)

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] pcs equivalent of crm configure erase
Hi Chris,

I would like to see something that lets you start your pacemaker configuration (only) from scratch, in a way that you know nothing is left (constraints, etc.).

Best regards
Andreas

-----Original Message-----
From: Chris Feist [mailto:cfe...@redhat.com]
Sent: Wednesday, 17 April 2013 00:23
To: The Pacemaker cluster resource manager
Cc: Andreas Mock
Subject: Re: [Pacemaker] pcs equivalent of crm configure erase

On 04/14/13 02:52, Andreas Mock wrote:
Hi all, can someone tell me what the pcs equivalent to crm configure erase is?

From my understanding, 'crm configure erase' will remove everything from the configuration file except for the nodes. Are you trying to clear your configuration out and start from scratch? pcs has a destroy command (pcs cluster destroy), which will remove all pacemaker/corosync configuration and allow you to create your cluster from scratch. Is this what you're looking for? Or do you need a specific command to keep the cluster running, but reset the cib to its defaults?

Thanks!
Chris

Is there a pcs cheat sheet showing the common tasks? Or documentation?

Best regards
Andreas

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
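[Editor's note on the low-level question: the closest low-level equivalent I know of is cibadmin's erase operation, which empties the configuration section of the live CIB while the cluster keeps running. This is a sketch, not an endorsed procedure; use with care on a live cluster, and note that whether node entries are repopulated automatically can depend on stack and version.]

```shell
# Back up the current CIB first (plain XML dump of the live CIB).
cibadmin --query > /tmp/cib-backup.xml

# Erase the live CIB configuration. --force is required because the
# operation is destructive; node entries are typically re-added by
# the membership layer afterwards.
cibadmin --erase --force
```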
Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?
On 15/04/2013, at 7:08 PM, Pavlos Parissis pavlos.paris...@gmail.com wrote:

Hoi,
I upgraded the 1st node and here are the logs:
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node1.debuglog
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node2.debuglog
Enabling tracing on the mentioned functions didn't give me, at least, any more information.

10:22:08 pacemakerd[53588]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log

That's the file(s) we need :)

Cheers,
Pavlos

On 15 April 2013 01:42, Andrew Beekhof and...@beekhof.net wrote:
On 15/04/2013, at 7:31 AM, Pavlos Parissis pavlos.paris...@gmail.com wrote:
On 12/04/2013 09:37 μμ, Pavlos Parissis wrote:

Hoi,
As I wrote in another post[1], I failed to upgrade to 1.1.8 on a 2-node cluster. Before the upgrade, both nodes were using CentOS 6.3, corosync 1.4.1-7 and pacemaker 1.1.7. I followed the rolling upgrade process, so I stopped pacemaker and then corosync on node1 and upgraded to CentOS 6.4. The OS upgrade also upgrades pacemaker to 1.1.8-7 and corosync to 1.4.1-15. The upgrade of the rpms went smoothly, as I knew about the crmsh issue and had made sure I had the crmsh rpm in my repos. Corosync started without any problems and both nodes could see each other[2]. But for some reason node2 failed to receive a reply to its join offer from node1, and node1 never joined the cluster. Node1 formed a new cluster as it never got a reply from node2, so I ended up with a split-brain situation. Logs of node1 can be found here:
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node1.log
and of node2 here:
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node2.log
Doing a Disconnect & Reattach upgrade of both nodes at the same time brings me a working 1.1.8 cluster. Any attempt to make a 1.1.8 node join a cluster with a 1.1.7 node failed.
There wasn't enough detail in the logs to suggest a solution, but if you add the following to /etc/sysconfig/pacemaker and re-test, it might shed some additional light on the problem:

export PCMK_trace_functions=ais_dispatch_message

Certainly there was no intention to make them incompatible.

Cheers,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] pcs: Return code handling not clean
Hi Chris,

I just saw in the github repo - which I found after posting here - that you made a fix. Thank you for the very fast reaction.

Best regards
Andreas

-----Original Message-----
From: Chris Feist [mailto:cfe...@redhat.com]
Sent: Wednesday, 17 April 2013 00:34
To: The Pacemaker cluster resource manager; Andreas Mock
Subject: Re: [Pacemaker] pcs: Return code handling not clean

On 04/16/13 06:46, Andreas Mock wrote:
Hi all,
as I don't really know where to address this issue, I post it here: on the one hand as information for people scripting with the help of 'pcs', and on the other hand in the hope that a maintainer is listening and will have a look at this.

Problem: When the cluster is down, 'pcs resource' shows an error message coming from a subprocess call of 'crm_resource -L' but exits with an error code of 0. That's something which can be improved, especially since the python code does have error handling in other places, so I guess it is a simple oversight. Look at the following piece of code in pcs/resource.py:

915     if len(argv) == 0:
916         args = ["crm_resource", "-L"]
917         output, retval = utils.run(args)
918         preg = re.compile(r'.*(stonith:.*)')
919         for line in output.split('\n'):
920             if not preg.match(line) and line != "":
921                 print line
922         return

retval is totally ignored, while being handled in other places. That leads to the fact that the script returns with status 0.

This is an oversight on my part; I've updated the code to check retval and return an error. Currently I'm not passing through the full error code (I'm only returning 0 on success and 1 on failure). However, if you think it would be useful to have this information, I would be happy to look at it and see what I can do. I'm planning on eventually having pcs interpret the crm_resource error code and provide a more user-friendly output instead of just a return code.
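[Editor's note: the fix under discussion amounts to propagating the child process's exit status instead of dropping it. A minimal self-contained sketch of the idea follows; the names `run` and `list_resources` are my own stand-ins, not the actual pcs code.]

```python
import re
import subprocess
import sys

def run(args):
    """Stand-in for pcs's utils.run(): return (combined output, exit status)."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    output, _ = proc.communicate()
    return output.decode(), proc.returncode

def list_resources():
    """Print non-stonith resource lines and propagate crm_resource's status."""
    output, retval = run(["crm_resource", "-L"])
    preg = re.compile(r'.*(stonith:.*)')
    for line in output.split('\n'):
        if not preg.match(line) and line != "":
            print(line)
    return retval  # propagate instead of the bare "return" in the original

if __name__ == "__main__":
    sys.exit(list_resources())
```

With this shape, a caller (or the shell) sees crm_resource's own exit status, so scripts can distinguish "no resources" from "cluster is down".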
Thanks,
Chris

Interestingly, the error handling of the utils.run call used all over the module is IMHO a little bit inconsistent. If I remember correctly, Andrew made some effort in the past to have a defined set of return codes coming from the base cibXXX and crm_XXX tools. (I really don't know how much they are differentiated.) Why not pass them through?

Best regards
Andreas Mock

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] pacemakerd does not daemonize?
On 10/04/2013, at 2:21 PM, Andrei Belov defana...@gmail.com wrote:
On Apr 10, 2013, at 2:06, Andrew Beekhof and...@beekhof.net wrote:
On 09/04/2013, at 4:13 PM, Andrei Belov defana...@gmail.com wrote:

Hello pacemaker users,

I noticed that neither the -p nor the -f option makes any sense for pacemakerd - pid_file is never used, and the f option is marked as Legacy. Has the ability to run as a daemon disappeared completely?

Is pacemakerd insufficient? This is what the init script uses.

That's ok, I was just a little confused by the meaningless options in pacemakerd --help.

I've updated it to:

[03:27 PM] beekhof@f17 ~/Development/sources/pacemaker/devel ☺
# mcp/pacemakerd --help
pacemakerd - Start/Stop Pacemaker

Usage: pacemakerd mode [options]
Options:
 -?, --help             This text
 -$, --version          Version information
 -V, --verbose          Increase debug output
 -S, --shutdown         Instruct Pacemaker to shutdown on this machine
 -F, --features         Display the full version and list of features Pacemaker was built with

Additional Options:
 -f, --foreground       (Ignored) Pacemaker always runs in the foreground
 -p, --pid-file=value   (Ignored) Daemon pid file location

Report bugs to pacemaker@oss.clusterlabs.org

Also I'd like to know if there are any reasons to worry about the following:

Absolutely... four processes crashed/aborted.
Apr 08 19:54:20 [6025] pacemakerd: info: pcmk_child_exit: Child process crmd exited (pid=6031, rc=0)
Apr 08 19:54:20 [6025] pacemakerd: info: pcmk_child_exit: Child process pengine exited (pid=6030, rc=0)
Apr 08 19:54:24 [6025] pacemakerd: notice: pcmk_child_exit: Child process attrd terminated with signal 6 (pid=6029, core=128)
Apr 08 19:54:29 [6025] pacemakerd: notice: pcmk_child_exit: Child process lrmd terminated with signal 6 (pid=6028, core=128)
Apr 08 19:54:33 [6025] pacemakerd: notice: pcmk_child_exit: Child process stonith-ng terminated with signal 6 (pid=6027, core=128)
Apr 08 19:54:38 [6025] pacemakerd: notice: pcmk_child_exit: Child process cib terminated with signal 6 (pid=6026, core=128)

Why could some helper daemons be terminated via abort()?

Something _really_ bad happened.

I suspect something wrong with pacemaker + libqb and QB_IPC_SOCKET. Would appreciate any advice - my knowledge of pacemaker/libqb internals is very limited.

It looks like the reason for the abort() is somewhere in qb_ipcs_connection_unref():

This is on non-linux, right? I think Angus was of the opinion that $thing_i_cant_remember did reference counting a bit differently on non-linux. I'm not sure he made much progress with it. Can you confirm which arch this is before we continue?

Core was generated by `/opt/local/libexec/pacemaker/attrd'.
Program terminated with signal 6, Aborted.
#0 0xfd7fff0e061a in _lwp_kill () from /lib/64/libc.so.1
(gdb) bt
#0 0xfd7fff0e061a in _lwp_kill () from /lib/64/libc.so.1
#1 0xfd7fff0d4ddd in thr_kill () from /lib/64/libc.so.1
#2 0xfd7fff06a971 in raise () from /lib/64/libc.so.1
#3 0xfd7fff0400a1 in abort () from /lib/64/libc.so.1
#4 0xfd7fff0403f5 in _assert () from /lib/64/libc.so.1
#5 0xfd7fc021274e in qb_ipcs_connection_unref () from /opt/local/lib/libqb.so.0
#6 0x004044f9 in main ()

Core was generated by `/opt/local/libexec/pacemaker/cib'.
Program terminated with signal 6, Aborted.
#0 0xfd7fff0f061a in _lwp_kill () from /lib/64/libc.so.1
(gdb) bt
#0 0xfd7fff0f061a in _lwp_kill () from /lib/64/libc.so.1
#1 0xfd7fff0e4ddd in thr_kill () from /lib/64/libc.so.1
#2 0xfd7fff07a971 in raise () from /lib/64/libc.so.1
#3 0xfd7fff0500a1 in abort () from /lib/64/libc.so.1
#4 0xfd7fff0503f5 in _assert () from /lib/64/libc.so.1
#5 0xfd7fc021274e in qb_ipcs_connection_unref () from /opt/local/lib/libqb.so.0
#6 0x00410438 in cib_shutdown ()
#7 0xfd7fbfc5533f in crm_signal_dispatch (source=0x49be80, callback=<optimized out>, userdata=<optimized out>) at mainloop.c:203
#8 0xfd7fc555f9e0 in g_main_context_dispatch () from /opt/local/lib/libglib-2.0.so.0
#9 0xfd7fc555fd40 in g_main_context_iterate.isra.24 () from /opt/local/lib/libglib-2.0.so.0
#10 0xfd7fc5560152 in g_main_loop_run () from /opt/local/lib/libglib-2.0.so.0
#11 0x00411056 in cib_init ()
#12 0x0041163e in main ()

Core was generated by `/opt/local/libexec/pacemaker/lrmd'.
Program terminated with signal 6, Aborted.
#0 0xfd7fff0e061a in _lwp_kill () from /lib/64/libc.so.1
(gdb) bt
#0 0xfd7fff0e061a in _lwp_kill () from /lib/64/libc.so.1
#1 0xfd7fff0d4ddd in thr_kill () from /lib/64/libc.so.1
#2 0xfd7fff06a971 in raise () from /lib/64/libc.so.1
#3 0xfd7fff0400a1 in abort () from /lib/64/libc.so.1
#4 0xfd7fff0403f5 in _assert () from /lib/64/libc.so.1
#5