Re: [Pacemaker] Node fails to rejoin cluster
On 02/08/2013 04:59 AM, Andrew Beekhof wrote:
Suggests it's a bug that got fixed recently. Keep an eye out for 1.1.9 in the next week or so (or you could try building from source if you're in a hurry).

Will 1.1.9 be CentOS 5.x friendly?

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] Periodically appear non-existent nodes
On 04/17/2012 03:46 PM, ruslan usifov wrote:
2012/4/17 Andreas Kurz <andr...@hastexo.com>:
On 04/14/2012 11:14 PM, ruslan usifov wrote:
Hello

I removed 2 nodes from the cluster with the following sequence:

crm_node --force -R <id of node1>
crm_node --force -R <id of node2>
cibadmin --delete --obj_type nodes --crm_xml '<node uname="node1"/>'
cibadmin --delete --obj_type status --crm_xml '<node_state uname="node1"/>'
cibadmin --delete --obj_type nodes --crm_xml '<node uname="node2"/>'
cibadmin --delete --obj_type status --crm_xml '<node_state uname="node2"/>'

The nodes were deleted after this, but if, for example, I restart (reboot) one of the remaining nodes in the working cluster, the deleted nodes appear again in the OFFLINE state.

I had this problem some time ago. I solved it with something like this:

crm node delete NODENAME
crm_node --force --remove NODENAME
cibadmin --delete --obj_type nodes --crm_xml '<node uname="NODENAME"/>'
cibadmin --delete --obj_type status --crm_xml '<node_state uname="NODENAME"/>'

--
Best regards,
Proskurin Kirill
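For convenience, the four-step removal quoted above can be wrapped in a small shell helper. This is only a sketch of the same sequence; the function name is mine, and it assumes the node's name matches its uname and that crm, crm_node and cibadmin are on the PATH:

    #!/bin/sh
    # remove_node: purge one node from both the membership and the CIB.
    remove_node() {
        name="$1"
        crm node delete "$name"                # delete via the crm shell
        crm_node --force --remove "$name"      # drop it from the membership cache
        # scrub any leftover <node> and <node_state> entries from the CIB
        cibadmin --delete --obj_type nodes --crm_xml "<node uname=\"$name\"/>"
        cibadmin --delete --obj_type status --crm_xml "<node_state uname=\"$name\"/>"
    }

    remove_node node1
    remove_node node2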
Re: [Pacemaker] Questions about reasonable cluster size...
On 10/20/2011 03:15 AM, Steven Dake wrote:
On 10/19/2011 01:50 PM, Alan Robertson wrote:
Hi,

I have an application where having a 12-node cluster with about 250 resources would be desirable. Is this reasonable? Can Pacemaker+Corosync be expected to reliably handle a cluster of this size? If not, what is the current recommendation for the maximum number of nodes and resources?

I started to have problems with 10+ nodes. It's heavily dependent on the corosync configuration, AFAIK. You should test it.

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue
Hello Beekhof.

First of all, I don't want to waste your time, but this problem is really important for me; I can't solve it by myself, and it looks like a bug or something. I think I failed at describing this problem, so I will try again and summarize the whole previous conversation.

I have a situation where Pacemaker thinks a resource is running, but it is not. The agent, run from the console, says it is not running. I have no fencing, and this resource failed to stop due to a timeout. You said that this is the reason for the situation. But I ran an experiment and found that if pcmk can't stop a resource, it makes it unmanaged. My resource was not unmanaged - it just said it was running, and I had no indication of a problem. We have already fixed these non-stoppable scripts, but I want to be sure that I will not run into this problem any more.

Below are some quotes from the previous conversation, if needed.

On 12.10.2011 6:11, Andrew Beekhof wrote:
On 10/03/2011 05:32 AM, Andrew Beekhof wrote:

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with ver: 1

2) This one is scary. I have twice run into a situation where Pacemaker thinks a resource is started but it is not.

RA is misbehaving. Pacemaker will only consider a resource running if the RA tells us it is (running or in a failed state).

But as you can see below, the agent returns 7.

It's still broken. Not one stop action succeeds.

Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop process (PID 4082) timed out (try 1). Killing with signal SIGTERM (15).
Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop process (PID 21859) timed out (try 1). Killing with signal SIGTERM (15).
Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop process (PID 24576) timed out (try 1). Killing with signal SIGTERM (15).

/That/ is why pacemaker thinks it's still running.

I ran an experiment. I created a script that does not die on SIGTERM:

#!/usr/bin/perl
$SIG{TERM} = 'IGNORE';
sleep 1 while 1;

And ran it under Pacemaker. I ran 3 tests:

1) primitive test-kill-15.pl ocf:mail.ru:generic \
       op monitor interval=20 timeout=5 on-fail=restart \
       params binfile=/tmp/test-kill-15.pl external_pidfile=1
2) The same, but with on-fail=block
3) The same, but with meatware stonith

Each time I did:

crm resource stop test-kill-15.pl

And in cases 1 and 2 I got "unmanaged" on this resource.

Because you've not configured any fencing devices.

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue
On 10/05/2011 04:19 AM, Andrew Beekhof wrote:
On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill <k.prosku...@corp.mail.ru> wrote:
On 10/03/2011 05:32 AM, Andrew Beekhof wrote:

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with ver: 1

2) This one is scary. I have twice run into a situation where Pacemaker thinks a resource is started but it is not.

RA is misbehaving. Pacemaker will only consider a resource running if the RA tells us it is (running or in a failed state).

But as you can see below, the agent returns 7.

It's still broken. Not one stop action succeeds.

Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop process (PID 4082) timed out (try 1). Killing with signal SIGTERM (15).
Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop process (PID 21859) timed out (try 1). Killing with signal SIGTERM (15).
Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop process (PID 24576) timed out (try 1). Killing with signal SIGTERM (15).

/That/ is why pacemaker thinks it's still running.

Hm, I think in this situation it should become unmanaged, no?

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue
On 10/05/2011 04:19 AM, Andrew Beekhof wrote:
On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill <k.prosku...@corp.mail.ru> wrote:
On 10/03/2011 05:32 AM, Andrew Beekhof wrote:

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with ver: 1

2) This one is scary. I have twice run into a situation where Pacemaker thinks a resource is started but it is not.

RA is misbehaving. Pacemaker will only consider a resource running if the RA tells us it is (running or in a failed state).

But as you can see below, the agent returns 7.

It's still broken. Not one stop action succeeds.

Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop process (PID 4082) timed out (try 1). Killing with signal SIGTERM (15).
Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop process (PID 21859) timed out (try 1). Killing with signal SIGTERM (15).
Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop process (PID 24576) timed out (try 1). Killing with signal SIGTERM (15).

/That/ is why pacemaker thinks it's still running.

I ran an experiment. I created a script that does not die on SIGTERM:

#!/usr/bin/perl
$SIG{TERM} = 'IGNORE';
sleep 1 while 1;

And ran it under Pacemaker. I ran 3 tests:

1) primitive test-kill-15.pl ocf:mail.ru:generic \
       op monitor interval=20 timeout=5 on-fail=restart \
       params binfile=/tmp/test-kill-15.pl external_pidfile=1
2) The same, but with on-fail=block
3) The same, but with meatware stonith

Each time I did:

crm resource stop test-kill-15.pl

In cases 1 and 2 I got "unmanaged" on this resource. In case 3 I got a stonith situation.

From IRC:
(12:20:44 PM) beekhof: Oloremo: what the hell is the cluster supposed to do if stop fails and you dont want fencing? it cant start it anywhere because its still active in the original location
(12:30:09 PM) Oloremo: I get the point, really. But maybe it should make it unmanaged?

And it does. So can I assume that my problem with monitoring is still not clear? There I did not get "unmanaged" - Pacemaker just thinks the resource is started, but it's not.

--
Best regards,
Proskurin Kirill
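For completeness, the fencing setup used in test 3 would look roughly like this in the crm shell. This is a hedged sketch rather than the exact configuration from the test: the primitive id and the hostlist value are assumptions, and meatware fencing additionally requires an operator to confirm each kill with meatclient before the cluster proceeds:

    property stonith-enabled=true
    primitive fence-meat stonith:meatware \
        params hostlist="mysender34.mail.ru"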
Re: [Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue
On 10/03/2011 05:32 AM, Andrew Beekhof wrote:

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with ver: 1

2) This one is scary. I have twice run into a situation where Pacemaker thinks a resource is started but it is not.

RA is misbehaving. Pacemaker will only consider a resource running if the RA tells us it is (running or in a failed state).

But as you can see below, the agent returns 7. We use a slightly modified version of the anything agent for our scripts, but it is aware of OCF return codes and other stuff. I ran the monitor with our agent from the console:

# env -i ; OCF_ROOT=/usr/lib/ocf OCF_RESKEY_binfile=/usr/local/mpop/bin/my/dialogues_notify.pl /usr/lib/ocf/resource.d/mail.ru/generic monitor
# generic[14992]: DEBUG: default monitor : 7

So our agent says it is not running, but Pacemaker still thinks it is. It ran like this for 2 days until I forced a cleanup - and then it found it was not running within seconds.

Did you configure a recurring monitor operation?

Of course. I included my primitive configuration in the original letter; it has:

op monitor interval=30 timeout=300 on-fail=restart \

I have hit this a third time, and this time I found this in the logs:

Oct 01 02:00:12 mysender34.mail.ru pengine: [26301]: notice: unpack_rsc_op: Ignoring expired failure tranprocessor_stop_0 (rc=-2, magic=2:-2;121:690:0:4c16dc39-1fd3-41f2-b582-0236f6b6eccc) on mysender34.mail.ru

The resource name is different because these logs are from the third occurrence, but the problem is the same.

3) This one is confusing and dangerous. I use failure-timeout on most resources to wipe out temporary warn messages from crm_verify -LV - I use it for monitoring the cluster. All works well, but I found this:

1) A resource can't start on a node and migrates to the next one.
2) It can't start there either, nor on any other node.
3) It gives up and stops. There are many errors about all this in crm_verify -LV - and that is good.
4) failure-timeout comes and... wipes out all the errors.
5) We have a stopped resource and all the errors are wiped. And we don't know if it was stopped by the hands of an admin or because of errors.

I think failure-timeout should not be applied to a stopped resource. Any chance to avoid this?

Not sure why you think this is dangerous, the cluster is doing exactly what you told it to. If you want resources to stay stopped either set failure-timeout=0 (disabled) or set the target-role to Stopped.

No, I want to use failure-timeout, but not to wipe out errors when the resource has already been stopped by Pacemaker because of errors rather than by an admin's hands.

--
Best regards,
Proskurin Kirill
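For reference, the two knobs Andrew mentions map onto the crm shell roughly like this. A hedged sketch only; myres is a placeholder resource name:

    # Disable failure expiry, so errors stay visible until cleaned up by hand:
    crm resource meta myres set failure-timeout 0

    # Or stop the resource explicitly, which sets target-role=Stopped:
    crm resource stop myres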
[Pacemaker] Ignoring expired failure
Hello all.

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with ver: 1

I ran into the monitoring failure again and still don't know why it happens. Details are here:
http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg09986.html

Some info: I have twice run into a situation where Pacemaker thinks a resource is started but it is not. We use a slightly modified version of the anything agent for our scripts, but it is aware of OCF return codes and other stuff. I ran the monitor with our agent from the console:

# env -i ; OCF_ROOT=/usr/lib/ocf OCF_RESKEY_binfile=/usr/local/mpop/bin/my/tranprocessor.pl /usr/lib/ocf/resource.d/mail.ru/generic monitor
# generic[14992]: DEBUG: default monitor : 7

But this time I see in the logs:

Oct 01 02:00:12 mysender34.mail.ru pengine: [26301]: notice: unpack_rsc_op: Ignoring expired failure tranprocessor_stop_0 (rc=-2, magic=2:-2;121:690:0:4c16dc39-1fd3-41f2-b582-0236f6b6eccc) on mysender34.mail.ru

So Pacemaker knows the resource may be down, but is ignoring it. Why?

--
Best regards,
Proskurin Kirill
[Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue
Hello all.

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with ver: 1

I ran into some problems this week. I am not sure whether I should have sent 3 separate letters; sorry if so.

1) I set a node to standby and then back to online, and after this I get:

 2643 root     RT 0 11424 2052 1744 R 100.9 0.0 657502:53 /usr/lib/heartbeat/stonithd
 2644 hacluste RT 0 12432 3440 2240 R 100.9 0.0 657502:43 /usr/lib/heartbeat/cib
 2648 hacluste RT 0 11828 2860 2456 R 100.9 0.0 657502:45 /usr/lib/heartbeat/crmd
 2646 hacluste RT 0 11764 2240 1904 R  99.9 0.0 657502:49 /usr/lib/heartbeat/attrd

I was in a hurry and it's a production server, so I killed these processes and stopped pacemakerd and corosync, then started them again, and all was OK. I believe pacemakerd and corosync were still running while this problem occurred; I assume this because when I ran stop from their init scripts it took some time for them to stop. Any hints?

2) This one is scary. I have twice run into a situation where Pacemaker thinks a resource is started but it is not. We use a slightly modified version of the anything agent for our scripts, but it is aware of OCF return codes and other stuff. I ran the monitor with our agent from the console:

# env -i ; OCF_ROOT=/usr/lib/ocf OCF_RESKEY_binfile=/usr/local/mpop/bin/my/dialogues_notify.pl /usr/lib/ocf/resource.d/mail.ru/generic monitor
# generic[14992]: DEBUG: default monitor : 7

So our agent says it is not running, but Pacemaker still thinks it is. It ran like this for 2 days until I forced a cleanup - and then it found it was not running within seconds. This is a really scary situation. I can't reproduce it, but I have already hit it twice... maybe more that I did not see, who knows.

I attach our agent script; this is how we run it:

primitive dialogues_notify.pl ocf:mail.ru:generic \
    op monitor interval=30 timeout=300 on-fail=restart \
    op start interval=0 timeout=300 \
    op stop interval=0 timeout=300 \
    params binfile=/usr/local/mpop/bin/my/dialogues_notify.pl \
    meta failure-timeout=120

3) This one is confusing and dangerous. I use failure-timeout on most resources to wipe out temporary warn messages from crm_verify -LV - I use it for monitoring the cluster. All works well, but I found this:

1) A resource can't start on a node and migrates to the next one.
2) It can't start there either, nor on any other node.
3) It gives up and stops. There are many errors about all this in crm_verify -LV - and that is good.
4) failure-timeout comes and... wipes out all the errors.
5) We have a stopped resource and all the errors are wiped. And we don't know if it was stopped by the hands of an admin or because of errors.

I think failure-timeout should not be applied to a stopped resource. Any chance to avoid this?

--
Best regards,
Proskurin Kirill

#!/bin/sh
###
# Initialization:
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

if [ ! -z "$OCF_RESKEY_binfile" ]; then
    basename=`basename ${OCF_RESKEY_binfile} .pl`
    OCF_RESKEY_pidfile_default=/var/run/${basename}.pid
    OCF_RESKEY_logfile_default=/var/log/${basename}.log
fi
OCF_RESKEY_external_pidfile_default=0
OCF_RESKEY_core_dump_default=0

: ${OCF_RESKEY_pidfile=$OCF_RESKEY_pidfile_default}
: ${OCF_RESKEY_logfile=$OCF_RESKEY_logfile_default}
: ${OCF_RESKEY_external_pidfile=$OCF_RESKEY_external_pidfile_default}
: ${OCF_RESKEY_core_dump=$OCF_RESKEY_core_dump_default}
###

generic_usage() {
    cat <<END
usage: $0 {start|stop|monitor|validate-all|meta-data}

Expects to have a fully populated OCF RA-compliant environment set.
END
}

generic_meta() {
    cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="generic">
<version>1.0</version>

<longdesc lang="en">
Resource agent for any script
</longdesc>
<shortdesc lang="en">Resource agent for any script</shortdesc>

<parameters>

<parameter name="binfile" required="1">
<longdesc lang="en">
The full name of the binary to be executed.
</longdesc>
<shortdesc lang="en">Full path name of the binary to be executed</shortdesc>
<content type="string" />
</parameter>

<parameter name="options" required="0">
<longdesc lang="en">
Command line options to pass to the binary
</longdesc>
<shortdesc lang="en">Command line options</shortdesc>
<content type="string" />
</parameter>

<parameter name="pidfile">
<longdesc lang="en">
Path to pidfile. Default is: /var/run/\${basename}.pid
</longdesc>
<shortdesc lang="en">Path to pidfile</shortdesc>
<content type="string" default="${OCF_RESKEY_pidfile_default}"/>
</parameter>

<parameter name="logfile">
<longdesc lang="en">
Path to logfile. Default is: /var/log/\${basename}.log
</longdesc>
<shortdesc lang="en">Path to logfile</shortdesc>
<content type="string" default="${OCF_RESKEY_logfile_default}"/>
</parameter>

<parameter name="external_pidfile">
<longdesc lang="en">
Write pidfile by ocf-agent, not running script.
</longdesc>
<shortdesc lang="en">Who writes pidfile</shortdesc>
<content type="boolean" default
Re: [Pacemaker] Cluster type is: corosync
On 01.08.2011 5:42, Andrew Beekhof wrote:
Finally, tell Corosync to load the Pacemaker plugin.

As I said before: I run pacemakerd after corosync starts.

Anyway, the problem is solved for me.

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] Cluster type is: corosync
On 02.08.2011 1:00, Andrew Beekhof wrote:
On Mon, Aug 1, 2011 at 10:23 PM, Proskurin Kirill <k.prosku...@corp.mail.ru> wrote:
On 01.08.2011 5:42, Andrew Beekhof wrote:
Finally, tell Corosync to load the Pacemaker plugin.

As I said before: I run pacemakerd after corosync starts.

The two are not mutually exclusive. You need the plugin AND pacemakerd.

I have service.d/pcmk just like in the example.

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] Cluster type is: corosync
On 27.07.2011 6:41, Andrew Beekhof wrote:
Ok. And did you add the pacemaker configuration options to corosync's config file?

I attach our corosync.conf. It is the same on all nodes except the IP address.

You missed a step from:
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/s-configure-corosync.html

Which one? In a previous conversation Steven Dake said that I can use an exact IP address if I wish (and I do, because some nodes may have more than one IP address on the same network). And I run pacemakerd after corosync starts.

I can't say for sure, but it seems I fixed it by setting compatibility: none. After this it started to report Cluster type: 'openais'.

Pacemaker is blank now - no configuration at all.

Online nodes:

[root@mysender1 ~]# crm configure show
node mysender1.example.com
node mysender2.example.com
node mysender3.example.com
node mysender4.example.com
node mysender5.example.com
node mysender6.example.com
node mysender7.example.com
property $id="cib-bootstrap-options" \
    dc-version="1.1.5-3-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="6"

Offline nodes (Cluster type is: corosync):

[root@mysender2 ~]# crm configure show
[root@mysender2 ~]#

pacemaker-1.1.5
corosync-1.4.0
cluster-glue-1.0.6
openais-1.1.2

All nodes have the same rpms.

On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill <k.prosku...@corp.example.com> wrote:
Hello again! I hope I'm not flooding too much here, but I have another problem.

I installed the same rpms of corosync, openais, pacemaker, and cluster-glue on all nodes. I checked it twice. When I start some of them, they can't connect to the cluster and stay offline. In the logs I see that they see the other nodes, and connectivity is OK. But I found this difference:

Online nodes in the cluster have:

[root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 20:38:58 mysender39.example.com stonith-ng: [3499]: info: get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.example.com attrd: [3502]: info: get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.example.com cib: [3500]: info: get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:59 mysender39.example.com crmd: [3504]: info: get_cluster_type: Cluster type is: 'openais'.

Offline ones have:

[root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 13:39:17 mysender2.example.com stonith-ng: [9028]: info: get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.example.com attrd: [9031]: info: get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.example.com cib: [9029]: info: get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:18 mysender2.example.com crmd: [9033]: info: get_cluster_type: Cluster type is: 'corosync'.

What's wrong and how can I fix it?

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] Upgrading from 1.0 to 1.1
On 27.07.2011 5:56, Andrew Beekhof wrote:
On Tue, Jul 19, 2011 at 5:40 PM, Proskurin Kirill <k.prosku...@corp.mail.ru> wrote:
On 07/19/2011 03:22 AM, Andrew Beekhof wrote:
On Fri, Jul 15, 2011 at 10:33 PM, Proskurin Kirill <k.prosku...@corp.mail.ru> wrote:
Hello all.

I found that I was running corosync with pacemaker ver: 0 while having pacemaker 1.1.5 installed - i.e. without starting pacemakerd.

Sounds wrong. :-)

So I tried to upgrade. I shut down one node, changed 0 to 1 in service.d/pcmk, started corosync, and then started pacemakerd via the init script. But this node stays online, and on the cluster's DC I see:

cib: [18392]: WARN: cib_peer_callback: Discarding cib_sync_one message (255) from mysender10.example.com: not in our membership

That's odd. The only thing you changed was ver: 0 to ver: 1?

Yes, only this. To make it more clear: I have 4 nodes with ver 0 and tried to add one with ver 1, and got this. Well, I shut down all the nodes, changed them all to 1, started them up, and all was OK. Not a really good way to upgrade, but I don't have time.

Do you still have the logs for the failure case? I'd really like to see them.

No, I don't. But some time ago I got the same error in the vice-versa situation - when I tried to add a node with ver: 0 to a cluster where all nodes are ver: 1. Anyway, my cluster is down now, so I can do some tests. I will send logs to the mailing list if I reproduce this situation again.

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] Cluster type is: corosync
On 07/26/2011 11:00 AM, Andrew Beekhof wrote:
On Mon, Jul 25, 2011 at 7:18 PM, Proskurin Kirill <k.prosku...@corp.example.com> wrote:
On 25.07.2011 10:10, Andrew Beekhof wrote:
Which packages are you using?

They are built by me from your official source repository.

Ok. And did you add the pacemaker configuration options to corosync's config file?

I attach our corosync.conf. It is the same on all nodes except the IP address.

Pacemaker is blank now - no configuration at all.

Online nodes:

[root@mysender1 ~]# crm configure show
node mysender1.example.com
node mysender2.example.com
node mysender3.example.com
node mysender4.example.com
node mysender5.example.com
node mysender6.example.com
node mysender7.example.com
property $id="cib-bootstrap-options" \
    dc-version="1.1.5-3-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="6"

Offline nodes (Cluster type is: corosync):

[root@mysender2 ~]# crm configure show
[root@mysender2 ~]#

pacemaker-1.1.5
corosync-1.4.0
cluster-glue-1.0.6
openais-1.1.2

All nodes have the same rpms.

On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill <k.prosku...@corp.example.com> wrote:
Hello again! I hope I'm not flooding too much here, but I have another problem.

I installed the same rpms of corosync, openais, pacemaker, and cluster-glue on all nodes. I checked it twice. When I start some of them, they can't connect to the cluster and stay offline. In the logs I see that they see the other nodes, and connectivity is OK. But I found this difference:

Online nodes in the cluster have:

[root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 20:38:58 mysender39.example.com stonith-ng: [3499]: info: get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.example.com attrd: [3502]: info: get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.example.com cib: [3500]: info: get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:59 mysender39.example.com crmd: [3504]: info: get_cluster_type: Cluster type is: 'openais'.

Offline ones have:

[root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 13:39:17 mysender2.example.com stonith-ng: [9028]: info: get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.example.com attrd: [9031]: info: get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.example.com cib: [9029]: info: get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:18 mysender2.example.com crmd: [9033]: info: get_cluster_type: Cluster type is: 'corosync'.

What's wrong and how can I fix it?

--
Best regards,
Proskurin Kirill

totem {
    version: 2
    token: 2500
    token_retransmits_before_loss_const: 10
    join: 100
    consensus: 3000
    vsftype: none
    max_messages: 20
    send_join: 45
    secauth: off
    fail_recv_const: 5000

    interface {
        ringnumber: 0
        bindnetaddr: 10.6.1.155
        mcastaddr: 239.255.1.1
        mcastport: 5405
        ttl: 31
    }
}

logging {
    fileline: off
    to_syslog: no
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/corosync.log
    debug: off
    timestamp: on
}

amf {
    mode: disabled
}
Re: [Pacemaker] Cluster type is: corosync
On 25.07.2011 10:10, Andrew Beekhof wrote:
Which packages are you using?

They are built by me from your official source repository:

pacemaker-1.1.5
corosync-1.4.0
cluster-glue-1.0.6
openais-1.1.2

All nodes have the same rpms.

On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill <k.prosku...@corp.mail.ru> wrote:
Hello again! I hope I'm not flooding too much here, but I have another problem.

I installed the same rpms of corosync, openais, pacemaker, and cluster-glue on all nodes. I checked it twice. When I start some of them, they can't connect to the cluster and stay offline. In the logs I see that they see the other nodes, and connectivity is OK. But I found this difference:

Online nodes in the cluster have:

[root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 20:38:58 mysender39.mail.ru stonith-ng: [3499]: info: get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.mail.ru attrd: [3502]: info: get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.mail.ru cib: [3500]: info: get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:59 mysender39.mail.ru crmd: [3504]: info: get_cluster_type: Cluster type is: 'openais'.

Offline ones have:

[root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 13:39:17 mysender2.mail.ru stonith-ng: [9028]: info: get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.mail.ru attrd: [9031]: info: get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.mail.ru cib: [9029]: info: get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:18 mysender2.mail.ru crmd: [9033]: info: get_cluster_type: Cluster type is: 'corosync'.

What's wrong and how can I fix it?

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] Cluster type is: corosync
Hello.

I updated openais to the latest 1.1.4, but this does not help at all. Google knows nothing about it. I am running out of ideas.

On 25.07.2011 13:18, Proskurin Kirill wrote:
On 25.07.2011 10:10, Andrew Beekhof wrote:
Which packages are you using?

They are built by me from your official source repository:

pacemaker-1.1.5
corosync-1.4.0
cluster-glue-1.0.6
openais-1.1.2

All nodes have the same rpms.

On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill <k.prosku...@corp.mail.ru> wrote:
Hello again! I hope I'm not flooding too much here, but I have another problem.

I installed the same rpms of corosync, openais, pacemaker, and cluster-glue on all nodes. I checked it twice. When I start some of them, they can't connect to the cluster and stay offline. In the logs I see that they see the other nodes, and connectivity is OK. But I found this difference:

Online nodes in the cluster have:

[root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 20:38:58 mysender39.mail.ru stonith-ng: [3499]: info: get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.mail.ru attrd: [3502]: info: get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.mail.ru cib: [3500]: info: get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:59 mysender39.mail.ru crmd: [3504]: info: get_cluster_type: Cluster type is: 'openais'.

Offline ones have:

[root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 13:39:17 mysender2.mail.ru stonith-ng: [9028]: info: get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.mail.ru attrd: [9031]: info: get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.mail.ru cib: [9029]: info: get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:18 mysender2.mail.ru crmd: [9033]: info: get_cluster_type: Cluster type is: 'corosync'.

What's wrong and how can I fix it?

--
Best regards,
Proskurin Kirill
[Pacemaker] Sending message via cpg FAILED: (rc=12) Doesn't exist
Hello all.

pacemaker-1.1.5
corosync-1.4.0

4 nodes in the cluster: 3 online, 1 not. In the logs:

Jul 22 11:50:23 my106.example.com crmd: [28030]: info: pcmk_quorum_notification: Membership 0: quorum retained (0)
Jul 22 11:50:23 my106.example.com crmd: [28030]: info: do_started: Delaying start, no membership data (0010)
Jul 22 11:50:23 my106.example.com crmd: [28030]: info: config_query_callback: Shutdown escalation occurs after: 120ms
Jul 22 11:50:23 my106.example.com crmd: [28030]: info: config_query_callback: Checking for expired actions every 90ms
Jul 22 11:50:23 my106.example.com crmd: [28030]: info: do_started: Delaying start, no membership data (0010)
Jul 22 11:50:27 my106.example.com attrd: [28028]: info: cib_connect: Connected to the CIB after 1 signon attempts
Jul 22 11:50:27 my106.example.com attrd: [28028]: info: cib_connect: Sending full refresh
Jul 22 11:52:18 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 22 11:52:18 corosync [CPG   ] chosen downlist: sender r(0) ip(10.3.1.107) ; members(old:4 left:1)
Jul 22 11:52:18 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR: send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist
Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR: send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist
Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR: send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist

On the DC:

Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
Jul 22 11:50:07 my107.example.com pacemakerd: [22388]: info: update_node_processes: Node my106.example.com now has process list: 0002 (was 0012)
Jul 22 11:50:07 my107.example.com attrd: [22397]: info: crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=0002 (new)
Jul 22 11:50:07 my107.example.com cib: [22395]: info: crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=0002 (new)
Jul 22 11:50:07 my107.example.com stonith-ng: [22394]: info: crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=0002 (new)
Jul 22 11:50:07 my107.example.com crmd: [22399]: info: crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=0002 (new)
Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee

Is this a problem?

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] Sending message via cpg FAILED: (rc=12) Doesn't exist
On 22.07.2011 20:30, Steven Dake wrote:
On 07/22/2011 01:15 AM, Proskurin Kirill wrote:
Hello all.

pacemaker-1.1.5
corosync-1.4.0

11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee

Is this a problem?

Does your retransmit list continually display e4 e5 etc for the rest of the cluster lifetime, or is this short lived?

Yes, it continually displays this.
Re: [Pacemaker] Upgrading from 1.0 to 1.1
On 07/19/2011 03:22 AM, Andrew Beekhof wrote:
On Fri, Jul 15, 2011 at 10:33 PM, Proskurin Kirill <k.prosku...@corp.mail.ru> wrote:
Hello all.

I found that I was running corosync with pacemaker ver: 0 while having pacemaker 1.1.5 installed - i.e. without starting pacemakerd.

Sounds wrong. :-)

So I tried to upgrade. I shut down one node, changed 0 to 1 in service.d/pcmk, started corosync, and then started pacemakerd via the init script. But this node stays online, and on the cluster's DC I see:

cib: [18392]: WARN: cib_peer_callback: Discarding cib_sync_one message (255) from mysender10.example.com: not in our membership

That's odd. The only thing you changed was ver: 0 to ver: 1?

Yes, only this. To make it more clear: I have 4 nodes with ver 0 and tried to add one with ver 1, and got this. Well, I shut down all the nodes, changed them all to 1, started them up, and all was OK. Not a really good way to upgrade, but I don't have time.

--
Best regards,
Proskurin Kirill
[Pacemaker] Upgrading from 1.0 to 1.1
Hello all.

I found that I was running corosync with pacemaker ver: 0 while having pacemaker 1.1.5 installed - i.e. without starting pacemakerd. Sounds wrong. :-)

So I tried to upgrade. I shut down one node, changed 0 to 1 in service.d/pcmk, started corosync, and then started pacemakerd via the init script. But this node stays online, and on the cluster's DC I see:

cib: [18392]: WARN: cib_peer_callback: Discarding cib_sync_one message (255) from mysender10.example.com: not in our membership

Is there a way to upgrade all nodes one by one without shutting down the whole cluster?

--
Best regards,
Proskurin Kirill
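For clarity, the service.d/pcmk file mentioned above is the Pacemaker plugin declaration from the Clusters from Scratch guide. A sketch of it with the version already bumped; the exact path follows the standard layout and is an assumption here:

    # /etc/corosync/service.d/pcmk
    service {
        name: pacemaker
        ver: 1    # 0 = corosync spawns the pacemaker daemons; 1 = start pacemakerd separately
    }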
Re: [Pacemaker] Timeout, interval onfail questions
On 07/10/2011 02:53 PM, Lars Marowsky-Bree wrote:
2) I wish my resources would *never* go to a failed status. I found the on-fail=restart option, but it does not seem to work as I expected. So, for example, if some node is under high LA and monitoring of a resource fails, Pacemaker will try to run the stop action, but because of the high LA it will time out too, and Pacemaker decides the resource is unmanaged. How can I tune this behaviour? I wish Pacemaker would not give up and would try again.

Repeating the same thing over and over again and expecting the result to change is one of the clinical tests for irrational and insane behaviour. So pacemaker doesn't do that. ;-) stop isn't supposed to fail, we don't support retrying it, and will not. :-)

Well, this is not quite true, because the environment can change - e.g. the LA can start to go down. Well, I think I will use some cron job for this.

Fix it so that it doesn't fail; if it fails due to a too short timeout, make the timeout longer.

The sad thing is that this host has huge LA from time to time, and we can't fix that in the near future. The timeout does not really help here (3m by now)... well, I haven't really tried to make it 10m or so.

--
Best regards,
Proskurin Kirill
[Pacemaker] Timeout, interval onfail questions
Hello all!

I am trying to understand the whole logic of Pacemaker and have some questions.

1) There is an interval and a timeout for monitoring a resource. Situation: the interval is 20s, the timeout is 60s. A monitor action is started, but the node is under load and it takes more than 20 seconds to get the result - will a second monitor action start, or does Pacemaker understand that it already has one running?

2) I wish my resources would *never* go to a failed status. I found the on-fail=restart option, but it does not seem to work as I expected. So, for example, if some node is under high LA and monitoring of a resource fails, Pacemaker will try to run the stop action, but because of the high LA it will time out too, and Pacemaker decides the resource is unmanaged. How can I tune this behaviour? I wish Pacemaker would not give up and would try again.

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] SNMP monitoring
On 07/05/2011 12:05 PM, Raoul Bhatia [IPAX] wrote:
Proskurin, if you get snmp working, would you kindly post your configuration to the mailinglist? the snmp-topic has popped up several times and it would be nice if we got a working config in the mailinglist archive - or better: in the wiki - as a reference.

Ok, I got it working. You need:

snmptrapd
pacemaker with snmp support

snmptrapd.conf:

disableAuthorization yes
traphandle SNMPv2-SMI::enterprises.32723.1.1 /tmp/trap.sh
traphandle SNMPv2-SMI::enterprises.32723.1.2 /tmp/trap.sh
traphandle SNMPv2-SMI::enterprises.32723.1.3 /tmp/trap.sh
traphandle SNMPv2-SMI::enterprises.32723.1.4 /tmp/trap.sh
traphandle SNMPv2-SMI::enterprises.32723.1.5 /tmp/trap.sh
traphandle SNMPv2-SMI::enterprises.32723.1.6 /tmp/trap.sh
traphandle SNMPv2-SMI::enterprises.32723.1.7 /tmp/trap.sh

/tmp/trap.sh is any sh script to parse the result. For example:

#!/bin/sh
read host
read ip
while read oid val
do
    echo -e "$host $ip == $oid == $val\n" >> /tmp/trap.out
done

Use crm_mon --daemonize -S snmptrapd-ip-addr to send the traps. Or you can use your monitoring system and send the traps directly to it.

P.S. This works for me on CentOS 5.x with pacemaker 1.1.5 and snmp-5.3.2.

--
Best regards,
Proskurin Kirill
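A hedged usage sketch for wiring the pieces above together: make the handler executable, run snmptrapd in the foreground with that config, and point crm_mon at it. The config path and the local IP are assumptions:

    chmod +x /tmp/trap.sh
    snmptrapd -f -Lo -c /etc/snmp/snmptrapd.conf   # -f: stay in foreground, -Lo: log to stdout
    crm_mon --daemonize -S 127.0.0.1               # send cluster traps to the local trap daemon
    tail -f /tmp/trap.out                          # watch the decoded traps arrive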
[Pacemaker] SNMP monitoring
Hello all.

I am trying to figure out how to monitor the cluster via SNMP. I understand that I need to use crm_mon -S snmptrapd-ip, but I am kind of new to SNMP and still can't understand how to get it working. Could someone write a simple example, like an snmptrapd config? Or maybe a more detailed one to put into the Pacemaker docs (the SNMP chapter is empty there)?

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] Not connected to AIS
On 06/27/2011 09:15 AM, Andrew Beekhof wrote:
On Fri, Jun 24, 2011 at 6:56 PM, Proskurin Kirill <k.prosku...@corp.mail.ru> wrote:
Hello.

I have a strange problem. One node in the cluster is not working right. In the logs:

Jun 23 20:25:25 mysender39.example.com lrmd: [10371]: WARN: For LSB init script, no additional parameters are needed.
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output: (onlineconf.init:3:stop:stdout) Stopping onlineconf_updater:
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output: (onlineconf.init:3:stop:stdout) [
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output: (onlineconf.init:3:stop:stdout) OK
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output: (onlineconf.init:3:stop:stdout) ]
Jun 23 20:25:25 mysender39.example.com crmd: [30682]: info: process_lrm_event: LRM operation onlineconf.init:3_stop_0 (call=181, rc=0, cib-update=683339, confirmed=true) ok
Jun 23 20:25:25 mysender39.example.com cib: [30678]: ERROR: send_ais_message: Not connected to AIS

And then many errors, with this string repeated over and over.

Not enough information. Please include a crm_report for the time between 20:20:00 and 20:30:00 on June 23.

I attached the logs (report.tar.bz2) to this mail. I hope it helps.

--
Best regards,
Proskurin Kirill
[Pacemaker] Not connected to AIS
Hello.

I have a strange problem. One node in the cluster is not working right. In the logs:

Jun 23 20:25:25 mysender39.example.com lrmd: [10371]: WARN: For LSB init script, no additional parameters are needed.
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output: (onlineconf.init:3:stop:stdout) Stopping onlineconf_updater:
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output: (onlineconf.init:3:stop:stdout) [
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output: (onlineconf.init:3:stop:stdout) OK
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output: (onlineconf.init:3:stop:stdout) ]
Jun 23 20:25:25 mysender39.example.com crmd: [30682]: info: process_lrm_event: LRM operation onlineconf.init:3_stop_0 (call=181, rc=0, cib-update=683339, confirmed=true) ok
Jun 23 20:25:25 mysender39.example.com cib: [30678]: ERROR: send_ais_message: Not connected to AIS

And then many errors, with this string repeated over and over. But in crm_mon all seems quiet:

Last updated: Fri Jun 24 12:35:05 2011
Stack: openais
Current DC: mysender6.example.com - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
4 Nodes configured, 4 expected votes
7 Resources configured.

Online: [ mysender6.example.com mysender31.example.com mysender38.example.com mysender39.example.com ]

And the clone resource on this node is unmanaged:

onlineconf.init:3 (lsb:onlineconf): Started mysender39.example.com (unmanaged) FAILED

Failed actions:
    onlineconf.init:3_monitor_5000 (node=mysender39.example.com, call=180, rc=7, status=complete): not running
    onlineconf.init:3_stop_0 (node=mysender39.example.com, call=-1, rc=1, status=Timed Out): unknown error

In the logs:

Jun 24 12:43:15 mysender39.example.com attrd: [30680]: WARN: attrd_cib_callback: Update 333725 for fail-count-onlineconf.init:2=(null) failed: Remote node did not respond

But if I run it by hand, it answers immediately:

# /etc/init.d/onlineconf status
onlineconf_updater is stopped

I did /etc/init.d/corosync restart and waited for 5 minutes, but it was still "Waiting for corosync services to unload", so I killed it with -9 and restarted. And everything started normally again.

What was wrong?

Corosync-1.2.7
Pacemaker-1.0.11

--
Best regards,
Proskurin Kirill
[Pacemaker] Resource monitor stop working
Hello all.

Another problem. I just found out that one of my clone resources is not working and Pacemaker does not see this - it says that all clones are started. If I run status from the console, all is OK. I still can't understand how to fix it. I attached a log from the DC with really strange problems.

My config:

node mysender31.example.com
node mysender38.example.com
node mysender39.example.com
node mysender6.example.com
primitive ClusterIP ocf:heartbeat:IPaddr2 \
    params ip="10.6.1.214" cidr_netmask="32" nic="eth0:0" \
    op monitor interval="15" timeout="30" on-fail="restart"
primitive cleardb_delete_history_old.init lsb:cleardb_delete_history_old \
    op monitor interval="15" timeout="30" on-fail="restart" \
    meta target-role="Started"
primitive gettopupdated.init lsb:gettopupdate-my \
    op monitor interval="15" timeout="30" on-fail="restart"
primitive onlineconf.init lsb:onlineconf \
    op monitor interval="15"
primitive qm_manager.init lsb:qm_manager \
    op monitor interval="15" timeout="30" on-fail="restart" \
    meta target-role="Started"
primitive qm_master.init lsb:qm_master \
    op monitor interval="15" timeout="30" on-fail="restart"
primitive silverbox-stat.1.init lsb:silverbox-stat.1 \
    op monitor interval="15" timeout="30" on-fail="restart" \
    meta target-role="Started"
clone gettopupdated.clone gettopupdated.init
clone onlineconf.clone onlineconf.init
clone qm_master.clone qm_master.init \
    meta clone-max="2"
location CLEARDB_RUNS_ONLY_ON_MS6 cleardb_delete_history_old.init \
    rule $id="CLEARDB_RUNS_ONLY_ON_MS6-rule" -inf: #uname ne mysender6.example.com
location QM-PREFER-MS39 qm_manager.init 100: mysender39.example.com
location QM_MASTER_DENY_MS38 qm_master.clone -inf: mysender38.example.com
location QM_MASTER_DENY_MS39 qm_master.clone -inf: mysender39.example.com
location SILVERBOX-STAT_RUNS_ONLY_ON_MS38 silverbox-stat.1.init \
    rule $id="SILVERBOX-STAT_RUNS_ONLY_ON_MS38-rule" -inf: #uname ne mysender38.example.com
colocation QM-IP inf: ClusterIP qm_manager.init
order IP-Before-Qm inf: ClusterIP qm_manager.init
property $id="cib-bootstrap-options" \
    dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="4" \
    stonith-enabled="false" \
    no-quorum-policy="ignore" \
    last-lrm-refresh="1308909119"

--
Best regards,
Proskurin Kirill

Jun 24 11:27:40 mysender6.example.com pengine: [23744]: info: determine_online_status: Node mysender38.example.com is online
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: info: determine_online_status: Node mysender31.example.com is online
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: info: determine_online_status: Node mysender39.example.com is online
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: info: determine_online_status: Node mysender6.example.com is online
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: WARN: unpack_rsc_op: Processing failed op onlineconf.init:2_monitor_5000 on mysender38.example.com: not running (7)
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation ClusterIP_monitor_0 found resource ClusterIP active on mysender38.mail.ru
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation gettopupdated.init:3_monitor_0 found resource gettopupdated.init:3 active on mysender38.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation silverbox-stat.1.init_monitor_0 found resource silverbox-stat.1.init active on mysender38.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation qm_master.init:0_monitor_0 found resource qm_master.init:0 active on mysender38.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation cleardb_delete_history_old.init_monitor_0 found resource cleardb_delete_history_old.init active on mysender38.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation qm_master.init:1_monitor_0 found resource qm_master.init:1 active on mysender31.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation cleardb_delete_history_old.init_monitor_0 found resource cleardb_delete_history_old.init active on mysender31.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation onlineconf.init:1_monitor_0 found resource onlineconf.init:1 active on mysender31.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: WARN: unpack_rsc_op: Processing failed op onlineconf.init:1_monitor_5000 on mysender31.example.com: not running (7)
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation qm_manager.init_monitor_0 found resource qm_manager.init active on mysender39.example.com
Jun 24 11:27:40 mysender6.example.com
[Pacemaker] Deleted nodes returns
Hello all.

I have a strange problem. At the beginning of my cluster's life there were nodes called mysender38.i and mysender39.i. Then I:

1) Stopped them
2) Deleted everything from /var/lib/heartbeat/crm/*
3) Ran:

crm_node --force --remove NODENAME
cibadmin --delete --obj_type nodes --crm_xml '<node uname="NODENAME"/>'
cibadmin --delete --obj_type status --crm_xml '<node_state uname="NODENAME"/>'

4) Changed their hostnames
5) Started them

And they were gone, and the new ones were running. But *any time* I make changes in the cluster configuration I get this:

OFFLINE: [ mysender39.i mysender38.i ]

And I need to run crm_node --force --remove and so on again to make them disappear. Is this a bug, or am I doing something wrong?

pacemaker-1.0.11
corosync-1.2.7

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] Deleted nodes returns
On 06/22/2011 03:41 PM, Florian Haas wrote:
On 2011-06-22 12:41, Proskurin Kirill wrote:
Hello all.

I have a strange problem. At the beginning of my cluster's life there were nodes called mysender38.i and mysender39.i. Then I:

1) Stopped them
2) Deleted everything from /var/lib/heartbeat/crm/*
3) Ran:

crm_node --force --remove NODENAME
cibadmin --delete --obj_type nodes --crm_xml '<node uname="NODENAME"/>'
cibadmin --delete --obj_type status --crm_xml '<node_state uname="NODENAME"/>'

4) Changed their hostnames
5) Started them

And they were gone, and the new ones were running. But *any time* I make changes in the cluster configuration I get this:

OFFLINE: [ mysender39.i mysender38.i ]

And I need to run crm_node --force --remove and so on again to make them disappear. Is this a bug, or am I doing something wrong?

Why do you do things the hard way rather than simply running crm node delete <node>?

Well, I was following the docs, but I tried that too, and it did not help at all.

--
Best regards,
Proskurin Kirill
[Pacemaker] Hostname issues
Hello all.

I have 4 nodes, all of them with two NICs in two networks. All of them have 2 DNS names - one for the internal network and one for the external. These hosts *must* have the hostname from the external network (for other software to work). Corosync must work on the internal NIC, but it asks uname -n for the node name and gets the external name. How do I avoid this? I can't change the hostname to the internal one, and I can't run corosync on the external network.

--
Best regards,
Proskurin Kirill
[Pacemaker] Groups
Hello all!

I'm new to Pacemaker and have a small question. I want my resource to run on all nodes except some. For example, we have 10 nodes: node1-10. I want it running on node1-5 but not on node6-10. I can make 5 location constraints with -INFINITY: node6, -INFINITY: node7, and so on, but that is not the way I want to do this. Is it possible to make some kind of group (not a Pacemaker group) of nodes and resources and just add -INFINITY: groupname? Or maybe there is an option to list them in a row, like -INFINITY: node6, node7, node8? Or is there another way that I missed?

--
Best regards,
Proskurin Kirill
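One common approach - and only a hedged sketch here, since it is not from this thread - is to tag the nodes with a custom attribute and keep the resource away with a single rule-based location constraint, in the same style as the rule constraints shown elsewhere in this archive. The attribute name pool, its value, and the resource name myresource are all made up:

    node node6 attributes pool=excluded
    node node7 attributes pool=excluded
    node node8 attributes pool=excluded
    node node9 attributes pool=excluded
    node node10 attributes pool=excluded
    location deny-excluded-pool myresource \
        rule -inf: pool eq excluded

Adding a node to the "group" is then just a matter of setting the attribute; the constraint itself never changes.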
[Pacemaker] FS mount error
]: notice: clone_print: Master/Slave Set: WebData
Jul 22 08:18:43 node01 pengine: [1813]: notice: short_print: Masters: [ node02.domain.org ]
Jul 22 08:18:43 node01 pengine: [1813]: notice: short_print: Slaves: [ node01.domain.org ]
Jul 22 08:18:43 node01 pengine: [1813]: notice: native_print: WebFS#011(ocf::heartbeat:Filesystem):#011Stopped
Jul 22 08:18:43 node01 pengine: [1813]: info: get_failcount: WebFS has failed 100 times on node01.domain.org
Jul 22 08:18:43 node01 pengine: [1813]: WARN: common_apply_stickiness: Forcing WebFS away from node01.domain.org after 100 failures (max=100)
Jul 22 08:18:43 node01 pengine: [1813]: info: native_merge_weights: WebData: Rolling back scores from WebFS
Jul 22 08:18:43 node01 pengine: [1813]: info: native_merge_weights: wwwdrbd:0: Rolling back scores from WebFS
Jul 22 08:18:43 node01 pengine: [1813]: info: native_merge_weights: WebData: Rolling back scores from WebFS
Jul 22 08:18:43 node01 pengine: [1813]: info: master_color: Promoting wwwdrbd:0 (Master node02.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: info: master_color: WebData: Promoted 1 instances of a possible 1 to master
Jul 22 08:18:43 node01 pengine: [1813]: info: master_color: Promoting wwwdrbd:0 (Master node02.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: info: master_color: WebData: Promoted 1 instances of a possible 1 to master
Jul 22 08:18:43 node01 pengine: [1813]: notice: RecurringOp: Start recurring monitor (60s) for WebSite on node02.domain.org
Jul 22 08:18:43 node01 pengine: [1813]: notice: LogActions: Leave resource ClusterIP#011(Started node02.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: notice: LogActions: Start WebSite#011(node02.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: notice: LogActions: Leave resource wwwdrbd:0#011(Master node02.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: notice: LogActions: Leave resource wwwdrbd:1#011(Slave node01.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: notice: LogActions: Start WebFS#011(node02.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: info: process_pe_message: Transition 199: PEngine Input stored in: /var/lib/pengine/pe-input-243.bz2
Jul 22 08:18:44 node01 crmd: [1814]: ERROR: stonithd_signon: Can't initiate connection to stonithd
Jul 22 08:18:44 node01 crmd: [1814]: notice: Not currently connected.
Jul 22 08:18:44 node01 crmd: [1814]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
Jul 22 08:18:44 node01 crmd: [1814]: info: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jul 22 08:18:44 node01 crmd: [1814]: info: unpack_graph: Unpacked transition 199: 4 actions in 4 synapses
Jul 22 08:18:44 node01 crmd: [1814]: info: do_te_invoke: Processing graph 199 (ref=pe_calc-dc-1279783123-729) derived from /var/lib/pengine/pe-input-243.bz2
Jul 22 08:18:44 node01 crmd: [1814]: info: te_rsc_command: Initiating action 42: start WebFS_start_0 on node02.domain.org
Jul 22 08:18:44 node01 crmd: [1814]: info: te_rsc_command: Initiating action 5: probe_complete probe_complete on node02.domain.org - no waiting
Jul 22 08:18:44 node01 crmd: [1814]: info: te_connect_stonith: Attempting connection to fencing daemon...
Jul 22 08:18:45 node01 crmd: [1814]: ERROR: stonithd_signon: Can't initiate connection to stonithd
Jul 22 08:18:45 node01 crmd: [1814]: notice: Not currently connected.
Jul 22 08:18:45 node01 crmd: [1814]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
Jul 22 08:18:45 node01 crmd: [1814]: info: te_connect_stonith: Attempting connection to fencing daemon...
Jul 22 08:18:46 node01 crmd: [1814]: ERROR: stonithd_signon: Can't initiate connection to stonithd
Jul 22 08:18:46 node01 crmd: [1814]: notice: Not currently connected.
Jul 22 08:18:46 node01 crmd: [1814]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
Jul 22 08:18:46 node01 crmd: [1814]: info: te_connect_stonith: Attempting connection to fencing daemon...
Jul 22 08:18:47 node01 crmd: [1814]: ERROR: stonithd_signon: Can't initiate connection to stonithd
Jul 22 08:18:47 node01 crmd: [1814]: notice: Not currently connected.
Jul 22 08:18:47 node01 crmd: [1814]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
Jul 22 08:18:47 node01 crmd: [1814]: info: te_connect_stonith: Attempting connection to fencing daemon...
-- Best regards, Proskurin Kirill
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
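[Editor's note] Two separate problems are visible in this log: WebFS has exhausted its failure limit (forced away after 100 failures, max=100), and crmd cannot sign on to stonithd, which stalls the transition engine. A sketch of the usual first steps, assuming the crm shell is available (the node name is taken from the log above):

    # Clear WebFS's failure history so the PE stops forcing it away
    crm resource cleanup WebFS
    crm resource failcount WebFS delete node01.domain.org
    # If no fencing devices are configured (and the risk is accepted),
    # disabling STONITH silences the stonithd sign-in errors
    crm configure property stonith-enabled=false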
Re: [Pacemaker] FS mount error
On 22/07/10 12:23, Michael Fung wrote: crm resource cleanup WebFS
That does not help:
node01:~# crm resource cleanup WebFS
Cleaning up WebFS on mail02.fxclub.org
Cleaning up WebFS on mail01.fxclub.org
Jul 22 09:33:24 node01 crm_resource: [3442]: info: Invoked: crm_resource -C -r WebFS -H node01.domain.org
Jul 22 09:33:25 node01 crmd: [1814]: ERROR: stonithd_signon: Can't initiate connection to stonithd
Jul 22 09:33:25 node01 crmd: [1814]: notice: Not currently connected.
Jul 22 09:33:25 node01 crmd: [1814]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
Jul 22 09:33:25 node01 crmd: [1814]: info: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jul 22 09:33:25 node01 crmd: [1814]: info: unpack_graph: Unpacked transition 647: 6 actions in 6 synapses
Jul 22 09:33:25 node01 crmd: [1814]: info: do_te_invoke: Processing graph 647 (ref=pe_calc-dc-1279787604-2520) derived from /var/lib/pengine/pe-input-691.bz2
Jul 22 09:33:25 node01 crmd: [1814]: info: te_rsc_command: Initiating action 2: stop WebFS_stop_0 on node02.domain.org
Jul 22 09:33:25 node01 crmd: [1814]: info: te_rsc_command: Initiating action 6: probe_complete probe_complete on node02.domain.org - no waiting
...
Jul 22 09:33:32 node01 crmd: [1814]: WARN: status_from_rc: Action 43 (WebFS_start_0) on node02.domain.org failed (target: 0 vs. rc: 1): Error
Jul 22 09:33:32 node01 crmd: [1814]: WARN: update_failcount: Updating failcount for WebFS on node02.domain.org after failed start: rc=1 (update=INFINITY, time=1279787612)
Jul 22 09:33:32 node01 crmd: [1814]: info: abort_transition_graph: match_graph_event:272 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=WebFS_start_0, magic=0:1;43:647:0:882b3ca6-0496-4e26-9137-0a10d6ce57e4, cib=0.144.897) : Event failed
Jul 22 09:33:32 node01 crmd: [1814]: info: update_abort_priority: Abort priority upgraded from 0 to 1
Jul 22 09:33:32 node01 crmd: [1814]: info: update_abort_priority: Abort action done superceeded by restart
Jul 22 09:33:32 node01 crmd: [1814]: info: match_graph_event: Action WebFS_start_0 (43) confirmed on node02.domain.org (rc=4)
Jul 22 09:33:32 node01 crmd: [1814]: info: run_graph:
Jul 22 09:33:32 node01 crmd: [1814]: notice: run_graph: Transition 647 (Complete=4, Pending=0, Fired=0, Skipped=2, Incomplete=0, Source=/var/lib/pengine/pe-input-691.bz2): Stopped
Jul 22 09:33:32 node01 crmd: [1814]: info: te_graph_trigger: Transition 647 is now complete
-- Best regards, Proskurin Kirill
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
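[Editor's sketch] Since the start action itself returns rc=1 every time, running the Filesystem agent by hand on node02, outside Pacemaker, will usually show the real mount error. A sketch using ocf-tester from the resource-agents package; the device/directory/fstype parameters below are placeholders, not values from this thread:

    # Drive the Filesystem RA directly; parameter values are placeholders
    ocf-tester -n WebFS \
        -o device=/dev/drbd0 \
        -o directory=/var/www \
        -o fstype=ext3 \
        /usr/lib/ocf/resource.d/heartbeat/Filesystem

Failing that, a plain mount of the same device on node02 with the same options will tell you whether the problem is in the filesystem or in the cluster layer.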
Re: [Pacemaker] Pacemaker see double node`s
On 14/07/10 16:48, Florian Haas wrote: I take it you switched cluster stacks, otherwise you wouldn't be seeing each node twice, once with the $id attribute and once without. Take a look at http://www.clusterlabs.org/wiki/Initial_Configuration#A_Special_Note_for_People_Switching_Cluster_Stacks
Thanks - it works like a charm.
-- Best regards, Proskurin Kirill
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker