Re: [Pacemaker] Business Standard India - Errors noticed in pacemaker
Thought about googling the error?

On Mon, Jul 25, 2011 at 9:17 PM, Gururaj B Patil gururaj.pa...@bsmail.in wrote:
> To support staff,
>
> We at Business Standard Ltd. use Pacemaker as the clustering application for one of our websites. Two servers are in clustering mode: one is the web server and the other is the MySQL database server. Pacemaker handles MySQL clustering at the block level.
>
> We have noticed the same type of notice and warning repeatedly in the server's messages file. The errors are as below.
>
> ---
> Messages like the ones below appear every 15 minutes:
> sidrbd0 pengine: [3143]: notice: get_failcount: Failcount for sipdu1 on sidrbd0 has expired (limit was 20s)
> sidrbd0 pengine: [3143]: ERROR: create_notification_boundaries: Creating boundaries for mysql-ms-drbd
> ---
>
> I have also registered for the Pacemaker mailing list.
>
> Regards,
> Gururaj Patil
> Systems Department
> Business Standard Ltd.
> H3/4, Paragon center, P.B.Marg, Worli
> Mumbai - 400013 India
> Ph.+91-22-24971924
>
> --
> *Disclaimer:* This communication/message is for the named addressees only. This transmission may contain information that is privileged, confidential, proprietary or legally privileged, and/or exempt from disclosure under applicable law. If you are not the intended recipient, please immediately notify the sender and destroy the material in its entirety, whether in electronic or hard copy format. You are hereby notified that any disclosure, copying, distribution, or use of the information contained herein (including any reliance thereon) is STRICTLY PROHIBITED. You must not, directly or indirectly, use, disclose, distribute, print or copy any part of this message.
> *WARNING:* This electronic mail and any attachments are believed to be free of any virus or other defect; the recipient must ensure that it is virus free, and no responsibility is accepted by Business Standard Limited and/or its employees as applicable for any loss or damage arising in any way from its use.
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Cluster type is: corosync
On Mon, Jul 25, 2011 at 7:18 PM, Proskurin Kirill k.prosku...@corp.mail.ru wrote:
> 25.07.2011 10:10, Andrew Beekhof wrote:
>> Which packages are you using?
>
> I built them from your official source repository.

Ok. And did you add the pacemaker configuration options to corosync's config file?

> pacemaker-1.1.5
> corosync-1.4.0
> cluster-glue-1.0.6
> openais-1.1.2
>
> All nodes have the same rpms.
>
>> On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill k.prosku...@corp.mail.ru wrote:
>>> Hello again!
>>> I hope I'm not flooding the list too much, but I have another problem.
>>>
>>> I installed the same rpms of corosync, openais, pacemaker and cluster_glue on all nodes. I checked it twice. But when I start some of the nodes, they can't connect to the cluster and stay offline. In the logs I can see that they see the other nodes and that connectivity is OK.
>>>
>>> But I found a difference. Online nodes in the cluster have:
>>>
>>> [root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
>>> Jul 22 20:38:58 mysender39.mail.ru stonith-ng: [3499]: info: get_cluster_type: Cluster type is: 'openais'.
>>> Jul 22 20:38:58 mysender39.mail.ru attrd: [3502]: info: get_cluster_type: Cluster type is: 'openais'.
>>> Jul 22 20:38:58 mysender39.mail.ru cib: [3500]: info: get_cluster_type: Cluster type is: 'openais'.
>>> Jul 22 20:38:59 mysender39.mail.ru crmd: [3504]: info: get_cluster_type: Cluster type is: 'openais'.
>>>
>>> Offline nodes have:
>>>
>>> [root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
>>> Jul 22 13:39:17 mysender2.mail.ru stonith-ng: [9028]: info: get_cluster_type: Cluster type is: 'corosync'.
>>> Jul 22 13:39:17 mysender2.mail.ru attrd: [9031]: info: get_cluster_type: Cluster type is: 'corosync'.
>>> Jul 22 13:39:17 mysender2.mail.ru cib: [9029]: info: get_cluster_type: Cluster type is: 'corosync'.
>>> Jul 22 13:39:18 mysender2.mail.ru crmd: [9033]: info: get_cluster_type: Cluster type is: 'corosync'.
>>>
>>> What's wrong and how can I fix it?
>>>
>>> --
>>> Best regards,
>>> Proskurin Kirill
Re: [Pacemaker] Cluster type is: corosync
On 07/26/2011 11:00 AM, Andrew Beekhof wrote:
> On Mon, Jul 25, 2011 at 7:18 PM, Proskurin Kirill k.prosku...@corp.example.com wrote:
>> 25.07.2011 10:10, Andrew Beekhof wrote:
>>> Which packages are you using?
>>
>> I built them from your official source repository.
>
> Ok. And did you add the pacemaker configuration options to corosync's config file?

I attach our corosync.conf. It is the same on all nodes except for the IP address.
Pacemaker is blank now - no configuration at all.

Online nodes:

[root@mysender1 ~]# crm configure show
node mysender1.example.com
node mysender2.example.com
node mysender3.example.com
node mysender4.example.com
node mysender5.example.com
node mysender6.example.com
node mysender7.example.com
property $id=cib-bootstrap-options \
    dc-version=1.1.5-3-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
    cluster-infrastructure=openais \
    expected-quorum-votes=6

Offline nodes (Cluster type is: corosync):

[root@mysender2 ~]# crm configure show
[root@mysender2 ~]#

>> pacemaker-1.1.5
>> corosync-1.4.0
>> cluster-glue-1.0.6
>> openais-1.1.2
>>
>> All nodes have the same rpms.
>>
>>> On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill k.prosku...@corp.example.com wrote:
>>>> Hello again!
>>>> I hope I'm not flooding the list too much, but I have another problem.
>>>>
>>>> I installed the same rpms of corosync, openais, pacemaker and cluster_glue on all nodes. I checked it twice. But when I start some of the nodes, they can't connect to the cluster and stay offline. In the logs I can see that they see the other nodes and that connectivity is OK.
>>>>
>>>> But I found a difference. Online nodes in the cluster have:
>>>>
>>>> [root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
>>>> Jul 22 20:38:58 mysender39.example.com stonith-ng: [3499]: info: get_cluster_type: Cluster type is: 'openais'.
>>>> Jul 22 20:38:58 mysender39.example.com attrd: [3502]: info: get_cluster_type: Cluster type is: 'openais'.
>>>> Jul 22 20:38:58 mysender39.example.com cib: [3500]: info: get_cluster_type: Cluster type is: 'openais'.
>>>> Jul 22 20:38:59 mysender39.example.com crmd: [3504]: info: get_cluster_type: Cluster type is: 'openais'.
>>>>
>>>> Offline nodes have:
>>>>
>>>> [root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
>>>> Jul 22 13:39:17 mysender2.example.com stonith-ng: [9028]: info: get_cluster_type: Cluster type is: 'corosync'.
>>>> Jul 22 13:39:17 mysender2.example.com attrd: [9031]: info: get_cluster_type: Cluster type is: 'corosync'.
>>>> Jul 22 13:39:17 mysender2.example.com cib: [9029]: info: get_cluster_type: Cluster type is: 'corosync'.
>>>> Jul 22 13:39:18 mysender2.example.com crmd: [9033]: info: get_cluster_type: Cluster type is: 'corosync'.
>>>>
>>>> What's wrong and how can I fix it?

--
Best regards,
Proskurin Kirill

Attached corosync.conf:

totem {
    version: 2
    token: 2500
    token_retransmits_before_loss_const: 10
    join: 100
    consensus: 3000
    vsftype: none
    max_messages: 20
    send_join: 45
    secauth: off
    fail_recv_const: 5000
    interface {
        ringnumber: 0
        bindnetaddr: 10.6.1.155
        mcastaddr: 239.255.1.1
        mcastport: 5405
        ttl: 31
    }
}
logging {
    fileline: off
    to_syslog: no
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/corosync.log
    debug: off
    timestamp: on
}
amf {
    mode: disabled
}
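Andrew's question can be checked mechanically. With the corosync 1.x plugin model, Pacemaker is normally enabled through a service stanza, either in corosync.conf itself or in a file under /etc/corosync/service.d/, and the corosync.conf attached above contains no such stanza. Whether that is the actual cause here is not confirmed in the thread, so the following is only a sketch; the function name has_pacemaker_service is hypothetical.

```shell
# Hypothetical helper, not part of corosync or Pacemaker.
# Reads a corosync-style config on stdin and succeeds (exit 0) if it
# contains a  service { ... name: pacemaker ... }  stanza.
has_pacemaker_service() {
    tr '\n' ' ' | grep -Eq 'service[[:space:]]*\{[^}]*name:[[:space:]]*pacemaker'
}

# The stanza commonly added for the plugin-based stack (corosync 1.x).
# ver: 0 asks corosync to start the Pacemaker daemons itself.
pcmk_stanza='service {
    name: pacemaker
    ver:  0
}'

if printf '%s\n' "$pcmk_stanza" | has_pacemaker_service; then
    echo "pacemaker service stanza present"
fi
```

On a real node the check would be run as `has_pacemaker_service < /etc/corosync/corosync.conf`, and again for any files under /etc/corosync/service.d/, on both an online and an offline node to compare.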
[Pacemaker] Business Standard India - Errors noticed in pacemaker
Dear Mr. Andrew Beekhof,

Yes, I did try googling but could not find proper information. In a few forums I noticed the message below:

"That's development logging, which was accidentally bumped to a higher log level."

We are searching again, but meanwhile can the Pacemaker team help with this?

Regards,
Gururaj Patil

From: pacemaker-requ...@oss.clusterlabs.org
To: pacemaker@oss.clusterlabs.org
Date: 07/26/2011 12:47 PM
Subject: Pacemaker Digest, Vol 44, Issue 50

Today's Topics:
1. Re: Business Standard India - Errors noticed in pacemaker (Andrew Beekhof)
2. Re: Cluster type is: corosync (Andrew Beekhof)
3. Please teach it about handling of the unmanaged resource in environment setting placement-strategy. (Yuusuke IIDA)

Message: 1
Date: Tue, 26 Jul 2011 16:02:30 +1000
From: Andrew Beekhof and...@beekhof.net
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Cc: M A Faruqui m.faru...@bsmail.in, Prafulla H Patil prafulla.pa...@bsmail.in, Bandana Roy bandana@bsmail.in
Subject: Re: [Pacemaker] Business Standard India - Errors noticed in pacemaker

> Thought about googling the error?

Message: 2
Date: Tue, 26 Jul 2011 17:00:56 +1000
From: Andrew Beekhof and...@beekhof.net
To: Pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Cluster type is: corosync
Re: [Pacemaker] Problem with colocation
Hello,

On 25.7.11 13:28, Yingliang Yang wrote:
>> <constraints>
>>   <rsc_colocation id="Sphinx_with_IP" rsc="Sphinx" score-attribute="INF" with-rsc="Sphinx_IP"/>
>> </constraints>
>
> There is a problem in your config. The score-attribute should be score and its value should be INFINITY.

Thanks, you were correct. The Cluster from Scratch manual uses the inf shorthand all the time, so I thought it would work. Should this kind of error pass the schema check anyway?

--
Taneli Leppä | CISSP, RHCE, ZCE, CMDEV
Crasman Co Ltd | tan...@crasman.fi
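To make the correction concrete, here is the constraint with the attribute renamed to score and the value spelled out as INFINITY. Two hedged remarks: the inf shorthand belongs, as far as I know, to the crm shell syntax rather than to the raw XML, and score-attribute is itself an attribute the schema accepts, which is probably why validation did not reject the original. The function name fixed_constraint is hypothetical.

```shell
# Prints the corrected constraint as it would appear in the CIB XML.
fixed_constraint() {
    cat <<'EOF'
<constraints>
  <rsc_colocation id="Sphinx_with_IP" rsc="Sphinx" score="INFINITY" with-rsc="Sphinx_IP"/>
</constraints>
EOF
}

fixed_constraint

# Roughly equivalent crm shell form (not run here; it needs a live cluster):
#   crm configure colocation Sphinx_with_IP inf: Sphinx Sphinx_IP
```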
[Pacemaker] Clone resource each instance start\Stop
Hi,

I have configured a multi-state (clone) resource for a floating IP (IP). It is running on all the configured nodes. I am trying to stop it with the crm_resource command:

crm_resource -r IP:0 -p target-role -v stopped

I am getting this error:

Error performing operation: The object/attribute does not exist.

Can anybody please help me? How can I stop a single instance with a command?

If I manually bring down a single instance on one node and then clean up that instance, it comes up again, i.e. it is restarted:

ifconfig eth0:1 down
crm_resource -C -r IP:0 -H NodeName

That works properly.

Cluster stack:
corosync-1.2
pacemaker-1.10
[Pacemaker] The version about pacemaker
Hi Andrew,

I'd like to know whether the latest version (c86cb93c5a57) of the Pacemaker code on the site is stable. If not, how about Pacemaker 1.1.5 (01e86afaaa6d)?

BTW, I'd like to know when version 1.1.6 will be released. Will it be in the near future?

Best Regards,
Yingliang Yang
Re: [Pacemaker] ping RA question
2011/7/22 Dan Urist wrote:
> I am in the process of trying to write an fping RA, based on the pacemaker ping RA. My impetus for this is that I would like the RA to return success as soon as any ping succeeds; the behavior of linux's system ping as used in the standard ping RA is to run COUNT pings within the given deadline and only after COUNT or the deadline return success if any of the pings succeeded -- very inefficient.
>
> My question is this: the ping RA sets default values for OCF_RESKEY_CRM_meta_timeout and OCF_RESKEY_CRM_meta_interval, and it tests that OCF_RESKEY_CRM_meta_interval is an integer greater than 0. These variables aren't used anywhere else within the RA, but these are the same values in the actions section of the metadata for the monitor timeout and interval. I can't find any documentation that these variables serve as defaults for the monitor action in either the OCF agent developer guide or the pacemaker docs, but this seems to be the intent. Is this what they're there for?

I think so. The values in the actions section of the metadata for the monitor timeout and interval are used as the monitor operation's default values. OCF_RESKEY_CRM_meta_timeout and OCF_RESKEY_CRM_meta_interval come from the monitor operation's actual values.

Best Regards,
Yingliang Yang
Re: [Pacemaker] ping RA question
On Tue, 26 Jul 2011 18:41:25 +0800 Yingliang Yang zjut...@gmail.com wrote:
> 2011/7/22 Dan Urist wrote:
>> I am in the process of trying to write an fping RA, based on the pacemaker ping RA. My impetus for this is that I would like the RA to return success as soon as any ping succeeds; the behavior of linux's system ping as used in the standard ping RA is to run COUNT pings within the given deadline and only after COUNT or the deadline return success if any of the pings succeeded -- very inefficient.
>>
>> My question is this: the ping RA sets default values for OCF_RESKEY_CRM_meta_timeout and OCF_RESKEY_CRM_meta_interval, and it tests that OCF_RESKEY_CRM_meta_interval is an integer greater than 0. These variables aren't used anywhere else within the RA, but these are the same values in the actions section of the metadata for the monitor timeout and interval. I can't find any documentation that these variables serve as defaults for the monitor action in either the OCF agent developer guide or the pacemaker docs, but this seems to be the intent. Is this what they're there for?
>
> I think so. The values in the actions section of the metadata for the monitor timeout and interval are used as the monitor operation's default values.

Not to be pedantic, but that's not what the OCF RA developer's guide says. From http://www.linux-ha.org/doc/dev-guides/_metadata.html:

    Every action should list its own timeout value. This is a hint to the user what minimal timeout should be configured for the action. This is meant to cater for the fact that some resources are quick to start and stop (IP addresses or filesystems, for example), some may take several minutes to do so (such as databases).

    In addition, recurring actions (such as monitor) should also specify a recommended minimum interval, which is the time between two consecutive invocations of the same action. Like timeout, this value does not constitute a default - it is merely a hint for the user which action interval to configure, at minimum.

> OCF_RESKEY_CRM_meta_timeout and OCF_RESKEY_CRM_meta_interval come from the monitor operation's actual values.

Are you sure these variables are used for the monitor action? There's no documentation for these that I can find, either in the OCF RA developer's guide or here:

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-resource-options.html

I've grepped through all the resource agents under /usr/lib/ocf/resource.d (this is on a Debian Lenny system); the only thing I can see meta_timeout used for is calculating a reasonable shutdown timeout, and the only thing I see meta_interval used for is to detect a probe.

--
Dan Urist
dur...@ucar.edu
303-497-2459
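The probe check mentioned at the end of that message can be sketched in a few lines of shell. As far as I recall this mirrors what the ocf_is_probe helper in ocf-shellfuncs does, but treat the function below as an illustration rather than the shipped code; the name is_probe is hypothetical, and defaulting an unset interval to 0 is an assumption of this sketch.

```shell
# Illustrative re-implementation; is_probe is a hypothetical name.
# A probe is a one-off monitor, i.e. a monitor action with interval 0.
# Assumption: an unset interval is treated as 0.
is_probe() {
    [ "$__OCF_ACTION" = "monitor" ] && [ "${OCF_RESKEY_CRM_meta_interval:-0}" = "0" ]
}

__OCF_ACTION=monitor
OCF_RESKEY_CRM_meta_interval=0
is_probe && echo "probe"

OCF_RESKEY_CRM_meta_interval=10000
is_probe || echo "recurring monitor"
```

In this reading the meta variables carry the values of the operation actually being run, which is consistent with what Dan found by grepping the agents.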
Re: [Pacemaker] Cluster with DRBD : split brain
On Wed, Jul 20, 2011 at 11:36:25AM -0400, Digimer wrote:
> On 07/20/2011 11:24 AM, Hugo Deprez wrote:
>> Hello Andrew,
>>
>> in fact DRBD was in standalone mode but the cluster was working.
>>
>> Here is the syslog of the DRBD split brain:
>>
>> Jul 15 08:45:34 node1 kernel: [1536023.052245] block drbd0: Handshake successful: Agreed network protocol version 91
>> Jul 15 08:45:34 node1 kernel: [1536023.052267] block drbd0: conn( WFConnection -> WFReportParams )
>> Jul 15 08:45:34 node1 kernel: [1536023.066677] block drbd0: Starting asender thread (from drbd0_receiver [23281])
>> Jul 15 08:45:34 node1 kernel: [1536023.066863] block drbd0: data-integrity-alg: not-used
>> Jul 15 08:45:34 node1 kernel: [1536023.079182] block drbd0: drbd_sync_handshake:
>> Jul 15 08:45:34 node1 kernel: [1536023.079190] block drbd0: self BBA9B794EDB65CDF:9E8FB52F896EF383:C5FE44742558F9E1:1F9E06135B8E296F bits:75338 flags:0
>> Jul 15 08:45:34 node1 kernel: [1536023.079196] block drbd0: peer 8343B5F30B2BF674:9E8FB52F896EF382:C5FE44742558F9E0:1F9E06135B8E296F bits:769 flags:0
>> Jul 15 08:45:34 node1 kernel: [1536023.079200] block drbd0: uuid_compare()=100 by rule 90
>> Jul 15 08:45:34 node1 kernel: [1536023.079203] block drbd0: Split-Brain detected, dropping connection!
>> Jul 15 08:45:34 node1 kernel: [1536023.079439] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
>> Jul 15 08:45:34 node1 kernel: [1536023.083955] block drbd0: meta connection shut down by peer.
>> Jul 15 08:45:34 node1 kernel: [1536023.084163] block drbd0: conn( WFReportParams -> NetworkFailure )
>> Jul 15 08:45:34 node1 kernel: [1536023.084173] block drbd0: asender terminated
>> Jul 15 08:45:34 node1 kernel: [1536023.084176] block drbd0: Terminating asender thread
>> Jul 15 08:45:34 node1 kernel: [1536023.084406] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
>> Jul 15 08:45:34 node1 kernel: [1536023.084420] block drbd0: conn( NetworkFailure -> Disconnecting )
>> Jul 15 08:45:34 node1 kernel: [1536023.084430] block drbd0: error receiving ReportState, l: 4!
>> Jul 15 08:45:34 node1 kernel: [1536023.084789] block drbd0: Connection closed
>> Jul 15 08:45:34 node1 kernel: [1536023.084813] block drbd0: conn( Disconnecting -> StandAlone )
>> Jul 15 08:45:34 node1 kernel: [1536023.086345] block drbd0: receiver terminated
>> Jul 15 08:45:34 node1 kernel: [1536023.086349] block drbd0: Terminating receiver thread
>
> This was a DRBD split-brain, not a pacemaker split. I think that might have been the source of confusion.
>
> The split brain occurs when both DRBD nodes lose contact with one another and then proceed as StandAlone/Primary/UpToDate. To avoid this, configure fencing (stonith) in Pacemaker, then use 'crm-fence-peer.sh' in drbd.conf:
>
> ===
> disk {
>     fencing resource-and-stonith;
> }
> handlers {
>     outdate-peer "/path/to/crm-fence-peer.sh";
> }
> ===

Thanks, that is basically right. Let me fill in some details, though:

> This will tell DRBD to block (resource) and fence (stonith). DRBD will

The DRBD fencing options are "fencing resource-only" and "fencing resource-and-stonith". resource-only does *not* block IO while the fencing handler runs. resource-and-stonith does block IO.

> not resume IO until either the fence script exits with a success, or until an admin types 'drbdadm resume-io res'.
>
> The CRM script simply calls pacemaker and asks it to fence the other node.

No. It tries to place a constraint forcing the Master role off of any node but the one with the good data.

> When a node has actually failed, then the lost node is fenced. If both nodes are up but disconnected, as you had, then only the fastest node will succeed in calling the fence, and the slower node will be fenced before it can call a fence.

"fenced": may be restricted from being/becoming Master by that fencing constraint. Or, if pacemaker decided to do so, actually shot by some node-level fencing agent (stonith).

All that resource-level fencing by placing a constraint obviously only works as long as the cluster communication is still up. If not only the DRBD replication link had issues, but the cluster communication was down as well, it becomes a bit more complex.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
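As a sketch of the configuration discussed in this message: as far as I know, DRBD 8.3 names the handler fence-peer (outdate-peer being the older name for it), and crm-fence-peer.sh is usually paired with crm-unfence-peer.sh so that the constraint is removed again once the peer has resynced. The resource name r0, the function name, and the /usr/lib/drbd paths are assumptions; adjust them to the local installation.

```shell
# Prints a drbd.conf fragment. r0 and the script paths are assumptions.
drbd_fencing_fragment() {
    cat <<'EOF'
resource r0 {
  disk {
    fencing resource-and-stonith;   # block IO while the handler runs
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
EOF
}

drbd_fencing_fragment
```

With "fencing resource-only" in place of resource-and-stonith, the same handler would run but IO would not be blocked, as described above.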
Re: [Pacemaker] Cluster with DRBD : split brain
On 07/26/2011 11:43 AM, Lars Ellenberg wrote:
> On Wed, Jul 20, 2011 at 11:36:25AM -0400, Digimer wrote:
>> This was a DRBD split-brain, not a pacemaker split. I think that might have been the source of confusion.
>>
>> The split brain occurs when both DRBD nodes lose contact with one another and then proceed as StandAlone/Primary/UpToDate. To avoid this, configure fencing (stonith) in Pacemaker, then use 'crm-fence-peer.sh' in drbd.conf:
>>
>> ===
>> disk {
>>     fencing resource-and-stonith;
>> }
>> handlers {
>>     outdate-peer "/path/to/crm-fence-peer.sh";
>> }
>> ===
>
> Thanks, that is basically right. Let me fill in some details, though:
>
>> This will tell DRBD to block (resource) and fence (stonith). DRBD will
>
> The DRBD fencing options are "fencing resource-only" and "fencing resource-and-stonith". resource-only does *not* block IO while the fencing handler runs. resource-and-stonith does block IO.

Ahhh, that's why I was confused. I thought the 'resource' meant the same thing in both cases, but had only read the 'resource-and-stonith' section.

>> not resume IO until either the fence script exits with a success, or until an admin types 'drbdadm resume-io res'.
>>
>> The CRM script simply calls pacemaker and asks it to fence the other node.
>
> No. It tries to place a constraint forcing the Master role off of any node but the one with the good data.

Ok, I thought it was akin to the 'obliterate-peer.sh' script, which calls 'fence_node'... I made an assumption, which was not correct.

>> When a node has actually failed, then the lost node is fenced. If both nodes are up but disconnected, as you had, then only the fastest node will succeed in calling the fence, and the slower node will be fenced before it can call a fence.
>
> "fenced": may be restricted from being/becoming Master by that fencing constraint. Or, if pacemaker decided to do so, actually shot by some node-level fencing agent (stonith).
>
> All that resource-level fencing by placing a constraint obviously only works as long as the cluster communication is still up. If not only the DRBD replication link had issues, but the cluster communication was down as well, it becomes a bit more complex.

Thanks for the clarity. Today I learned. :)

--
Digimer
E-Mail: digi...@alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?
Re: [Pacemaker] The version about pacemaker
On Tue, Jul 26, 2011 at 7:15 PM, Yingliang Yang zjut...@gmail.com wrote:
> Hi, Andrew
>
> I'd like to know whether the latest version (c86cb93c5a57) of pacemaker codes on the site is stable?

Yes. It is.

> If not, how about Pacemaker 1.1.5 (01e86afaaa6d)?
>
> BTW, I'd like to know when the version 1.1.6 will get released, will it be the NEAR future?

I hope so :-(
Lately I've been required to work on some other things but I should be handing over responsibilities for those tasks and be back on pacemaker full time very soon.

> Best Regards,
> Yingliang Yang
Re: [Pacemaker] Resources are not restarted on definition change after f59d7460bdde (devel)
On Fri, Jul 1, 2011 at 4:59 PM, Andrew Beekhof and...@beekhof.net wrote:
> Hmm. Interesting. I will investigate.

This is an unfortunate side-effect of my history compression patch. Since we only store the last successful and last failed operation, we don't have the md5 of the start operation around to check when a resource's definition is changed.

Solutions appear to be either:
a) give up the space savings and revert the history compression patch
b) always restart a resource if a non-matching md5 is detected - even if the operation was a recurring monitor

I'd favor b) along with dropping the per-operation parameters. The only valid use-case I've heard for those is setting OCF_LEVEL or depth or whatever it was called - and I think we're in basic agreement that we need a better solution for that anyway. Perhaps promoting it to be an attribute of the op tag (along with timeout etc).

> On Tue, Jun 28, 2011 at 3:46 AM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>> Hi all,
>>
>> I'm pretty sure I bisected the commit which breaks restart of (node local) resources after a definition change. Nodes which have f59d7460bdde applied (v03-a and v03-b in my case) do not restart such resources, while the node without this commit (mgmt01) does.
>>
>> Here is a snippet from the DC (grrr, thunderbird does not like long lines):
>>
>> Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition: Parameters to libvirt-install-fs:0_start_0 on mgmt01 changed: recorded a2a2341cf3c157a1b44dd9ed7068e2dd vs. 31e7242629b49443f536c22192debb15 (all:3.0.5) 0:0;150:2:0:62c60b6a-17e8-4dbf-8291-a01e7ea06b6a
>> Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition: Parameters to libvirt-install-fs:0_monitor_36 on mgmt01 changed: recorded 346bad4576870d644109c1e6233002aa vs. d9c16f21c130ae8da55d8eac0b6c6cdc (all:3.0.5) 0:0;153:2:0:62c60b6a-17e8-4dbf-8291-a01e7ea06b6a
>> Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition: Parameters to libvirt-install-fs:0_monitor_24 on mgmt01 changed: recorded fbdf86bce136d60e21c1ef1fad451c0d vs. 11cd729f3313767ad7383c42495e612b (all:3.0.5) 0:0;152:2:0:62c60b6a-17e8-4dbf-8291-a01e7ea06b6a
>> Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition: Parameters to libvirt-install-fs:0_monitor_12 on mgmt01 changed: recorded 34e9fed5be3737e563b47b0c3e353db1 vs. 54b02cd722053809bd0b1a3619adfd3b (all:3.0.5) 0:0;151:2:0:62c60b6a-17e8-4dbf-8291-a01e7ea06b6a
>> Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition: Parameters to libvirt-install-fs:1_monitor_36 on v03-a changed: recorded 346bad4576870d644109c1e6233002aa vs. d9c16f21c130ae8da55d8eac0b6c6cdc (all:3.0.5) 0:0;177:2:0:9b3096b4-6add-4612-937c-f7013b18fd15
>> Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition: Parameters to libvirt-install-fs:1_monitor_24 on v03-a changed: recorded fbdf86bce136d60e21c1ef1fad451c0d vs. 11cd729f3313767ad7383c42495e612b (all:3.0.5) 0:0;176:2:0:9b3096b4-6add-4612-937c-f7013b18fd15
>> Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition: Parameters to libvirt-install-fs:1_monitor_12 on v03-a changed: recorded 34e9fed5be3737e563b47b0c3e353db1 vs. 54b02cd722053809bd0b1a3619adfd3b (all:3.0.5) 0:0;175:2:0:9b3096b4-6add-4612-937c-f7013b18fd15
>> Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition: Parameters to libvirt-install-fs:2_monitor_36 on v03-b changed: recorded 346bad4576870d644109c1e6233002aa vs. d9c16f21c130ae8da55d8eac0b6c6cdc (all:3.0.5) 0:0;182:3:0:76ced8fb-1f7b-4a40-898c-a134b816b791
>> Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition: Parameters to libvirt-install-fs:2_monitor_24 on v03-b changed: recorded fbdf86bce136d60e21c1ef1fad451c0d vs. 11cd729f3313767ad7383c42495e612b (all:3.0.5) 0:0;181:3:0:76ced8fb-1f7b-4a40-898c-a134b816b791
>> Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition: Parameters to libvirt-install-fs:2_monitor_12 on v03-b changed: recorded 34e9fed5be3737e563b47b0c3e353db1 vs. 54b02cd722053809bd0b1a3619adfd3b (all:3.0.5) 0:0;180:3:0:76ced8fb-1f7b-4a40-898c-a134b816b791
>>
>> Then the resource is restarted on mgmt01 but not on the other nodes. The first line of the log snippet (for the libvirt-install-fs:0_start_0 operation) does not appear for start ops for resources on the other nodes. The only difference between the pacemaker builds is that commit.
>>
>> Hope this information could help to fix this (if not already done).
>>
>> Best,
>> Vladislav
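Option b) from the reply above can be sketched in a few lines. This is only an illustration of the idea (compare a recorded digest against one computed from the current parameters, and restart on any mismatch regardless of which operation recorded it). The function names and the way the digest is built from sorted key=value pairs are assumptions made for the example; they are not Pacemaker's actual digest calculation.

```python
import hashlib


def params_digest(params):
    """Hypothetical digest: md5 over sorted key=value pairs.

    This is NOT how Pacemaker computes its digests; it only stands in
    for "some stable hash of the resource parameters".
    """
    canonical = ";".join(f"{key}={value}" for key, value in sorted(params.items()))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def needs_restart(recorded_digest, current_params):
    """Option b): any digest mismatch means restart, whether the digest was
    recorded for a start or for a recurring monitor operation."""
    return recorded_digest != params_digest(current_params)
```

With this approach it no longer matters that the digest of the start operation was dropped by history compression: the digest recorded with the last monitor is enough to notice the change.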
Re: [Pacemaker] Resource Group Questions - Start/Stop Order
On Thu, Jul 21, 2011 at 2:36 AM, Bobbie Lind bl...@sms-fed.com wrote:
> Hi group,
>
> I am running a 6 node system, 4 of which mount the LUNs for my Lustre file system. I currently have 29 LUNs per server set up in 4 Resource Groups. I understand the default startup/shutdown order of the resources, but I was wondering if there is a way to override that and have all the resources in the group start up or shut down at the same time.
>
> Ideally what I am looking for is all the resources in the group OSS1group to start up and shut down at the same time, since none of them are dependent on each other; they just belong on the same server.

I'd suggest just not using a group in this case. If all you want is colocation, use a colocation set.

> I found this thread here http://www.gossamer-threads.com/lists/linuxha/pacemaker/60893 which talks about non-ordered groups and I think that is what I need, but I am at a loss as to how to find the parameters/attributes of the group to set it up.
>
> Is it possible to override the default action of the resource group's startup/shutdown order? Can someone point me to some documentation where I can find the available parameters that can be set for groups?
>
> I have attached my configuration in case it's needed, and I am running Pacemaker 1.0.11
>
> Bobbie Lind
> Systems Engineer
> Solutions Made Simple, Inc (SMSi)
> 703-296-3087 (Cell)
> bl...@sms-fed.com
Re: [Pacemaker] Upgrading from 1.0 to 1.1
On Tue, Jul 19, 2011 at 5:40 PM, Proskurin Kirill k.prosku...@corp.mail.ru wrote:
> On 07/19/2011 03:22 AM, Andrew Beekhof wrote:
>> On Fri, Jul 15, 2011 at 10:33 PM, Proskurin Kirill k.prosku...@corp.mail.ru wrote:
>>> Hello all.
>>>
>>> I found that I am using corosync with pacemaker ver: 0 with pacemaker 1.1.5 installed, i.e. without starting pacemakerd. Sounds wrong. :-) So I tried to upgrade.
>>>
>>> I shut down one node, changed 0 to 1 in service.d/pcmk, started corosync and then started pacemakerd via the init script. But this node stays online and on the cluster's DC I see:
>>>
>>> cib: [18392]: WARN: cib_peer_callback: Discarding cib_sync_one message (255) from mysender10.example.com: not in our membership
>>
>> That's odd. The only thing you changed was ver: 0 to ver: 1?
>
> Yes, only this. To make it more clear: I have 4 nodes with ver 0, tried to add one with ver 1 and got this.
>
> So I shut down all nodes, changed all of them to 1 and started them up, and all was OK. Not a really good way to upgrade, but I don't have time.

Do you still have the logs for the failure case? I'd really like to see them.
Re: [Pacemaker] unexpected Error in Log files
On Tue, Jul 19, 2011 at 4:00 PM, rakesh rakirocker4...@gmail.com wrote:
> Hi
>
> I configured a cluster which consists of four nodes and started Heartbeat/pacemaker on all four. After some time the 4th node went down unexpectedly, and I found the following error messages while going through the log files (ha-debug and messages.log). Can you please help me out with this? Please find the messages from the log file below.
>
> Jun 10 12:55:46 node4 ccm: 2011 Jun 10 12:55:46 PDT -0700 ccm: Cannot append to /var/log/ha-debug: File too large
> Jun 10 12:55:46 node4 stonithd: 2011 Jun 10 12:55:46 PDT -0700 stonithd: Cannot append to /var/log/ha-debug: File too large
> Jun 10 12:55:46 node4 last message repeated 6 times
> Jun 10 12:55:46 node4 cib: 2011 Jun 10 12:55:46 PDT -0700 cib: Cannot append to /var/log/ha-debug: File too large
> Jun 10 12:55:46 node4 cib: 2011 Jun 10 12:55:46 PDT -0700 cib: Cannot append to /var/log/ha-debug: File too large

You might want to do something about that.

> Jun 10 12:55:46 node4 ccm: 2011 Jun 10 12:55:46 [17142]: WARN: cib_peer_callback: Discarding cib_apply_diff message (181) from node1: not in our membership

All this says is that node1 left the cluster. There is no way for us to know why based on this one line.
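As a small aid for the "File too large" messages above, here is a sketch that reports how much room is left in a log file before it reaches a size limit. The 2 GiB - 1 default is an assumption (the classic ceiling for programs built without large-file support); the real limit on a given system may instead come from a per-process file size limit, so treat the number and the function name as illustrative only.

```python
import os

# Assumed limit: 2 GiB - 1 bytes. The actual ceiling on a given system may differ.
ASSUMED_LIMIT = 2**31 - 1


def log_headroom(path, limit=ASSUMED_LIMIT):
    """Return how many bytes can still be appended to `path` before it
    reaches `limit`. Never returns a negative number."""
    return max(0, limit - os.path.getsize(path))
```

A result of 0 for /var/log/ha-debug would be consistent with the errors in the log; rotating or truncating the file is the usual remedy.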
Re: [Pacemaker] Reload action and stop/start sequence questions
On Mon, Jul 11, 2011 at 5:45 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
> Hi all,
>
> Would somebody (Andrew?) please shed some light on how exactly redefinition of a resource is supposed to be handled? Below is my (rather perfectionistic) vision of this, please correct me if/where I'm wrong:
>
> * If the RA supports the 'reload' action then it is called on resource definition change (instead of stop/start).

Only if the attribute changed was NOT marked as unique in the metadata.

> * If the 'reload' action fails then the usual start/stop sequence is executed. This would give the RA a chance to refuse to reload if some key properties change, while allowing it to tune some secondary resource parameters. Of course, the RA should leave the resource in a usable state, so failure of the reload action should indicate the RA's denial to do a reload. How to differentiate that from real reload failures?

Either way the resource needs to be restarted. So there is no need for differentiation.

> Is there some special exit code for that?
>
> * Dependent resources should not be stopped/started for the 'reload' action. Of course they are restarted if reload fails and stop/start is executed then. (I see that they are restarted now for reload of a resource they depend on, is it a bug?)

More like a limitation. Which is a roundabout way of saying really hard to fix bug. You're welcome to create a BZ for it though, maybe one day I'll figure out how to resolve it.

> * (wish) Resources should be migrated out of a node (if they support live migration) for the stop/start sequence of a resource they depend on.

Migration can only occur if a resource is at the bottom (excluding any clones) of the resource stack. In order to migrate, any colocation dependencies need to be running at _both_ the old and the new locations. This can only be true for resources that depend on clones.

> * (wish) Redefinition of clones should be handled in a way which allows dependent live-migratable resources to survive (if the reload action for a clone instance either is not supported or fails).

This doesn't make sense. If the definition of one clone changes, then they all change and there is nowhere for dependent resources to migrate to.

> That is: dependent resources which support live migration are first tried to migrate out of one node, and are stopped if migration fails. Then the clone instance is restarted on that node. Then the same procedure applies to the next cluster node, so resources may return back to the first node.
>
> If the above (at least the first three points) is right, then is it possible to get the set of previous instance parameters the same way the new configuration is passed (env vars), or should the RA save that information itself in advance?
>
> Best,
> Vladislav
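The first two points of the exchange above (reload only when no changed attribute is marked unique, and a failed reload simply falls back to a restart) can be sketched as follows. The function and argument names are hypothetical and exist only for this illustration; this is not code from Pacemaker.

```python
def choose_action(supports_reload, changed_params, unique_params):
    """Pick 'reload' only if the agent supports it and none of the changed
    parameters is marked unique in its metadata; otherwise 'restart'."""
    touches_unique = bool(set(changed_params) & set(unique_params))
    if supports_reload and not touches_unique:
        return "reload"
    return "restart"


def final_action(action, reload_succeeded):
    """A failed reload is not distinguished by cause: the resource is
    restarted either way, so no special exit code is needed."""
    if action == "reload" and not reload_succeeded:
        return "restart"
    return action
```

For example, changing only a non-unique parameter on an agent that supports reload yields "reload", and if that reload then fails the outcome is "restart".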
Re: [Pacemaker] Business Standard India - Errors noticed in pacemaker
Message: 2
Date: Tue, 26 Jul 2011 17:00:56 +1000
From: Andrew Beekhof and...@beekhof.net
To: Pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Cluster type is: corosync
Message-ID: CAEDLWG1s=ahrdxdwdo0r04410j0+ygj7vfaz_yf_0fmdpin...@mail.gmail.com
Content-Type: text/plain; charset=UTF-8

On Mon, Jul 25, 2011 at 7:18 PM, Proskurin Kirill k.prosku...@corp.mail.ru wrote:
> 25.07.2011 10:10, Andrew Beekhof wrote:
>> Which packages are you using?
>
> It is your official source from the repository, which I built myself.

Ok. And did you add the pacemaker configuration options to corosync's config file?

> pacemaker-1.1.5
> corosync-1.4.0
> cluster-glue-1.0.6
> openais-1.1.2
>
> All nodes have the same rpms.
>
>> On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill k.prosku...@corp.mail.ru wrote:
>>> Hello again!
>>>
>>> Hope I'm not flooding too much here, but I have another problem.
>>>
>>> I installed the same rpms of corosync, openais, pacemaker and cluster_glue on all nodes. I checked it twice. And then I start some of them and they can't connect to the cluster and stay offline. In the logs I see that they see the other nodes and connectivity is OK.
>>>
>>> But I found the difference. Online nodes in the cluster have:
>>>
>>> [root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
>>> Jul 22 20:38:58 mysender39.mail.ru stonith-ng: [3499]: info: get_cluster_type: Cluster type is: 'openais'.
>>> Jul 22 20:38:58 mysender39.mail.ru attrd: [3502]: info: get_cluster_type: Cluster type is: 'openais'.
>>> Jul 22 20:38:58 mysender39.mail.ru cib: [3500]: info: get_cluster_type: Cluster type is: 'openais'.
>>> Jul 22 20:38:59 mysender39.mail.ru crmd: [3504]: info: get_cluster_type: Cluster type is: 'openais'.
>>>
>>> Offline nodes have:
>>>
>>> [root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
>>> Jul 22 13:39:17 mysender2.mail.ru stonith-ng: [9028]: info: get_cluster_type: Cluster type is: 'corosync'.
>>> Jul 22 13:39:17 mysender2.mail.ru attrd: [9031]: info: get_cluster_type: Cluster type is: 'corosync'.
>>> Jul 22 13:39:17 mysender2.mail.ru cib: [9029]: info: get_cluster_type: Cluster type is: 'corosync'.
>>> Jul 22 13:39:18 mysender2.mail.ru crmd: [9033]: info: get_cluster_type: Cluster type is: 'corosync'.
>>>
>>> What's wrong and how can I fix it?
>>>
>>> --
>>> Best regards,
>>> Proskurin Kirill

Message: 3
Date: Tue, 26 Jul 2011 16:12:25 +0900
From: Yuusuke IIDA iiday...@intellilink.co.jp
To: pacemaker@oss pacemaker@oss.clusterlabs.org
Cc: tanaka...@intellilink.co.jp
Subject: [Pacemaker] Please teach it about handling of the unmanaged resource in environment setting placement-strategy.
Message-ID: 4e2e68d9.6010...@intellilink.co.jp
Content-Type: text/plain; charset=iso-2022-jp

Hi, Yan
Hi, Andrew

I used the placement-strategy feature and found some behavior that worries me.

A resource that has become unmanaged is running on node act3. A resource that was running on node act1 then failed and was moved. I expected this failed resource to move to node sby1, but that did not happen.

Is it correct, as specified behavior, that another resource with a capacity requirement is moved onto a node whose capacity is already taken up by an unmanaged resource? I would like this to be revised so that placement is decided with the capacity of unmanaged resources taken into account.

I attach the crm_report from when the problem happened.

Best Regards,
Yuusuke
--
METRO SYSTEMS CO., LTD
Yuusuke Iida
Mail: iiday...@intellilink.co.jp

Attachment: pcmk-Tue-26-Jul-2011.tar.bz2 (application/octet-stream, 164949 bytes)
URL: http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20110726/66852f5b/attachment.obj

End of Pacemaker Digest, Vol 44, Issue 50
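The comparison done by hand with grep in the "Cluster type is" message of this digest can be repeated across many nodes with a short script. The sketch below extracts the daemon name and the reported cluster type from log lines shaped like the ones quoted in that message; the regular expression is written against those quoted lines only and is an assumption about the format, not a documented log specification.

```python
import re

# Matches e.g. "... cib: [3500]: info: get_cluster_type: Cluster type is: 'openais'."
CLUSTER_TYPE_LINE = re.compile(
    r"(?P<daemon>[\w-]+): \[\d+\]: info: get_cluster_type: "
    r"Cluster type is: '(?P<type>\w+)'"
)


def cluster_types(lines):
    """Map each daemon name to the cluster type it last reported."""
    found = {}
    for line in lines:
        match = CLUSTER_TYPE_LINE.search(line)
        if match:
            found[match.group("daemon")] = match.group("type")
    return found
```

Running this over the log of each node and comparing the resulting dictionaries would show at a glance which nodes report 'openais' and which report 'corosync'.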
Re: [Pacemaker] Initial quorum
On Thu, Jul 21, 2011 at 4:13 PM, pskrap psk...@hotmail.com wrote:
> Devin Reade gdr@... writes:
>> --On Wednesday, July 20, 2011 09:19:33 AM + pskrap pskrap@... wrote:
>>> I have a cluster where some of the resources cannot run on the same node. All resources must be running to provide a functioning service. This means that a certain amount of nodes needs to be up before it makes sense for the cluster to start any resources.
>>
>> Without knowing anything about your application, I would tend to question this statement. Is it true that you must not start *any* resources before you have enough nodes, or is it sufficient to say that the application is not considered up until all resources are started? It may not make sense to run any, but does it do any harm?
>>
>> If you *can* start at least some resources before all nodes are available, then I would expect that you could get by with defining colocation constraints to ensure that some resources don't run on the same nodes, perhaps augmenting things with some order constraints if necessary.
>>
>> If your applications die or do other horrible stuff when only some subset are running then I'd have a talk with your application developers as it sounds like a larger robustness problem.
>>
>> Devin
>
> No, there are no crash issues etc. when all resources are not running. The application is just not usable until all resources are started. As for the harm, the resources which have constraints preventing them from running will fail,

Are you talking about constraints in the pacemaker config or some other kind?

> but I guess they will recover as more nodes are added. The harm is mostly in the fact that starting nodes one by one will cause the resources to be unevenly distributed over the nodes, since everything will start on the nodes in the order they are installed.
>
> I know I can give a preferred node to a resource and allow it to relocate when it becomes available. However, this application provides a real-time service, so I only want resources to relocate when it is absolutely necessary. Therefore I have given the resources a preferred node, but do not allow them to relocate when it becomes available.
>
> So I guess the overall harm is limited even though it exists. I was just looking for a cleaner startup for the system. Since you did not mention any way to do what my question was about, I assume it is currently not possible to do what I asked for. I do think such an option would be useful though. Logically it does not make sense for the cluster to be starting resources for an application before the cluster has enough nodes for the application to be able to run.
Re: [Pacemaker] Business Standard India - Errors related to pacemaker in server message file
Today's Topics:
1. Re: Business Standard India - Errors noticed in pacemaker (Andrew Beekhof)
2. Re: Cluster type is: corosync (Andrew Beekhof)
3. Please teach it about handling of the unmanaged resource in environment setting placement-strategy. (Yuusuke IIDA)

Message: 1
Date: Tue, 26 Jul 2011 16:02:30 +1000
From: Andrew Beekhof and...@beekhof.net
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Cc: M A Faruqui m.faru...@bsmail.in, Prafulla H Patil prafulla.pa...@bsmail.in, Bandana Roy bandana@bsmail.in
Subject: Re: [Pacemaker] Business Standard India - Errors noticed in pacemaker
Message-ID: caedlwg0exctnrjq8kduv8jdc3hztvsovb7gdnyv4mvppox_...@mail.gmail.com
Content-Type: text/plain; charset=iso-8859-1

Thought about googling the error?

On Mon, Jul 25, 2011 at 9:17 PM, Gururaj B Patil gururaj.pa...@bsmail.in wrote:
> To support staff,
>
> We in Business Standard Ltd. use pacemaker as the clustering application for one of our websites. Two servers are in clustering mode. One of the servers is a web server and the other one is a mysql db server. Pacemaker handles Mysql clustering at block level.
>
> We have noticed the same type of notice and warning in the server's message file. Errors are as below.
>
> ---
> Messages like below appear every 15 minutes
> sidrbd0 pengine: [3143]: notice: get_failcount: Failcount for sipdu1 on sidrbd0 has expired (limit was 20s)
> sidrbd0 pengine: [3143]: ERROR: create_notification_boundaries: Creating boundaries for mysql-ms-drbd
> ---
>
> I have registered for the pacemaker mailing list also.
>
> Regards,
> Gururaj Patil
> Systems Department
> Business Standard Ltd.
> H3/4, Paragon center, P.B.Marg, Worli
> Mumbai - 400013 India
> Ph.+91-22-24971924