[Pacemaker] OCF module for SAP live cache
Hello All,

We have an SAP server which includes SAP Live Cache. I'm not able to find an OCF script in the pacemaker or heartbeat classes, so I would like to ask whether a module like this exists or not.

Thanks, best regards,
Jozef

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] A question and demand to a resource placement strategy function
On 08/22/11 22:09, Vladislav Bogdanov wrote:
> Hi Yan,
>
> 27.04.2011 08:14, Yan Gao wrote:
> [snip]
>>> Do priorities work for the utilization strategy?
>> Yes, the improvement works for the utilization, minimal and balanced strategies:
>> - The nodes that are more healthy and have more capacity get consumed first (globally preferred nodes).
> Is this still valid for the current tip?

Yes.

> I tried to use utilization to place resources (virtual machines) on nodes based on four parameters: cpu usage, ram usage and storage I/O usage (for two storages). While I see almost perfect balancing on the first two parameters, the other two (storage I/O) are not considered at all for resource placement. Here are excerpts from ptest -LUs output:
>
> Original: v03-a capacity: vds-ok-pool-0-usage= vds-ok-pool-1-usage= cpu-decipct=11000 ram-hugepages-mb=44360
> Original: v03-b capacity: vds-ok-pool-0-usage= vds-ok-pool-1-usage= cpu-decipct=11000 ram-hugepages-mb=44360
> ...
> native_color: vptest1.vds-ok.com-vm allocation score on v03-a: 0
> native_color: vptest1.vds-ok.com-vm allocation score on v03-b: 0
> calculate_utilization: vptest1.vds-ok.com-vm utilization on v03-b: vds-ok-pool-1-usage=100 cpu-decipct=330 ram-hugepages-mb=1024
> native_color: vptest2.vds-ok.com-vm allocation score on v03-a: 0
> native_color: vptest2.vds-ok.com-vm allocation score on v03-b: 0
> calculate_utilization: vptest2.vds-ok.com-vm utilization on v03-a: vds-ok-pool-0-usage=100 cpu-decipct=330 ram-hugepages-mb=1024
> ...
> Remaining: v03-a capacity: vds-ok-pool-0-usage=6799 vds-ok-pool-1-usage= cpu-decipct=110 ram-hugepages-mb=10568
> Remaining: v03-b capacity: vds-ok-pool-0-usage=9899 vds-ok-pool-1-usage=6899 cpu-decipct=980 ram-hugepages-mb=10568
>
> After that, the virtual machines were placed in such a way that one node uses only the first storage, and the other uses almost only the second one. Am I missing something?

When allocating each resource, we compare the capacity of the nodes. The node that has more remaining capacity is preferred.
This would be quite clear if we only defined one kind of capacity. If we define multiple kinds of capacity, however, it works like this, for example:

- If nodeA has more cpus remaining, while nodeB has more ram and storage remaining, nodeB has more capacity.
- If nodeA has more cpus and storage1 remaining, and nodeB has more ram and storage2 remaining, they have equal capacity. In that case the first listed node is preferred.

> Also, could you please describe the algorithm used for placement in a little more detail, so I would ask fewer stupid questions. Does it have something to do with linear programming (e.g. http://en.wikipedia.org/wiki/Simplex_method)?

An optimal solution definitely requires some mathematical optimization method, though we don't use one so far. It would be somewhat complicated to introduce and to combine with the current allocation factors. The policy for choosing a preferred node is as described above. The order for choosing which resource to allocate next is:

1. The resource with the higher priority gets allocated first.
2. If their priorities are equal, check whether they are already running. The resource that has the higher score on the node where it's running gets allocated first. (This was recently improved by the work of Andrew and Yuusuke to prevent resource shuffling.)
3. The resource that has the higher score on its preferred node gets allocated first.

Regards,
Gaoyan
--
Gao,Yan y...@suse.com
Software Engineer
China Server Team, SUSE.
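To make the node-comparison policy above concrete, here is a minimal sketch in Python. This is an illustration of the described policy only, not the pengine implementation; the function name and the capacity figures are invented:

```python
def preferred_node(cap_a, cap_b, first_listed="A"):
    """Compare two nodes' remaining capacities kind by kind.

    The node that wins on more kinds of capacity is considered to have
    'more capacity'; on a tie, the first listed node is preferred.
    Hypothetical sketch only.
    """
    a_wins = sum(1 for k in cap_a if cap_a[k] > cap_b[k])
    b_wins = sum(1 for k in cap_b if cap_b[k] > cap_a[k])
    if a_wins > b_wins:
        return "A"
    if b_wins > a_wins:
        return "B"
    return first_listed  # equal capacity: first listed node wins

# nodeA has more cpus; nodeB has more ram and storage -> nodeB preferred
print(preferred_node({"cpu": 8, "ram": 4, "storage": 10},
                     {"cpu": 4, "ram": 16, "storage": 20}))  # -> B

# nodeA has more cpus and storage1; nodeB has more ram and storage2
# -> equal capacity, so the first listed node is preferred
print(preferred_node({"cpu": 8, "ram": 4, "storage1": 20, "storage2": 1},
                     {"cpu": 4, "ram": 16, "storage1": 2, "storage2": 9}))  # -> A
```

Note that this per-kind vote also shows why a capacity kind with an empty value (as in the `vds-ok-pool-*-usage=` excerpts above) can silently drop out of the comparison.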
Re: [Pacemaker] compression with heartbeat doesn't seem to work
On Fri, Aug 19, 2011 at 08:02:24AM -0500, Schaefer, Diane E wrote:
> Hi, we are running a two-node cluster using pacemaker 1.1.5-18.1 with heartbeat 3.0.4-41.1. We are experiencing what seem like network issues and cannot make heartbeat recover. We are getting "message too long" errors and the systems can no longer sync. Our ha.cf is as follows:
>
> autojoin none
> use_logd false
> logfacility daemon
> debug 0
> # use the v2 cluster resource manager
> crm yes
> # the cluster communication happens via unicast on bond0 and hb1
> # hb1 is direct connect
> ucast hb1 169.254.1.3
> ucast hb1 169.254.1.4
> ucast bond0 172.28.102.21
> ucast bond0 172.28.102.51
> compression zlib
> compression_threshold 30

I suggest you try

compression bz2
compression_threshold 30
traditional_compression yes

The reason is: traditional compression compresses the full packet if the uncompressed message size exceeds the compression_threshold. Non-traditional compression compresses only the message field values which are marked as to-be-compressed, and unfortunately pacemaker does not always mark larger message fields in this way.

Note that you are still limited to 64kB in total [*], so if you have a huge cib (many nodes, many resources, especially many cloned resources), the status section of the cib in particular may grow too large.

[*] The theoretical maximum payload of a single UDP datagram; the heartbeat messaging layer does not spread message payload over multiple datagrams, and that is unlikely to change unless someone really invests non-trivial amounts of developer time and money into extending it.

You should probably consider moving to corosync (>= 1.4.x), which spreads messages over as many datagrams as needed, up to a maximum message size of 1 MByte, iirc. Note that I avoid the term "fragment" here, because each datagram itself will typically be fragmented into pieces of MTU size.
In any case, you obviously need a very reliable network stack: the more fragments you need to transmit a single message, the less fragment loss you can tolerate. And UDP fragments may be among the first things that get dropped on the floor if the network stack experiences memory pressure.

> # msgfmt
> msgfmt netstring
> # a node will be flagged as dead if there is no response for 20 seconds
> deadtime 30
> initdead 30
> keepalive 250ms
> uuidfrom nodename
> # these are the node names participating in the cluster
> # the names should match uname -n output on the system
> node usrv-qpr2
> node usrv-qpr5
>
> We can ping all interfaces from both nodes. One of the bonded NICs had some trouble, but we believe we have enough redundancy built in that it should be fine. The issue we see is that if we reboot the non-DC node, it can no longer sync with the DC. The log from the non-DC node shows "remote node cannot be reached". crm_mon on the non-DC node shows:
>
> Last updated: Fri Aug 19 07:39:05 2011
> Stack: Heartbeat
> Current DC: NONE
> 2 Nodes configured, 2 expected votes
> 26 Resources configured.
> Node usrv-qpr2 (87df4a75-fa67-c05e-1a07-641fa79784e0): UNCLEAN (offline)
> Node usrv-qpr5 (7fb57f74-fae5-d493-e2c7-e4eda2430217): UNCLEAN (offline)
>
> From the DC it looks like all is well. I tried a cibadmin -Q from the non-DC node and it can no longer contact the remote node. I tried a cibadmin -S from the non-DC node to force a sync, which times out with "Call cib_sync failed (-41): Remote node did not respond."
On the DC side I see this:

Aug 19 07:38:20 usrv-qpr2 heartbeat: [23249]: ERROR: write_child: write failure on ucast bond0.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 172.28.102.51:694 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: write_child: write failure on ucast bond0.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: glib: ucast_write: Unable to send HBcomm packet hb1 169.254.1.3:694 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: write_child: write failure on ucast hb1.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: glib: ucast_write: Unable to send HBcomm packet hb1 169.254.1.4:694 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: write_child: write failure on ucast hb1.: Message too long
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 22 for usrv-qpr5: seqno too low
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, fromnode's ackseq = 244435
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943,
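The len=83696 in these errors is consistent with the 64kB single-datagram limit discussed above: the packet simply cannot fit into one UDP datagram. A quick back-of-the-envelope check:

```python
# Maximum payload of a single IPv4 UDP datagram: the 16-bit IP total
# length field allows 65535 bytes, minus 20 bytes of IPv4 header and
# 8 bytes of UDP header.
MAX_UDP_PAYLOAD = 65535 - 20 - 8   # 65507 bytes

packet_len = 83696  # from the "Message too long" errors above
print(packet_len > MAX_UDP_PAYLOAD)   # True -> sendto() fails ("Message too long")
print(packet_len - MAX_UDP_PAYLOAD)   # 18189 bytes over the limit
```

"Message too long" is exactly the EMSGSIZE error the kernel returns when a send exceeds the maximum datagram size.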
[Pacemaker] Re: OCF module for SAP live cache
Janec, Jozef jozef.ja...@hp.com wrote on 23.08.2011 08:55:34:
> Hello All, We have an SAP server which includes SAP Live Cache. I'm not able to find an OCF script in the pacemaker or heartbeat classes, so I would like to ask whether a module like this exists or not. Thanks, best regards, Jozef

Hi Jozef,

short answer: No. We currently have the two agents SAPInstance and SAPDatabase, but neither is suitable for managing a SAP LiveCache.

If you don't want to monitor the LiveCache, I would suggest including the LiveCache start/stop commands in the USEREXITS of the SAP Central Instance (if it is in the same resource group). Monitoring would need a new resource agent, because LiveCache needs a completely different command set than all the other SAP components. Currently I don't know if it is important enough to many people.

I would expect SAP to support LiveCache in the future with saphostctrl. I prepared the changes to the SAPDatabase RA to work with saphostctrl quite a while ago, but have not posted them yet, because I'm still missing some features in saphostctrl that would make it really usable as a HA tool. If both of those happen, you may in the future be able to use SAPDatabase for LiveCache (including monitoring).

Best regards,
Alex
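By way of illustration, a userexit hook of the kind described might look like the sketch below. Everything in it is an assumption for illustration only: the parameter name POST_START_USEREXIT, the SID LC1, and the lcinit invocation are placeholders -- check your SAPInstance RA version and your LiveCache installation for the real hooks and start command.

```shell
#!/bin/sh
# Hypothetical userexit script, e.g. pointed to by a SAPInstance RA
# userexit parameter such as POST_START_USEREXIT (name assumed).
# LC1 and lcinit are illustrative placeholders; substitute the real
# LiveCache SID and start command for your installation.
LC_SID=LC1
LC_ADM="sqd$(echo "$LC_SID" | tr '[:upper:]' '[:lower:]')"

su - "$LC_ADM" -c "lcinit $LC_SID start"
```

Started this way, the LiveCache would be brought up and down with the Central Instance, but (as noted above) not monitored by the cluster.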
Re: [Pacemaker] migration fix for ocf:heartbeat:Xen
Message: 7
Date: Thu, 11 Aug 2011 21:07:00 +
From: Daugherity, Andrew W adaugher...@tamu.edu
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] migration fix for ocf:heartbeat:Xen
Message-ID: 93b5e618-ad19-4993-8066-cb4f8e4ef...@tamu.edu

I have discovered that sometimes when migrating a VM, the migration itself will succeed, but the migrate_from call on the target node will fail, as apparently the status hasn't settled down yet. This is more likely to happen when stopping pacemaker on a node, causing all its VMs to migrate away. Migration succeeds, but then (sometimes) the status call in migrate_from fails, and the VM is unnecessarily stopped and started. Note that it is NOT a timeout problem, as the migrate_from operation (which only checks status) takes less than a second.

I noticed the VirtualDomain RA does a loop rather than just checking the status once as the Xen RA does, so I patched a similar thing into the Xen RA, and that solved my problem. (patch/logs snipped)

No comments? What does it take to get this patch accepted? I'd much rather use the mainline version than have to reapply my patch after every HAE update. I guess I could open an SR with Novell, but this is ultimately an upstream issue.

Andrew Daugherity
Systems Analyst
Division of Research, Texas A&M University
adaugher...@tamu.edu
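Since the patch itself is snipped above, here is a minimal sketch of the VirtualDomain-style approach being described: poll the status in a loop with a short delay instead of failing on the first check. The function names and the stubbed status probe are invented for illustration; this is not the actual patch.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Stub standing in for the RA's real status probe of the migrated domain:
# here it only reports "running" from the third poll onward, mimicking a
# domain whose status hasn't settled yet right after migration.
Xen_Status() {
    STATUS_CALLS=$((STATUS_CALLS + 1))
    [ "$STATUS_CALLS" -ge 3 ]
}

# VirtualDomain-style migrate_from: retry the status check with a delay
# rather than declaring failure on the first miss.
migrate_from_with_retry() {
    retries=0
    while ! Xen_Status; do
        retries=$((retries + 1))
        [ "$retries" -ge 10 ] && return "$OCF_ERR_GENERIC"
        sleep 1
    done
    return "$OCF_SUCCESS"
}

STATUS_CALLS=0
migrate_from_with_retry && echo "migrate_from: domain settled after $STATUS_CALLS checks"
```

With the single-check behavior, the same scenario would return OCF_ERR_GENERIC on the first poll and trigger the unnecessary stop/start described above.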
[Pacemaker] RHEL6 / Scientific Linux 6: cluster-glue no longer includes stonith agents?
Hello,

I'm trying to replicate a cluster I initially built for testing on CentOS 5.6, but with the fresher packages that come along with a 6.x release. CentOS is still playing catch-up, so their 6.0 pacemaker packages are a bit older. Based on that, I figured I'd try Scientific Linux 6.1 since it's the closest to current I can get being non-licensed for RHEL. My previous iteration of the cluster was heartbeat/pacemaker, but since pacemaker is now included as part of the stock repos I figured I'd stick with those only, which means cman/corosync/pacemaker.

I have things working pretty much as I want, but still with stonith disabled, because I have no stonith agents available in pacemaker at all. If I do "crm ra list stonith", it comes back empty (whereas on 5.6 I have numerous agents to choose from). Looking at the older 5.6 nodes, I can see that the stonith agents mostly (or all?) came from the cluster-glue package, which I have installed on the new cluster nodes as well. The newer cluster-glue packages just don't contain any stonith agents.

Am I approaching this incorrectly, and I'm supposed to handle fencing in corosync (which does have tons of stonith bits installed with it)? If so, how do I tell pacemaker to fire off a fencing task via corosync? Am I supposed to configure pacemaker to use corosync's stonith commands?

Any pointers appreciated,
Mark
Re: [Pacemaker] RHEL6 / Scientific Linux 6: cluster-glue no longer includes stonith agents?
On 08/23/2011 06:19 PM, mark - pacemaker list wrote:
[snip]

In EL6, the Pacemaker and RGManager resource and fence (stonith) agents are being merged. The package you want to install is 'fence-agents' (and 'resource-agents' for the resource management scripts).

--
Digimer
E-Mail: digi...@alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?"
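Putting that advice into commands, the steps on such an EL6 node would look something like the following. The package names come from the reply above; the verification commands are pacemaker 1.1-era tooling, shown as an illustrative admin fragment:

```shell
# Install the merged agent packages named above.
yum install -y fence-agents resource-agents

# Verify that pacemaker can now see stonith agents:
crm ra list stonith              # should no longer come back empty
stonith_admin --list-installed   # lists installed fence agents via stonithd
```

Once agents appear, stonith can be re-enabled in the cluster options and the fencing devices configured as ordinary stonith-class primitives in pacemaker (no corosync-side fencing configuration is needed for this).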