Re: [Pacemaker] [ha-wg-technical] [RFC] Organizing HA Summit 2015
Hi all,

Really late response, but I will be joining the HA summit with a few colleagues from NTT.

See you guys in Brno,

Thanks,

2014-12-08 22:36 GMT+09:00 Jan Pokorný jpoko...@redhat.com:
> Hello,
>
> it occurred to me that if you want to use the opportunity and double as a tourist while being in Brno, it's about the right time to consider reservations/ticket purchases this early. At least in some cases it is a must, e.g., Villa Tugendhat:
>
> http://rezervace.spilberk.cz/langchange.aspx?mrsname=languageId=2returnUrl=%2Flist
>
> On 08/09/14 12:30 +0200, Fabio M. Di Nitto wrote:
>> DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices.
>>
>> My suggestion would be to have a 2 days dedicated HA summit the 4th and the 5th of February.
>
> --
> Jan
>
> ___
> ha-wg-technical mailing list
> ha-wg-techni...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/ha-wg-technical

--
Keisuke MORI

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] PgSQL_Replicated_Cluster wiki's bug
Anzai-san,

Thank you very much for pointing it out. Yes, you are right.
I have updated the wiki page, so it should be fixed now.

Regards,
Keisuke MORI

2014-04-08 11:15 GMT+09:00 Naoya Anzai anzai-na...@mxu.nes.nec.co.jp:
> Hi, all
>
> I'm reading the following wiki page:
> http://clusterlabs.org/mwiki/index.php?title=PgSQL_Replicated_Clustersetlang=ja#Pacemaker_.28both_nodes.29
>
> In "Sample configuration for pcs command", it says as follows:
> ---
> pcs -f pgsql_cfg resource op defaults resource-stickiness=INFINITY
> pcs -f pgsql_cfg resource op defaults migration-threshold=1
> ---
>
> But it doesn't work correctly, right?
> I think the "op" keyword is unnecessary in these commands.
>
> The correct commands are:
> ---
> pcs -f pgsql_cfg resource defaults resource-stickiness=INFINITY
> pcs -f pgsql_cfg resource defaults migration-threshold=1
> ---
>
> Regards,
> Naoya
>
> ---
> Naoya Anzai
> Engineering Department
> NEC Solution Inovetors, Ltd.
> E-Mail: anzai-na...@mxu.nes.nec.co.jp
> ---

--
Keisuke MORI
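The difference pointed out in this thread can be sketched as a small shell function that also supports a dry run. This is only an illustration: the `PCS` and `CIB_FILE` variables and the function name `set_resource_defaults` are hypothetical, while the two `pcs` command lines are the corrected ones quoted above. Newer pcs releases may expect a different syntax for `resource defaults`, so the help output of the installed version should be checked first.

```shell
#!/bin/sh
# Hypothetical helper variables: override PCS (e.g. PCS=echo) for a dry run
# that only prints the commands instead of touching the CIB file.
PCS=${PCS:-pcs}
CIB_FILE=${CIB_FILE:-pgsql_cfg}

# "resource defaults" sets defaults for resource meta attributes such as
# resource-stickiness and migration-threshold. "resource op defaults" would
# set defaults for operations instead, which is not what the wiki intended.
set_resource_defaults() {
    $PCS -f "$CIB_FILE" resource defaults resource-stickiness=INFINITY
    $PCS -f "$CIB_FILE" resource defaults migration-threshold=1
}
```

Calling `set_resource_defaults` with `PCS=echo` prints the two commands, which is a cheap way to review them before running them against a real CIB file.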
Re: [Pacemaker] pre_notify_demote is issued twice
Hi,

2014-02-24 10:49 GMT+09:00 Andrew Beekhof and...@beekhof.net:
> On 21 Feb 2014, at 2:19 pm, Andrew Beekhof and...@beekhof.net wrote:
>> On 18 Feb 2014, at 1:23 pm, Andrew Beekhof and...@beekhof.net wrote:
>>> On 6 Feb 2014, at 7:45 pm, Keisuke MORI keisuke.mori...@gmail.com wrote:
>>>> Hi,
>>>>
>>>> I observed that pre_notify_demote is issued twice when a master resource is migrating.
>>>> I'm wondering if this is the correct behavior.
>>>>
>>>> Steps to reproduce:
>>>>
>>>> - Start up a 2-node cluster configured for PostgreSQL streaming replication using the pgsql RA as a master/slave resource.
>>>> - Kill the postgresql process on the master node to induce a fail-over.
>>>> - The fail-over succeeds as expected, but pre_notify_demote is executed twice on each node before the master resource is demoted.
>>>>
>>>> 100% reproducible on my cluster.
>>>>
>>>> Pacemaker version: 1.1.11-rc4 (source build from the repo)
>>>> OS: RHEL6.4
>>>>
>>>> I have never seen this on a Pacemaker-1.0.* cluster with the same configuration.
>>>>
>>>> The relevant logs and pe-inputs are attached.
>>>>
>>>> Diagnostics:
>>>>
>>>> (1) The first transition, caused by the process failure (pe-input-160), initiates pre_notify_demote on both nodes and cancels the slave monitor on the slave node.
>>>>
>>>> {{{
>>>> 171 Jan 30 16:08:59 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 9: cancel prmPostgresql_cancel_1 on rhel64-2
>>>> 172 Jan 30 16:08:59 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 79: notify prmPostgresql_pre_notify_demote_0 on rhel64-1 (local)
>>>> 175 Jan 30 16:08:59 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 81: notify prmPostgresql_pre_notify_demote_0 on rhel64-2
>>>> }}}
>>>>
>>>> (2) When cancelling the slave monitor completes, the transition is aborted by "Resource op removal".
>>>>
>>>> {{{
>>>> 176 Jan 30 16:08:59 rhel64-1 crmd[8143]: info: match_graph_event: Action prmPostgresql_monitor_1 (9) confirmed on rhel64-2 (rc=0)
>>>> 177 Jan 30 16:08:59 rhel64-1 cib[8138]: info: cib_process_request: Completed cib_delete operation for section status: OK (rc=0, origin=rhel64-2/crmd/21, version=0.37.9)
>>>> 178 Jan 30 16:08:59 rhel64-1 crmd[8143]: info: abort_transition_graph: te_update_diff:258 - Triggered transition abort (complete=0, node=rhel64-2, tag=lrm_rsc_op, id=prmPostgresql_monitor_1, magic=0:0;26:12:0:acf9a2a3-307c-460b-b786-fc20e6b8aad5, cib=0.37.9) : Resource op removal
>>>> }}}
>>>>
>>>> (3) The second transition (pe-input-161) is calculated because of the abort, which results in initiating pre_notify_demote again.
>>>
>>> If the demote didn't complete (or wasn't even attempted), then we must send the pre_notify_demote again unfortunately.
>>> The real bug may well be that the transition shouldn't have been aborted.
>>
>> It looks legitimate:
>>
>> Jan 30 16:08:59 rhel64-1 crmd[8143]: info: abort_transition_graph: te_update_diff:258 - Triggered transition abort (complete=0, node=rhel64-2, tag=lrm_rsc_op, id=prmPostgresql_monitor_1, magic=0:0;26:12:0:acf9a2a3-307c-460b-b786-fc20e6b8aad5, cib=0.37.9) : Resource op removal
>
> It looks like get_cancel_action() was not functioning correctly:
>
> https://github.com/beekhof/pacemaker/commit/9d77c99

Thanks for looking into it.
I have confirmed that the issue is now resolved with the recent revision on your repo, at:
https://github.com/beekhof/pacemaker/commit/04ff1bd2d144e7defd6f1f67f6bde6fa95c428e1

Thanks!

--
Keisuke MORI

>> Jan 30 16:08:59 rhel64-1 cib[8138]: info: cib_process_request: Completed cib_delete operation for section status: OK (rc=0, origin=rhel64-2/crmd/21, version=0.37.9)
>>
>> It looks like part of the node status entry is being removed for rhel64-2.
>> Possibly as a result of:
>>
>> Jan 30 16:07:54 rhel64-2 crmd[25070]: info: erase_status_tag: Deleting xpath: //node_state[@uname='rhel64-2']/transient_attributes
>>
>> The new cib code, being much faster, might help here too :)
>>
>>>> {{{
>>>> 227 Jan 30 16:09:01 rhel64-1 pengine[8142]: notice: process_pe_message: Calculated Transition 15: /var/lib/pacemaker/pengine/pe-input-161.bz2
>>>> 229 Jan 30 16:09:01 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 78: notify prmPostgresql_pre_notify_demote_0 on rhel64-1 (local)
>>>> 232 Jan 30 16:09:01 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 80: notify prmPostgresql_pre_notify_demote_0 on rhel64-2
>>>> }}}
>>>>
>>>> I think that the transition abort at (2) should not happen.
>>>>
>>>> Regards,
>>>> --
>>>> Keisuke MORI
>>>>
>>>> logs-pre-notify-20140206.tar.bz2
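To check whether a given log shows the duplicated notification described in this thread, the number of initiated pre_notify_demote actions per node can be pulled out of the crmd log. The following is a minimal sketch, assuming the `te_rsc_command: Initiating action ...` log format quoted in the messages above; the function name `count_pre_notify_demote` is made up for illustration and is not part of Pacemaker.

```shell
#!/bin/sh
# count_pre_notify_demote LOGFILE
# Prints "<count> <node>" for every node on which crmd initiated a
# pre_notify_demote notification, based on te_rsc_command log lines.
# Assumption: the log uses the Pacemaker 1.1.11-era format shown in the thread.
count_pre_notify_demote() {
    grep 'te_rsc_command: Initiating action' "$1" \
        | grep '_pre_notify_demote_0 on ' \
        | sed -e 's/.*_pre_notify_demote_0 on \([^ ]*\).*/\1/' \
        | sort | uniq -c | awk '{ print $1, $2 }'
}
```

The counts cover the whole file, so the log should be narrowed to the time window of a single fail-over before drawing conclusions from a count greater than one.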
Re: [Pacemaker] Time to get ready for 1.1.11
: crm_report: Suppress logging errors after the target directory has been compressed
Fix: crm_attribute: Do not swallow hostname lookup failures
Fix: crmd: Avoid deleting the 'shutdown' attribute
Log: attrd: Quote attribute names
Doc: Pacemaker_Explained: Fix formatting

A new release candidate for pacemaker 1.1.11 is now available, pacemaker-1.1.11-rc4.

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc4

The lrmd crash has finally been resolved. That was the last fix we were waiting on before officially releasing Pacemaker v1.1.11. RC4 is likely going to be the 1.1.11 final release. Please test and report any regressions as soon as possible. 1.1.11 will be released mid next week if no major issues are encountered.

CHANGES RC3 to RC4

Fix: stonith_admin: Ensure pointers passed to sscanf() are properly initialized
Fix: Prevent potential use-of-NULL
Fix: upstart: Allow compilation with glib versions older than 2.28
Fix: services: Fixes segfault associated with cancelling in-flight recurring operations.
Low: crmd: Change the default value of node-action-limit

Thanks,
-- Vossel

--
Keisuke MORI
Re: [Pacemaker] Time to get ready for 1.1.11
Hi Andrew,

2014/1/16 Andrew Beekhof and...@beekhof.net:
> On 16 Jan 2014, at 3:00 pm, Keisuke MORI keisuke.mori...@gmail.com wrote:
>> Hi,
>>
>> Just curious, I found that RC4 has been branched out of the master after RC3.
>> What will happen to the fixes that are only in the master branch?
>> Are they going to be merged into 1.1.12 someday, just skipping 1.1.11?
>
> Yes.
> Unlike last time, we're trying to be better about not merging new features and other risky changes during the RC phase :-)
> The branching should have happened earlier but I forgot.
>
>> Or are they separated for the next major version such as v1.2 or v2.0?
>
> I think our plans for 1.2/2.0 are on hold indefinitely.
> It's all 1.1.x releases for the foreseeable future.
> For a small dev team, the benefits didn't outweigh the costs.

Thank you for the answer. That makes things clear to me.

Regards,

>> Thanks,
>>
>> 2014/1/16 David Vossel dvos...@redhat.com:
>>> - Original Message -
>>> From: David Vossel dvos...@redhat.com
>>> To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
>>> Sent: Tuesday, January 7, 2014 4:50:11 PM
>>> Subject: Re: [Pacemaker] Time to get ready for 1.1.11
>>>
>>> - Original Message -
>>> From: Andrew Beekhof and...@beekhof.net
>>> To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
>>> Sent: Thursday, December 19, 2013 2:25:00 PM
>>> Subject: Re: [Pacemaker] Time to get ready for 1.1.11
>>>
>>> On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote:
>>>
>>> David/Andrew,
>>>
>>> Once 1.1.11 final is released, is it considered the new stable series of Pacemaker,
>>>
>>> yes
>>>
>>> or should 1.1.10 still be used in very stable/critical production environments?
>>>
>>> Thanks,
>>> Andrew
>>>
>>> - Original Message -
>>> From: David Vossel dvos...@redhat.com
>>> To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
>>> Sent: Wednesday, December 11, 2013 3:33:46 PM
>>> Subject: Re: [Pacemaker] Time to get ready for 1.1.11
>>>
>>> - Original Message -
>>> From: Andrew Beekhof and...@beekhof.net
>>> To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
>>> Sent: Wednesday, November 20, 2013 9:02:40 PM
>>> Subject: [Pacemaker] Time to get ready for 1.1.11
>>>
>>> With over 400 updates since the release of 1.1.10, it's time to start thinking about a new release.
>>>
>>> Today I have tagged release candidate 1[1].
>>> The most notable fixes include:
>>>
>>> + attrd: Implementation of a truly atomic attrd for use with corosync 2.x
>>> + cib: Allow values to be added/updated and removed in a single update
>>> + cib: Support XML comments in diffs
>>> + Core: Allow blackbox logging to be disabled with SIGUSR2
>>> + crmd: Do not block on proxied calls from pacemaker_remoted
>>> + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
>>> + crmd: Use the load on our peers to know how many jobs to send them
>>> + crm_mon: add --hide-headers option to hide all headers
>>> + crm_report: Collect logs directly from journald if available
>>> + Fencing: On timeout, clean up the agent's entire process group
>>> + Fencing: Support agents that need the host to be unfenced at startup
>>> + ipc: Raise the default buffer size to 128k
>>> + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
>>> + PE: Allow location constraints to take a regex pattern to match against resource IDs
>>> + pengine: Distinguish between the agent being missing and something the agent needs being missing
>>> + remote: Properly version the remote connection protocol
>>> + services: Detect missing agents and permission errors before forking
>>> + Bug cl#5171 - pengine: Don't prevent clones from running due to dependent resources
>>> + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known
>>> + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers
>>>
>>> If you are a user of `pacemaker_remoted`, you should take the time to read about changes to the online wire protocol[2] that are present in this release.
>>>
>>> [1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
>>> [2] http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/
>>>
>>> To build `rpm` packages for testing:
>>>
>>> 1. Clone the current sources:
>>>
>>>    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
>>>    # cd pacemaker
>>>
>>> 2. If you haven't already, install Pacemaker's dependencies
>>>
>>>    [Fedora] # sudo yum install -y yum-utils
>>>    [ALL] # make rpm-dep
>>>
>>> 3. Build Pacemaker
>>>
>>>    # make rc
>>>
>>> 4. Copy the rpms and deploy as needed
>>>
>>> A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing.
>>>
>>> https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2
>>>
>>> Assuming no major regressions are encountered during testing, this tag will become the final Pacemaker-1.1.11 release a week from today.
>>>
>>> -- Vossel
>>>
>>> Alright, New RC time. Pacemaker-1.1.11-rc3.
>>>
>>> If no regressions are encountered, rc3 will become the 1.1.11 final release a week from today
Re: [Pacemaker] [Problem][crmsh]The designation of the 'ordered' attribute becomes the error.
Hi Dejan, Andreas, Yamauchi-san

2013/4/18 renayama19661...@ybb.ne.jp
> Hi Dejan,
> Hi Andreas,
>
>> The shell in pacemaker v1.0.x is in maintenance mode and shipped along with the pacemaker code.
>> The v1.1.x doesn't have the ordered and collocated meta attributes.
>
> I sent the pull request of the patch which Mr. Dejan donated.
> * https://github.com/ClusterLabs/pacemaker-1.0/pull/14

The patch for crmsh is now included in the 1.0.x repository:

https://github.com/ClusterLabs/pacemaker-1.0/commit/9227e89fb748cd52d330f5fca80d56fbd9d3efbf

It will appear in the 1.0.14 maintenance release, which is not scheduled yet though.

Thanks,

Keisuke MORI

> Many Thanks!
> Hideo Yamauchi.
>
> --- On Tue, 2013/4/2, Dejan Muhamedagic deja...@fastmail.fm wrote:
>> Hi,
>>
>> On Mon, Apr 01, 2013 at 09:19:51PM +0200, Andreas Kurz wrote:
>>> Hi Dejan,
>>>
>>> On 2013-03-06 11:59, Dejan Muhamedagic wrote:
>>>> Hi Hideo-san,
>>>>
>>>> On Wed, Mar 06, 2013 at 10:37:44AM +0900, renayama19661...@ybb.ne.jp wrote:
>>>>> Hi Dejan,
>>>>> Hi Andrew,
>>>>>
>>>>> As for the crm shell, the check of the meta attribute was revised with the following patch.
>>>>>
>>>>> * http://hg.savannah.gnu.org/hgweb/crmsh/rev/d1174f42f4b3
>>>>>
>>>>> This patch was backported to Pacemaker 1.0.13.
>>>>>
>>>>> * https://github.com/ClusterLabs/pacemaker-1.0/commit/fa1a99ab36e0ed015f1bcbbb28f7db962a9d1abc#shell/modules/cibconfig.py
>>>>>
>>>>> However, the ordered and colocated attributes of the group resource are treated as an error when I use a crm shell that includes this patch.
>>>>>
>>>>> --
>>>>> (snip)
>>>>> ### Group Configuration ###
>>>>> group master-group \
>>>>>     vip-master \
>>>>>     vip-rep \
>>>>>     meta \
>>>>>         ordered=false
>>>>> (snip)
>>>>>
>>>>> [root@rh63-heartbeat1 ~]# crm configure load update test2339.crm
>>>>> INFO: building help index
>>>>> crm_verify[20028]: 2013/03/06_17:57:18 WARN: unpack_nodes: Blind faith: not fencing unseen nodes
>>>>> WARNING: vip-master: specified timeout 60s for start is smaller than the advised 90
>>>>> WARNING: vip-master: specified timeout 60s for stop is smaller than the advised 100
>>>>> WARNING: vip-rep: specified timeout 60s for start is smaller than the advised 90
>>>>> WARNING: vip-rep: specified timeout 60s for stop is smaller than the advised 100
>>>>> ERROR: master-group: attribute ordered does not exist - WHY?
>>>>> Do you still want to commit? y
>>>>> --
>>>>>
>>>>> If I answer `yes` to the confirmation message, the configuration is applied, but it is a problem that the error message is displayed.
>>>>> * The error occurs in the same way when I specify the colocated attribute.
>>>>>
>>>>> And I noticed that there is no explanation of the ordered and colocated attributes of the group resource in the online help of Pacemaker.
>>>>>
>>>>> I think that specifying the ordered or colocated attribute on a group resource should not be treated as an error.
>>>>> In addition, I think that ordered and colocated should be added to the online help.
>>>>
>>>> These attributes are not listed in crmsh. Does the attached patch help?
>>>
>>> Dejan, will this patch for the missing ordered and collocated group meta-attribute be included in the next crmsh release? ... can't see the patch in the current tip.
>>
>> The shell in pacemaker v1.0.x is in maintenance mode and shipped along with the pacemaker code.
>> The v1.1.x doesn't have the ordered and collocated meta attributes.
>>
>> Thanks,
>>
>> Dejan
>>
>>> Thanks
>>> Regards,
>>> Andreas
>>>
>>>> Thanks,
>>>>
>>>> Dejan
>>>>
>>>>> Best Regards,
>>>>> Hideo Yamauchi.
Re: [Pacemaker] Compilation problem in centos 6.3
Hi,

I've seen a similar problem.
It was caused by an unseen escape sequence produced by the crm shell (the readline library in particular) when TERM=xterm.

Try export TERM=vt100 and rebuild it.
Or grab the latest crm shell.

2012/9/4 Miguel Angel Guerrero miguel.guerr...@itac.com.co
> Hi all
>
> I'm trying to compile the latest version of the pacemaker src package from Red Hat on CentOS 6.3 with SNMP support, but I get this error during the build and I don't understand it. I checked the dependencies and they are all correct. Thanks for your help.
>
> Processing file tmp/en-US/xml_tmp/Ch-Tools.xml -> tmp/en-US/xml/Ch-Tools.xml
>
> not well-formed (invalid token) at line 20, column 8, byte 884:
> <para>Take some time to familiarize yourself with what it can do.</para>
> <para># <command>crm --help</command></para>
> <screen>
> ===^
> usage: crm [-D display_type] [-f file] [-hF] [args]
>  at /usr/lib64/perl5/XML/Parser.pm line 187
> gmake[1]: *** [Clusters_from_Scratch.txt] Error 255
> gmake[1]: Leaving directory `/home/itac/rpmbuild/BUILD/ClusterLabs-pacemaker-148fccf/doc'
> make: *** [all-recursive] Error 1
> error: Bad exit status from /var/tmp/rpm-tmp.yHX51k (%build)
>
> RPM build errors:
>     InstallSourcePackage at: psm.c:244: Header V3 RSA/SHA256 Signature, key ID fd431d51: NOKEY
>     Bad exit status from /var/tmp/rpm-tmp.yHX51k (%build)
>
> --
> www.itac.com.co
> Miguel Angel Guerrero
> Infrastructure Engineer
> ITAC - IT Applications Consulting
> Avenida 19 # 114 – 65 Oficina 215
> Bogota, DC. Colombia
> Telephone (+571) 6400338 Ext. 147
> miguel.guerr...@itac.com.co
>
> Our customers are part of the quality we provide, so if you have a complaint, claim or suggestion, please let us know at cali...@itac.com.co

--
Keisuke MORI
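One way to confirm the diagnosis in the reply above is to capture the help text that the documentation build embeds and look for raw escape bytes in it. The sketch below is an illustration under assumptions: the function name `has_escape_sequences` and the temporary file path are hypothetical, and the check only looks for the ESC byte (0x1b), not for any particular terminal sequence.

```shell
#!/bin/sh
# has_escape_sequences FILE
# Succeeds (exit status 0) if FILE contains at least one ESC (0x1b) byte,
# which is not allowed in well-formed XML 1.0 character data.
has_escape_sequences() {
    esc=$(printf '\033')
    grep -q "$esc" "$1"
}

# Possible usage (not executed here), comparing both terminal settings:
#   TERM=xterm crm --help > /tmp/crm-help.txt 2>&1
#   has_escape_sequences /tmp/crm-help.txt && echo "escape bytes found"
#   TERM=vt100 crm --help > /tmp/crm-help.txt 2>&1
#   has_escape_sequences /tmp/crm-help.txt || echo "clean output"
```

If the output captured with TERM=xterm contains escape bytes and the TERM=vt100 output does not, exporting TERM=vt100 before rebuilding, as suggested in the reply, is a reasonable workaround.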
[Pacemaker] LF#2605 pingd misdetection of failure when kernel.pid_max 65536
Hi,

I've filed an issue in Bugzilla.
Please find the details and the patch there.

LF#2605 pingd misdetection of failure when kernel.pid_max 65536
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2605

Thanks,
--
Keisuke MORI

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
[Pacemaker] HealthSMART RA fix devices option
Hi,

The HealthSMART RA does not seem to work properly when the 'devices' option is specified.
I suggest the attached patch.

Thanks,
--
Keisuke MORI

HealthSMART: fix devices option

diff -r cf4e9febed8e extra/resources/HealthSMART
--- a/extra/resources/HealthSMART	Wed Feb 23 14:52:34 2011 +0100
+++ b/extra/resources/HealthSMART	Wed Mar 02 18:45:22 2011 +0900
@@ -254,13 +254,13 @@ HealthSMART_monitor() {
     # Check drive temperature(s)
     if [ ${OCF_RESKEY_devices} ]; then
         for DEVICE in ${OCF_RESKEY_devices}; do
-            check_temperature `$SMARTCTL $DEVICE -A ${DRIVE} | awk '/^194/ { print $10 }'`
+            check_temperature `$SMARTCTL -d $DEVICE -A ${DRIVE} | awk '/^194/ { print $10 }'`
             if [ $? != 0 ]; then
                 return $OCF_SUCCESS
             fi
         done
     else
-        check_temperature `$SMARTCTL $DEVICE -A ${DRIVE} | awk '/^194/ { print $10 }'`
+        check_temperature `$SMARTCTL -A ${DRIVE} | awk '/^194/ { print $10 }'`
         if [ $? != 0 ]; then
             return $OCF_SUCCESS
         fi
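The `awk '/^194/ { print $10 }'` filter used on both sides of the patch can be tried in isolation. The sketch below wraps it in a function; the name `extract_temperature` is hypothetical and does not appear in the resource agent, and the attribute line used to exercise it would have to be supplied by the reader or by a real `smartctl -A` run.

```shell
#!/bin/sh
# extract_temperature
# Reads "smartctl -A"-style attribute output on stdin and prints the tenth
# whitespace-separated column (the raw value) of every line that starts
# with attribute ID 194, which is the filter used by the patch above.
extract_temperature() {
    awk '/^194/ { print $10 }'
}

# Possible usage with the variables from the patch (not executed here):
#   $SMARTCTL -d $DEVICE -A ${DRIVE} | extract_temperature
```

Because the filter is anchored with `^194`, it only matches when the attribute ID starts in the first column, so attribute lines for other IDs are ignored.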
Re: [Pacemaker] Pacemaker-1.1.4, when?
Hi Andrew,

2010/11/15 Andrew Beekhof and...@beekhof.net:
> If someone can fix the patch so that the regression tests pass I'll apply it, but I won't have any time to work on it for at least a few weeks.

I've been trying to write a patch for this, and it almost works fine, but I found that it is very hard to make it 100% compatible with the latest glib2 because of the difference in the implementation of GHashTable between glib2-2.12 (RHEL5) and glib2-2.26.

The attached patch almost works well, except that the regression tests fail on 3 items, all of them utilization test cases.
(the patch and the failed diff are attached)

Test utilization-order1: Utilization Order - Simple
* FAILED: xml-file changed
Test utilization-order2: Utilization Order - Complex
* FAILED: xml-file changed
Test utilization-order3: Utilization Order - Migrate
* FAILED: xml-file changed
* ERROR: Results of 3 failed tests (out of 293) are in ./.regression.failed.diff

The only difference seems to be the processing order of the nodes in pengine/native.c:native_internal_constraints().

This difference arises because GHashTable is implemented differently: glib2-2.12 uses a linked list for iteration, while glib2-2.26 no longer uses a linked list and just iterates over an array.

I think that the possible options that we can take are:

1) Apply the patch and just ignore the errors on RHEL5
   - as long as they're considered harmless.
2) Sort the node list when creating the graph of the utilization
   - although it may cause another performance penalty.
3) Revert to using GList for the node list
   - if the node lookup is not the major factor of the performance issue.

It's all up to you. Hope it helps.

Thanks,

Keisuke MORI

> On Mon, Nov 15, 2010 at 2:59 AM, nozawat noza...@gmail.com wrote:
>> Hi Andrew and Nikola,
>>
>> I ran the regression tests myself, too, and got the same errors.
>>
>> Regards,
>> Tomo
>>
>> 2010/11/12 Nikola Ciprich extmaill...@linuxbox.cz
>>> (resent)
>>> 1.1.4 with new glib2: tests pass smoothly
>>> 1.1.4 + patch and older glib2 - all tests are segfaulting...
>>> ie:
>>> Program terminated with signal 11, Segmentation fault.
>>> #0 IA__g_str_hash (v=0x0) at gstring.c:95
>>> 95 guint32 h = *p;
>>> (gdb) bt
>>> #0 IA__g_str_hash (v=0x0) at gstring.c:95
>>> #1 0x7fe087bb6128 in g_hash_table_lookup_node (hash_table=0x1390ec0, key=0x0, value=0x13a3b00) at ghash.c:231
>>> #2 IA__g_hash_table_insert (hash_table=0x1390ec0, key=0x0, value=0x13a3b00) at ghash.c:336
>>> #3 0x7fe089367953 in convert_graph_action (resource=0x13a30a0, action=0x139cb80, status=0, rc=7) at unpack.c:308
>>> #4 0x0040362a in exec_rsc_action (graph=0x1394fa0, action=0x139cb80) at crm_inject.c:359
>>> #5 0x7fe089368642 in initiate_action (graph=0x1394fa0, action=0x139cb80) at graph.c:172
>>> #6 0x7fe08936899d in fire_synapse (graph=0x1394fa0, synapse=0x139ba60) at graph.c:204
>>> #7 0x7fe089368dbd in run_graph (graph=0x1394fa0) at graph.c:262
>>> #8 0x0040428f in run_simulation (data_set=0x7fff712280a0) at crm_inject.c:540
>>> #9 0x0040632a in main (argc=9, argv=0x7fff71228308) at crm_inject.c:1148

--
Keisuke MORI

# HG changeset patch
# User Keisuke MORI kskm...@intellilink.co.jp
# Date 1290657182 -32400
# Node ID d0b7749d477fe9048c2edd877c07a411282540e5
# Parent 6407a7137b5748d6375083f0be843c198b3d95d2
[mq]: glib2.patch

diff -r 6407a7137b57 -r d0b7749d477f configure.ac
--- a/configure.ac	Fri Nov 19 18:19:03 2010 +0100
+++ b/configure.ac	Thu Nov 25 12:53:02 2010 +0900
@@ -654,7 +654,7 @@
 AC_MSG_RESULT(using $GLIBCONFIG)
 AC_CHECK_LIB(glib-2.0, g_hash_table_get_values)
 if test x$ac_cv_lib_glib_2_0_g_hash_table_get_values != xyes; then
-   AC_MSG_ERROR(Your version of Glib is too old, you need at least 2.14)
+   AC_MSG_WARN(Your version of Glib is too old, you should have at least 2.14)
 fi
 
 #
diff -r 6407a7137b57 -r d0b7749d477f include/crm/common/util.h
--- a/include/crm/common/util.h	Fri Nov 19 18:19:03 2010 +0100
+++ b/include/crm/common/util.h	Thu Nov 25 12:53:02 2010 +0900
@@ -298,4 +298,69 @@
 extern int node_score_infinity;
 extern xmlNode *create_operation_update(xmlNode *parent, lrm_op_t *op, const char *caller_version, int target_rc, const char *origin, int level);
 extern void free_lrm_op(lrm_op_t *op);
+#if HAVE_LIBGLIB_2_0
+
+#else
+
+typedef struct fake_ghi
+{
+    GHashTable *hash;
+    int nth; /* current index over the iteration */
+    int lpc; /* internal loop counter inside g_hash_table_find */
+    gpointer key;
+    gpointer value;
+} GHashTableIter;
+
+static inline void g_hash_prepend_value(gpointer key, gpointer value, gpointer user_data)
+{
+    GList **values = (GList **)user_data;
+    *values = g_list_prepend(*values, value);
+}
+
+static inline GList *g_hash_table_get_values(GHashTable *hash_table)
+{
+    GList *values
Re: [Pacemaker] Project updates
2010/11/18 Andrew Beekhof and...@beekhof.net:
> On Tue, Nov 16, 2010 at 9:13 AM, Andrew Beekhof and...@beekhof.net wrote:
>> On Tue, Nov 16, 2010 at 8:07 AM, Keisuke MORI keisuke.mori...@gmail.com wrote:
>>>>>  STABLE_SERIES = stable-1.0
>>>>>  RPM_ROOT = $(shell pwd)
>>>>> diff -r 99f5a1e61667 configure.ac
>>>>> --- a/configure.ac	Fri Nov 12 09:12:32 2010 +0100
>>>>> +++ b/configure.ac	Fri Nov 12 11:47:28 2010 -0500
>>>>> @@ -19,7 +19,7 @@
>>>>>  dnl checks for library functions
>>>>>  dnl checks for system services
>>>>>
>>>>> -AC_INIT(pacemaker, 1.0.9, pacemaker@oss.clusterlabs.org)
>>>>> +AC_INIT(pacemaker, 1.0.10, pacemaker@oss.clusterlabs.org)
>>>>
>>>> that's kinda annoying but not crucial. thanks for pointing it out
>>>
>>> This would make it confusing for users to tell which version they're actually using when they report a problem, because all the logs and the crm_mon output show the version as 1.0.9.
>>>
>>> Any chance of another release of the RPMs with this fix?
>>
>> Oh, I forgot about crm_mon.
>> I'll see what I can do.
>
> Happily a change to the spec file is all that was needed.
> New rpms should be available for all platforms

Great!
It would greatly help me and my customers, too.

Thanks,
--
Keisuke MORI
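The mismatch discussed in this thread can be spotted before packaging by reading the version straight out of `configure.ac` and comparing it with the intended release number. This is a minimal sketch, assuming the unbracketed `AC_INIT(name, version, address)` form shown in the diff above; the function name `ac_init_version` is hypothetical.

```shell
#!/bin/sh
# ac_init_version FILE
# Prints the second argument (the version) of the first AC_INIT(...) line
# in FILE. Assumption: arguments are not wrapped in m4 [] quotes, as in the
# configure.ac fragment quoted in this thread.
ac_init_version() {
    sed -n 's/^AC_INIT([^,]*,[[:space:]]*\([^,)]*\).*/\1/p' "$1" | head -n 1
}

# Possible usage (not executed here):
#   ac_init_version configure.ac
```

The printed value can then be compared with the version reported in the logs and by `crm_mon`, which is where the thread says the stale 1.0.9 string showed up.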
[Pacemaker] [FYI] Required errata for RHEL5
Hi all,

This is for the information of Red Hat users.

As a conclusion of testing in my company, we consider that the following erratum should be applied on RHEL5.5 or below in order to get Pacemaker to work more stably.

http://rhn.redhat.com/errata/RHBA-2010-0764.html

Due to a malfunction in libxml2-2.6.26, which comes with RHEL5.5, Pacemaker may fail to update the CIB information.

If you see an error log like this, you're hitting this issue:
---
attrd: [16708]: ERROR: attrd_cib_callback: Update -40 for default_ping_set=100 failed: Required data for this CIB API call not found
---

Or you may get an error from a crm_* command like this:
---
# crm_standby -U node1 -v off
Please choose from one of the matches above and suppy the 'id' with --attr-id
---

In our testing it happened only in limited situations, but technically it can happen at random.

According to the git log, the bug in libxml2 seems to have been fixed already in libxml2-2.6.27 or later.

Related discussions and links:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/61182
https://bugzilla.redhat.com/show_bug.cgi?id=613860
http://git.gnome.org/browse/libxml2/commit/?id=6422d916d929cb8653d950d4b424388a7ea7230d

Acknowledgments:
Many thanks to Hideo Yamauchi and Red Hat support for resolving this issue.

I hope it helps you all.

Thanks,
--
Keisuke MORI
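A quick way to see whether a node has already hit the symptom described above is to count the matching attrd errors in its log. The sketch below is an assumption-laden illustration: the function name `count_cib_update_failures` is hypothetical, and the pattern is taken from the single error line quoted in the message, so other variants of the message would not be counted.

```shell
#!/bin/sh
# count_cib_update_failures LOGFILE
# Prints the number of attrd_cib_callback errors of the form quoted above.
# Note: grep -c prints 0 and returns a non-zero exit status when nothing matches.
count_cib_update_failures() {
    grep -c 'attrd_cib_callback: Update .* failed: Required data for this CIB API call not found' "$1"
}

# Possible usage (not executed here; the log path depends on the installation):
#   count_cib_update_failures /var/log/ha-log
#   rpm -q libxml2
```

A non-zero count together with an unpatched libxml2 package would be consistent with the issue described in the message, though it is not proof on its own.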
Re: [Pacemaker] crm_mon and pingd
No objections. I've pushed the changeset below:
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/53132ed532ea

But it would still be preferable to rely on -A, particularly if you want to use two or more pingd resources or change the attribute name.

I'd also agree with adding the filtering feature as an enhancement.

Thanks,

2010/11/10 Andrew Beekhof and...@beekhof.net:
> Any objections Mori-san?
> Seems like a reasonable change to me.
>
> On Tue, Nov 9, 2010 at 1:26 PM, Vadym Chepkov vchep...@gmail.com wrote:
>> Would it be too much harm to restore the previous behavior at least partially?
>>
>> diff -r 7f2e453eedfa -r ab2da8a98b47 tools/crm_mon.c
>> --- a/tools/crm_mon.c	Mon Nov 08 23:13:17 2010 +0100
>> +++ b/tools/crm_mon.c	Tue Nov 09 07:18:53 2010 -0500
>> @@ -748,6 +748,17 @@
>>      g_list_free(sorted_op_list);
>>  }
>>
>> +static void get_ping_score(node_t *node, pe_working_set_t *data_set)
>> +{
>> +    const char *attr = "pingd";
>> +    const char *value = NULL;
>> +    value = g_hash_table_lookup(node->details->attrs, attr);
>> +
>> +    if(value != NULL) {
>> +        print_as(" %s=%s", attr, value);
>> +    }
>> +}
>> +
>>  static void print_attr_msg(node_t *node, GListPtr rsc_list, const char *attrname, const char *attrvalue)
>>  {
>>      slist_iter(rsc, resource_t, rsc_list, lpc2,
>> @@ -848,6 +859,9 @@
>>      }
>>
>>      print_as("* Node %s: ", crm_element_value(node_state, XML_ATTR_UNAME));
>> +    if(!print_nodes_attr) {
>> +        get_ping_score(node, data_set);
>> +    }
>>      print_as("\n");
>>
>>      lrm_rsc = find_xml_node(node_state, XML_CIB_TAG_LRM, FALSE);
>>
>> Thanks,
>> Vadym

--
Keisuke MORI
Re: [Pacemaker] crm_mon and pingd
Hi Vadym,

Could you provide the output of 'cibadmin -Q' so that we can see what's happening over there?

Thanks,

2010/11/4 Vadym Chepkov <vchep...@gmail.com>:
> Hi,
>
> It seems this patch in pacemaker doesn't work as expected:
>
> changeset: 15672:4d50adc3ccd9
> branch: stable-1.0
> user: Andrew Beekhof <and...@beekhof.net>
> date: Mon May 10 10:26:50 2010 +0200
> summary: Medium: tools: crm_mon - Enable 'connectivity' mode for 'ping' resources too
>
> crm_mon doesn't show the pingd attribute value in "Migration summary:" anymore.
>
> Version: 1.0.9-0a40fd0cb9f2fcedef9d1967115c912314c57438
>
> Thanks,
> Vadym

--
Keisuke MORI
Re: [Pacemaker] About behavior in Action Lost.
2010/10/7 Andrew Beekhof <and...@beekhof.net>:
> On Thu, Oct 7, 2010 at 11:48 AM, Keisuke MORI <keisuke.mori...@gmail.com> wrote:
>> Andrew,
>>
>> 2010/9/23 Andrew Beekhof <and...@beekhof.net>:
>>> Pushed as:
>>> http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>>
>>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>>
>> I would like to backport this to 1.0.
>> Would you agree with this?
>
> I would prefer not to, but if it is important to you then I will agree.

Thank you for your ACK.
It's now in 1.0.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/146e405c1afa

--
Keisuke MORI
Re: [Pacemaker] resource stop timeout broken in 1.0 branch tip
2010/10/9 Andrew Beekhof <and...@beekhof.net>:
> On Fri, Oct 8, 2010 at 4:17 PM, Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
>> On Wed, Oct 06, 2010 at 06:29:12PM +0900, Keisuke MORI wrote:
>>> 2010/10/6 Andrew Beekhof <and...@beekhof.net>:
>>>>> Are there more changesets that need to be backported regarding this issue?
>>>>
>>>> There is now that Andreas brought the problem to my attention :-)
>>>> http://hg.clusterlabs.org/pacemaker/1.1/rev/e097c70226fe
>>>>
>>>>> If not, I think that Andreas' patch should be applied to 1.0.
>>>>> It seems to me that the patch is sane, as it would restore the old behavior
>>>>> for the stop operation while keeping the resource attributes, as the first
>>>>> patch intended.
>>>>
>>>> See the comment in the above patch.
>>>> Andreas' original patch wouldn't have worked if the resource definition changed.
>>>
>>> I see, I will backport this to 1.0 too.

Done.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/0d019d9e9c61

>> May I take the opportunity to point you to
>> http://hg.clusterlabs.org/pacemaker/1.1/rev/3f8df3dfb328
>
> ACK, no objection to this being backported :-)

Also done, along with a minor compilation fix.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/70438ddd4351
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/0a40fd0cb9f2

--
Keisuke MORI
Re: [Pacemaker] About behavior in Action Lost.
Andrew,

2010/9/23 Andrew Beekhof <and...@beekhof.net>:
> Pushed as:
> http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>
> Not sure about applying to 1.0 though, it's a dramatic change in behavior.

I would like to backport this to 1.0.
Would you agree with this?

Without this, the failed node was not fenced when it ought to have been, and the service could not continue.

I also think it would be good to have the same behavior in 1.0 and 1.1 in such a critical condition, so that both versions can be supported better.

Thanks,

Keisuke MORI

> On Wed, Sep 22, 2010 at 11:18 AM, <renayama19661...@ybb.ne.jp> wrote:
>> Hi Andrew,
>>
>> Thank you for your comment.
>>
>>> A long time ago in a galaxy far away, some messaging layers used to
>>> lose quite a few actions, including stops.
>>> About the same time, we decided that fencing because a stop action was
>>> lost wasn't a good idea.
>>>
>>> The rationale was that if the operation eventually completed, it would
>>> end up in the CIB anyway.
>>> And even if it didn't, the PE would continue to try the operation
>>> again until the whole node fell over, at which point it would get shot
>>> anyway.
>>
>> Sorry... I did not know that there had been such a discussion in the old days.
>>
>>> Now, having said that, things have improved since then and perhaps, in
>>> the interest of speeding up recovery in these situations, it is time
>>> to stop treating stop operations differently.
>>> Would you agree?
>>
>> Does that mean you will change it so that an Action Lost on a stop now triggers STONITH?
>> If my understanding is right, I agree too.
>>
>>     if(timer->action->type != action_type_rsc) {
>>         send_update = FALSE;
>>     } else if(safe_str_eq(task, "cancel")) {
>>         /* we dont need to update the CIB with these */
>>         send_update = FALSE;
>>     }
>>     --- delete else if(safe_str_eq(task, "stop")){..} ?
>>
>>     if(send_update) {
>>         /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
>>         cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
>>     }
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>> --- Andrew Beekhof <and...@beekhof.net> wrote:
>>> On Tue, Sep 21, 2010 at 8:59 AM, <renayama19661...@ybb.ne.jp> wrote:
>>>> Hi,
>>>>
>>>> A node was under very high load, and we checked how Pacemaker's monitor behaved.
>>>> An Action Lost occurred in the stop operation after the monitor error occurred.
>>>>
>>>> Sep  8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
>>>> Sep  8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
>>>>
>>>> We think the stop operation did not go well because of the load on the node.
>>>> But can't the nodes execute STONITH?
>>>
>>> A long time ago in a galaxy far away, some messaging layers used to
>>> lose quite a few actions, including stops.
>>> About the same time, we decided that fencing because a stop action was
>>> lost wasn't a good idea.
>>>
>>> The rationale was that if the operation eventually completed, it would
>>> end up in the CIB anyway.
>>> And even if it didn't, the PE would continue to try the operation
>>> again until the whole node fell over, at which point it would get shot
>>> anyway.
>>>
>>> Now, having said that, things have improved since then and perhaps, in
>>> the interest of speeding up recovery in these situations, it is time
>>> to stop treating stop operations differently.
>>> Would you agree?

--
Keisuke MORI
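The change discussed in this thread can be pictured as a small decision function. What follows is only a minimal, self-contained sketch and not the actual crmd code: `enum action_kind`, `lost_action_needs_cib_update()` and the `treat_stop_specially` flag are hypothetical names introduced for illustration. Only the three cases (non-resource action, "cancel", "stop") are taken from the snippet quoted in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the action types of the transition engine. */
enum action_kind { ACTION_RSC, ACTION_PSEUDO, ACTION_CRM };

/*
 * Decide whether a lost (timed-out) action should be written to the CIB
 * as a failure.
 *
 * treat_stop_specially = true  models the old behaviour (lost stops are
 *                              not recorded)
 * treat_stop_specially = false models the behaviour proposed in the
 *                              thread (lost stops are recorded like any
 *                              other lost resource action)
 */
bool lost_action_needs_cib_update(enum action_kind kind, const char *task,
                                  bool treat_stop_specially)
{
    assert(task != NULL);

    if (kind != ACTION_RSC) {
        return false;   /* only resource actions are recorded */
    }
    if (strcmp(task, "cancel") == 0) {
        return false;   /* cancellations never need a CIB update */
    }
    if (treat_stop_specially && strcmp(task, "stop") == 0) {
        return false;   /* old behaviour: the lost stop is left unrecorded */
    }
    return true;        /* record the lost action as a timeout */
}
```

With the flag set to false, a lost stop is reported as a timeout like any other resource action, which is the case the posters expect to end in the node being fenced.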
Re: [Pacemaker] Can somebody please explain pengine's urge to move all resources?
Hi Raoul,

2010/9/28 Andrew Beekhof <and...@beekhof.net>:
> On Tue, Sep 28, 2010 at 11:48 AM, Raoul Bhatia [IPAX] <r.bha...@ipax.at> wrote:
>> On 09/23/2010 09:28 AM, Andrew Beekhof wrote:
>>> The good news is that 1.1.3 doesn't have that behavior.
>>> Let's see how 1.0 goes once all the relevant patches have been backported.
>>
>> thanks for your answer! will those patches make it into 1.0.10 or do you
>> have another eta for this?

This should have been fixed by this changeset:
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/5fe02f48c47b

The patch has already been backported to the 1.0 repository and will be included in 1.0.10.

Could you test with the tip of the 1.0 repository if you have a chance?

Thanks,

> MORI-san from NTT is currently working on the backports.
> we'll delay .10 until he has a chance to complete the process :-)

--
Keisuke MORI
Re: [Pacemaker] resource stop timeout broken in 1.0 branch tip
2010/10/6 Andrew Beekhof <and...@beekhof.net>:
>> Are there more changesets that need to be backported regarding this issue?
>
> There is now that Andreas brought the problem to my attention :-)
> http://hg.clusterlabs.org/pacemaker/1.1/rev/e097c70226fe
>
>> If not, I think that Andreas' patch should be applied to 1.0.
>> It seems to me that the patch is sane, as it would restore the old behavior
>> for the stop operation while keeping the resource attributes, as the first
>> patch intended.
>
> See the comment in the above patch.
> Andreas' original patch wouldn't have worked if the resource definition changed.

I see, I will backport this to 1.0 too.

Thanks,

--
Keisuke MORI
Re: [Pacemaker] resource stop timeout broken in 1.0 branch tip
2010/10/2 Andreas Hofmeister <a...@collax.com>:
> Hi,
>
> it seems to me that patch
> http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/8241f689bf9f
> broke timeouts for stop operations.
>
> The observable effect is that the timeout for stop operations is always
> 125s, regardless of what was specified in the CIB.
>
> Reverting the part of the patch that changes crmd/lrm.c seems to fix the
> problem.
>
> The attached patch reverts the change to crmd/lrm.c and also
> http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/66df1404fdcb
> which dealt with another side effect of the change in crmd/lrm.c.

Hi Andreas,

You're right. I confirmed that the problem exists in the 1.0 tip and that it goes away with your patch.

Andrew,
Doesn't this problem exist in 1.1?
Are there more changesets that need to be backported regarding this issue?

If not, I think that Andreas' patch should be applied to 1.0.
It seems to me that the patch is sane, as it would restore the old behavior for the stop operation while keeping the resource attributes, as the first patch intended.

Thanks,

Keisuke MORI
[Pacemaker] Memory leaks in pacemaker-1.0.8
Hi,

Pacemaker-1.0.8 seems to leak memory.
Please find the details in the Bugzilla entry:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2386

Regards,

--
Keisuke MORI
[Pacemaker] Pseudo RAs do not work properly on Corosync stack
Hi,

Sorry for the rather long mail.

I'm going to describe the issue in the Subject: line and would like to suggest some changes to the agents package (and possibly Pacemaker, too). I would be glad if you could give me your thoughts and comments.

A pseudo RA that creates a state file under HA_RSCTMP (/var/run/heartbeat/rsctmp), such as Dummy, MailTo, etc., does not work properly on the Pacemaker+Corosync stack.

When a node crashes and is rebooted, a stale state file survives the reboot, so the RA misbehaves as if the resource were already started when the cluster is launched again for recovery.

This problem does not occur on the Heartbeat stack because Heartbeat removes HA_RSCTMP at startup, while on the Pacemaker stack neither Pacemaker nor Corosync removes it.

But having Pacemaker remove the files does not seem correct: if they were removed at cluster startup time, maintenance mode would no longer work properly.

In my understanding, the correct behavior is:
- They should NOT be removed at cluster startup time.
- They should be removed at OS boot time.

My suggestion to address this issue is the following:
1) Change the HA_RSCTMP location to /var/run/resource-agents, or any other subdirectory directly under /var/run.
2) Set the directory permission to 01777 (with the sticky bit).
3) Change the IPaddr/SendArp RAs not to use their own subdirectory but to add a prefix to the filename instead.
4) Make /var/run/heartbeat/rsctmp obsolete; Heartbeat/Pacemaker could preserve the current behavior for a while for compatibility.

The basic idea of the changes is that we would follow the file removal procedure defined by the FHS (Filesystem Hierarchy Standard).
http://www.pathname.com/fhs/pub/fhs-2.3.html#VARRUNRUNTIMEVARIABLEDATA

The FHS defines that any files under a subdirectory of /var/run should be removed at OS boot time. Unfortunately, a second-level subdirectory is out of its scope, so you cannot rely on the removal there (and that is the case for /var/run/heartbeat/rsctmp).

I believe that the impact on existing RAs is minimal.

If your RA is implemented correctly, then you need to do nothing; just note that the location of the state file has changed.

If your RA has /var/run/heartbeat/rsctmp hardcoded, or it creates its own subdirectory, you are encouraged to fix it because it may not work well with maintenance mode, but you can continue to use the old rsctmp if you would like.

I would like to hear your thoughts and comments.

Regards,

--
Keisuke MORI
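Suggestion 3) above, using a filename prefix in place of a per-agent subdirectory, can be sketched as follows. This is a minimal illustration under stated assumptions: `build_state_path()` is a hypothetical helper, and the "<prefix>-<name>" layout is an assumption, since the mail does not specify the exact naming scheme. Only the directory /var/run/resource-agents comes from the proposal.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<dir>/<prefix>-<name>" into buf, so that every state file lives
 * directly in one flat directory and no agent needs its own subdirectory.
 *
 * Returns 0 on success, -1 if the result does not fit into buf.
 * The "<prefix>-<name>" layout is an assumption made for illustration.
 */
int build_state_path(char *buf, size_t size, const char *dir,
                     const char *prefix, const char *name)
{
    assert(buf != NULL && dir != NULL && prefix != NULL && name != NULL);

    int n = snprintf(buf, size, "%s/%s-%s", dir, prefix, name);
    if (n < 0 || (size_t)n >= size) {
        return -1;  /* encoding error or truncated path */
    }
    return 0;
}
```

For example, calling it with the directory "/var/run/resource-agents", the prefix "IPaddr" and the name "10.0.0.1" yields "/var/run/resource-agents/IPaddr-10.0.0.1", a file that sits one level below /var/run's subdirectory and is therefore covered by the boot-time cleanup the mail relies on.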