Re: [Pacemaker] Enable remote monitoring
On 12/12/12 15:38, Gao,Yan wrote: On 12/12/12 11:14, Gao,Yan wrote: On 12/12/12 01:53, David Vossel wrote:
- Original Message - From: Yan Gao y...@suse.com To: pacemaker@oss.clusterlabs.org Sent: Tuesday, December 11, 2012 1:23:03 AM Subject: Re: [Pacemaker] Enable remote monitoring
Hi, Here's the latest code: https://github.com/gao-yan/pacemaker/commit/4d58026c2171c42385c85162a0656c44b37fa7e8
Now: - container-type: * black - ordering, colocating * white - ordering. Neither of them is probed so far.
I think for the sake of this implementation we should ignore the whitebox use case for now. There are aspects of the whitebox use case that I'm just not sure about yet, and I don't want to hold you all up trying to define it. I don't mind re-approaching this container concept and expanding it to the whitebox use case later, building on what you have here. I'm in favor of removing container-type and letting the blackbox use case be the default for now; I'll go in and do our whitebox bits later.
Hmm, that might be better until we have a clear definition for whitebox. Removed container-type for now. Pushed with several regression tests:
Sorry, forgot the link: https://github.com/gao-yan/pacemaker/commits/container
Regards, Gao,Yan -- Gao,Yan y...@suse.com Software Engineer China Server Team, SUSE.
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Suggestion to improve movement of booth
Hi Yusuke,
On Fri, 2012-11-30 at 21:30 +0900, yusuke iida wrote: Hi, Jiaju
When communication between some of the proposers and acceptors is lost, the proposer temporarily re-acquires the lease. Since the ticket is temporarily revoked at that point, the service stops temporarily. I think this is a problem. I would like the lease on the ticket to be held.
This is what I wanted to do as well ;) That is to say, the lease should keep being renewed on the original site as long as that site is up. The current implementation lets the original site renew the ticket before the ticket lease expires (the ticket is revoked only when the lease expires); hence, before other sites try to acquire the ticket, the original site has already renewed it, so the ticket stays on that site. I don't quite understand your problem here. Is it that the lease is not being kept on the original site?
Thanks, Jiaju
I thought about a way to prevent the lease re-acquisition from causing the ticket to move. While the proposer keeps renewing the lease, I think messages from a new proposer should be refused. To keep the existing behavior available, I want this to be switchable by a setting. I wrote a patch for this proposal: https://github.com/yuusuke/booth/commit/6b82fda7b4220c418ff906a9cf8152fe88032566
What do you think about this proposal?
Best regards, Yuusuke -- METRO SYSTEMS CO., LTD Yuusuke Iida Mail: yusk.i...@gmail.com
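Jiaju's description of the renewal timing can be illustrated with a toy model (purely illustrative Python, not booth's actual code or API; the names and the half-expiry renewal interval are assumptions): the holder renews before the lease lapses, so a rival site's acquisition attempts succeed only once the holder stops renewing.

```python
# Toy model of ticket-lease timing: NOT booth code. "siteA"/"siteB",
# EXPIRY, and the renew-at-half-expiry policy are illustrative assumptions.

EXPIRY = 10  # lease length in ticks

class Ticket:
    def __init__(self, holder, now):
        self.holder = holder
        self.expires = now + EXPIRY

    def renew(self, site, now):
        # Only the current holder may renew, and only before expiry.
        if site == self.holder and now < self.expires:
            self.expires = now + EXPIRY
            return True
        return False

    def try_acquire(self, site, now):
        # Another site succeeds only once the lease has lapsed.
        if now >= self.expires:
            self.holder = site
            self.expires = now + EXPIRY
            return True
        return False

def simulate(holder_alive_until, total_ticks):
    t = Ticket("siteA", now=0)
    history = []
    for now in range(1, total_ticks):
        if now <= holder_alive_until and now % (EXPIRY // 2) == 0:
            t.renew("siteA", now)    # holder renews at half-expiry
        t.try_acquire("siteB", now)  # rival keeps probing
        history.append(t.holder)
    return history
```

As long as siteA keeps renewing, siteB's probes never land; once siteA stops (site down), siteB acquires the ticket only after the last lease expires, which matches the "ticket stays on the original site" behavior described above.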
Re: [Pacemaker] node status does not change even if pacemakerd dies
(12.12.06 12:18), Andrew Beekhof wrote: On Wed, Dec 5, 2012 at 8:32 PM, Kazunori INOUE inouek...@intellilink.co.jp wrote: (12.12.05 02:02), David Vossel wrote:
- Original Message - From: Kazunori INOUE inouek...@intellilink.co.jp To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Monday, December 3, 2012 11:41:56 PM Subject: Re: [Pacemaker] node status does not change even if pacemakerd dies
(12.12.03 20:24), Andrew Beekhof wrote: On Mon, Dec 3, 2012 at 8:15 PM, Kazunori INOUE inouek...@intellilink.co.jp wrote: (12.11.30 23:52), David Vossel wrote:
- Original Message - From: Kazunori INOUE inouek...@intellilink.co.jp To: pacemaker@oss pacemaker@oss.clusterlabs.org Sent: Friday, November 30, 2012 2:38:50 AM Subject: [Pacemaker] node status does not change even if pacemakerd dies
Hi, I am testing the latest version.
- ClusterLabs/pacemaker 9c13d14640 (Nov 27, 2012)
- corosync 92e0f9c7bb (Nov 07, 2012)
- libqb 30a7871646 (Nov 29, 2012)
Although I killed pacemakerd, the node status did not change.
[dev1 ~]$ pkill -9 pacemakerd
[dev1 ~]$ crm_mon :
Stack: corosync
Current DC: dev2 (2472913088) - partition with quorum
Version: 1.1.8-9c13d14
2 Nodes configured, unknown expected votes
0 Resources configured.
Online: [ dev1 dev2 ]
[dev1 ~]$ ps -ef|egrep 'corosync|pacemaker'
root 11990 1 1 16:05 ? 00:00:00 corosync
496 12010 1 0 16:05 ? 00:00:00 /usr/libexec/pacemaker/cib
root 12011 1 0 16:05 ? 00:00:00 /usr/libexec/pacemaker/stonithd
root 12012 1 0 16:05 ? 00:00:00 /usr/libexec/pacemaker/lrmd
496 12013 1 0 16:05 ? 00:00:00 /usr/libexec/pacemaker/attrd
496 12014 1 0 16:05 ? 00:00:00 /usr/libexec/pacemaker/pengine
496 12015 1 0 16:05 ? 00:00:00 /usr/libexec/pacemaker/crmd
We want the node status to change to OFFLINE (stonith-enabled=false) or UNCLEAN (stonith-enabled=true). That is, we want the function of this deleted code: https://github.com/ClusterLabs/pacemaker/commit/dfdfb6c9087e644cb898143e198b240eb9a928b4
How are you launching pacemakerd?
The systemd service script relaunches pacemakerd on failure, and pacemakerd has the ability to attach to all the old processes if they are still around, as if nothing happened. -- Vossel
Hi David, We are using RHEL6 and will keep using it for a while. Therefore, I start it with the following commands: $ /etc/init.d/pacemakerd start or $ service pacemaker start
Ok. Are you using the pacemaker plugin? When using cman or corosync 2.0, pacemakerd isn't strictly needed for normal operation. It's only there to shut down and/or respawn failed components.
We are using corosync 2.1, so the service does not stop normally after pacemakerd dies:
$ pkill -9 pacemakerd
$ service pacemaker stop
$ echo $?
0
$ ps -ef|egrep 'corosync|pacemaker'
root 3807 1 0 13:10 ? 00:00:00 corosync
496 3827 1 0 13:10 ? 00:00:00 /usr/libexec/pacemaker/cib
root 3828 1 0 13:10 ? 00:00:00 /usr/libexec/pacemaker/stonithd
root 3829 1 0 13:10 ? 00:00:00 /usr/libexec/pacemaker/lrmd
496 3830 1 0 13:10 ? 00:00:00 /usr/libexec/pacemaker/attrd
496 3831 1 0 13:10 ? 00:00:00 /usr/libexec/pacemaker/pengine
496 3832 1 0 13:10 ? 00:00:00 /usr/libexec/pacemaker/crmd
Ah yes, that is a problem. Having pacemaker still running when the init script says it is down... that is bad. Perhaps we should just make the init script smart enough to check that all the pacemaker components are down after pacemakerd is down. Whether or not the failure of pacemakerd is something the cluster should be alerted to is something I'm not sure about. With the corosync 2.0 stack, pacemakerd really doesn't do anything except launch and relaunch processes. A cluster can be completely functional without a pacemakerd instance running anywhere. If any of the actual pacemaker components on a node fail, the logic that causes that node to get fenced has nothing to do with pacemakerd.
-- Vossel
Hi, I think the process-relaunch function of pacemakerd is very useful, so I want to avoid managing resources on a node where pacemakerd no longer exists.
You do understand that the node will be fenced if any of those processes fail, right? It's not as if a node could end up in a bad state because pacemakerd isn't around to respawn things. The relaunching is there in an attempt to recover before anyone else notices. So essentially what you're asking for is to fence the node and migrate all the resources so that in the future IF another process dies, we MIGHT not have to fence the node
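David's suggestion above (make the init script verify that no pacemaker components survived once pacemakerd is down) could look roughly like this. A hypothetical sketch, not the shipped init script: the daemon names are taken from the ps listings in this thread, and the function reads a ps snapshot on stdin so it can be exercised without a running cluster.

```shell
#!/bin/sh
# Sketch of a post-stop check: report any pacemaker component processes
# that survived "service pacemaker stop". Illustrative only; the list of
# daemons comes from the ps output shown earlier in this thread.

leftover_daemons() {
    # Prints any surviving pacemaker component process lines from a ps
    # snapshot on stdin; prints nothing when the stop was clean.
    grep -E 'pacemaker/(cib|stonithd|lrmd|attrd|pengine|crmd)' || true
}

# Real usage (assumption) inside the init script's stop action:
#   if [ -n "$(ps -ef | leftover_daemons)" ]; then
#       echo "pacemaker components still running" >&2
#       exit 1
#   fi
```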
Re: [Pacemaker] booth is the state of started on pacemaker before booth write ticket info in cib.
On Tue, 2012-12-11 at 20:15 +0900, Yuichi SEINO wrote: Hi Jiaju,
Currently, booth is reported as started by pacemaker before booth writes the ticket information into the CIB. So, if old ticket information is present in the CIB, a resource depending on that ticket may start before booth resets the ticket. I think the problem is the point at which booth daemonizes.
The resource should not be started before the booth daemon is ready. We suggest configuring an ordering constraint between the booth daemon and the resources managed by that ticket. That being said, if the ticket is in the CIB but the booth daemon has not been started, the resources would not be started.
Perhaps this problem didn't happen before the following commit: https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
Currently, booth should be regarded as ready only when all of the initialization (including loading the new ticket information) has finished. So if you encounter a problem here, I guess we should improve the RA to better reflect booth's startup status, rather than change the initialization order, since that may introduce other regressions, as we have encountered before ;)
Thanks, Jiaju
Sincerely, Yuichi -- Yuichi SEINO METROSYSTEMS CORPORATION E-mail:seino.clust...@gmail.com
Re: [Pacemaker] Enable remote monitoring
On 2012-12-11T12:53:39, David Vossel dvos...@redhat.com wrote:
Excellent progress! Just one aspect caught my eye:
- on-fail defaults to restart-container for most actions, except for the stop op (Not sure what it means if a stop fails. A nagios daemon cannot be terminated? Should it always return success?)
A nagios stop action should always return success. The nagios agent doesn't even need a stop function; the lrmd can know to treat a stop as a (no-op for stop) + (cancel all recurring actions). In this case, if the nagios agent doesn't stop successfully, it is because of an lrmd failure, which should result in a fencing action, I'd imagine.
That's something that, IMHO, shouldn't be handled by the container abstraction but, like you say, by the LRM/class code. I think on-fail=restart-container makes sense even for stop. If stop can't technically fail for a given class, even better. But it could mean that we actually need to stop some monitoring daemon or whatever. The other logic might be to set it to ignore, which would also work for me (even if a bit less obviously). But really, I'd not want to just skip stop for contained resources here ;-)
- Failures of resources count against container's
What happens if someone wants to clear the container's failcount? Do we need to add some logic to go in and clear all the child resources' failures as well to make this happen correctly?
That appears to make sense.
Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. -- Oscar Wilde
[Pacemaker] Moving multi-state resources
Hi, My requirement was to do some administration on one of the nodes where a 2-node multi-state resource was running. To effect a resource instance stoppage on one of the nodes, I added a resource constraint as below: crm configure location ms_stop_res_on_node ms_resource rule -inf: \#uname eq `hostname` The resource cleanly moved over to the other node. Incidentally, the resource was the master on this node and was successfully moved to a master state on the other node too. Now, I want to bring the resource back onto the original node. But the above resource constraint seems to have a persistent behaviour. crm resource unmigrate ms_resource does not seem to undo the effects of the constraint addition. I think the location constraint is preventing the resource from starting on the original node. How do I delete this location constraint now? Is there a more standard way of doing such administrative tasks? The requirement is that I do not want to offline the entire node while doing the administration but rather would want to stop only the resource instance, do the admin work and restart the resource instance on the node. Thanks, Pavan
Re: [Pacemaker] pacemaker processes RSS growth
12.12.2012 05:35, Andrew Beekhof wrote: On Tue, Dec 11, 2012 at 5:49 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 11.12.2012 06:52, Vladislav Bogdanov wrote: 11.12.2012 05:12, Andrew Beekhof wrote: On Mon, Dec 10, 2012 at 11:34 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 10.12.2012 09:56, Vladislav Bogdanov wrote: 10.12.2012 04:29, Andrew Beekhof wrote: On Fri, Dec 7, 2012 at 5:37 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 06.12.2012 09:04, Vladislav Bogdanov wrote: 06.12.2012 06:05, Andrew Beekhof wrote:
I wonder what the growth looks like with the recent libqb fix. That could be an explanation.
Valid point. I will watch.
On an almost static cluster the only change in memory state during 24 hours is +700kb of shared memory for crmd on the DC. Will keep watching that one.
It still grows, ~650-700k per day. I sampled 'maps' and 'smaps' content from crmd's proc and will look at what differs there over time. smaps tells me it may be in /dev/shm/qb-pengine-event-1735-1736-4-data. 1735 is pengine, 1736 is crmd. Diff of that part:
@@ -56,13 +56,13 @@ MMUPageSize: 4 kB
 7f427fddf000-7f42802df000 rw-s 00:0f 12332 /dev/shm/qb-pengine-event-1735-1736-4-data
 Size: 5120 kB
-Rss: 4180 kB
-Pss: 2089 kB
+Rss: 4320 kB
+Pss: 2159 kB
 Shared_Clean: 0 kB
-Shared_Dirty: 4180 kB
+Shared_Dirty: 4320 kB
 Private_Clean: 0 kB
 Private_Dirty: 0 kB
-Referenced: 4180 kB
+Referenced: 4320 kB
 Anonymous: 0 kB
 AnonHugePages: 0 kB
 Swap: 0 kB
'Rss' and 'Shared_Dirty' will soon reach 'Size' (now 4792 vs 5120); I'll look at what happens then. I expect growth to stop and pages to be reused. If that is true, then there are no leaks, but rather a controlled fill of a buffer of a predefined size.
Great. Please let me know how it turns out.
Now I see:
@@ -56,13 +56,13 @@ MMUPageSize: 4 kB
 7f427fddf000-7f42802df000 rw-s 00:0f 12332 /dev/shm/qb-pengine-event-1735-1736-4-data
 Size: 5120 kB
-Rss: 4180 kB
-Pss: 2089 kB
+Rss: 5120 kB
+Pss: 2559 kB
 Shared_Clean: 0 kB
-Shared_Dirty: 4180 kB
+Shared_Dirty: 5120 kB
 Private_Clean: 0 kB
 Private_Dirty: 0 kB
-Referenced: 4180 kB
+Referenced: 5120 kB
 Anonymous: 0 kB
 AnonHugePages: 0 kB
 Swap: 0 kB
@@ -70,13 +70,13 @@ MMUPageSize: 4 kB
 7f42802df000-7f42807df000 rw-s 00:0f 12332 /dev/shm/qb-pengine-event-1735-1736-4-data
 Size: 5120 kB
-Rss: 0 kB
-Pss: 0 kB
+Rss: 4 kB
+Pss: 1 kB
 Shared_Clean: 0 kB
-Shared_Dirty: 0 kB
+Shared_Dirty: 4 kB
 Private_Clean: 0 kB
 Private_Dirty: 0 kB
-Referenced: 0 kB
+Referenced: 4 kB
 Anonymous: 0 kB
 AnonHugePages: 0 kB
 Swap: 0 kB
So, it is stuck at 5 MB and does not grow any more. Moreover, all pacemaker processes on the DC for some reason now consume much less shared memory, according to htop, than the last time I looked at them. It seems to be due to a decrease of referenced pages within some anonymous mappings, though I have no idea why that happened.
OK, the main conclusion I can make is that pacemaker does not have any memory leaks in the code paths used by a static cluster. Will try to provide some load now.
Re: [Pacemaker] Moving multi-state resources
Hi, On Wed, Dec 12, 2012 at 03:50:01PM +0530, pavan tc wrote: Hi, My requirement was to do some administration on one of the nodes where a 2-node multi-state resource was running. To effect a resource instance stoppage on one of the nodes, I added a resource constraint as below: crm configure location ms_stop_res_on_node ms_resource rule -inf: \#uname eq `hostname` The resource cleanly moved over to the other node. Incidentally, the resource was the master on this node and was successfully moved to a master state on the other node too. Now, I want to bring the resource back onto the original node. But the above resource constraint seems to have a persistent behaviour. crm resource unmigrate ms_resource does not seem to undo the effects of the constraint addition. You can try to remove your constraint: crm configure delete ms_stop_res_on_node migrate/unmigrate generate/remove special constraints. Thanks, Dejan I think the location constraint is preventing the resource from starting on the original node. How do I delete this location constraint now? Is there a more standard way of doing such administrative tasks? The requirement is that I do not want to offline the entire node while doing the administration but rather would want to stop only the resource instance, do the admin work and restart the resource instance on the node. Thanks, Pavan
Re: [Pacemaker] Moving multi-state resources
On Wed, Dec 12, 2012 at 6:46 PM, Dejan Muhamedagic deja...@fastmail.fmwrote: Hi, On Wed, Dec 12, 2012 at 03:50:01PM +0530, pavan tc wrote: Hi, My requirement was to do some administration on one of the nodes where a 2-node multi-state resource was running. To effect a resource instance stoppage on one of the nodes, I added a resource constraint as below: crm configure location ms_stop_res_on_node ms_resource rule -inf: \#uname eq `hostname` The resource cleanly moved over to the other node. Incidentally, the resource was the master on this node and was successfully moved to a master state on the other node too. Now, I want to bring the resource back onto the original node. But the above resource constraint seems to have a persistent behaviour. crm resource unmigrate ms_resource does not seem to undo the effects of the constraint addition. You can try to remove your constraint: crm configure delete ms_stop_res_on_node That did the job. Thanks a ton! Pavan migrate/unmigrate generate/remove special constraints. Thanks, Dejan I think the location constraint is preventing the resource from starting on the original node. How do I delete this location constraint now? Is there a more standard way of doing such administrative tasks? The requirement is that I do not want to offline the entire node while doing the administration but rather would want to stop only the resource instance, do the admin work and restart the resource instance on the node. 
Thanks, Pavan
[Pacemaker] Listing resources by attributes
Hi, Is there a way in which resources can be listed based on some attributes? For example, listing resources running on a certain node, or listing ms resources. The crm_resource manpage talks about the -N and -t options that seem to address the requirements above, but they do not provide the expected result. crm_resource --list and crm_resource --list-raw give the same output regardless of whether -N or -t was provided. I had to do the following to pull out 'ms' resources, for example: crm configure show | grep -w ^ms | awk '{print $2}' Is there a cleaner way to list resources? Thanks, Pavan
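Since crm_resource cannot filter by type here, one workable alternative is to pull resource ids of a given type straight out of the CIB XML, where ms resources appear as <master> elements. A hypothetical sketch (the function name is mine); in real use the XML would come from `cibadmin -Q` rather than the sample document shown in the usage comment:

```shell
#!/bin/sh
# Extract the id= of every element of a given type from CIB XML on stdin.
# Illustrative sketch only, using a plain sed substitution; a real tool
# would use an XML-aware query instead of a regex.

list_by_type() {
    # $1 = element name, e.g. "master" for ms resources, "clone", "group"
    sed -n "s/.*<$1 [^>]*id=\"\([^\"]*\)\".*/\1/p"
}

# Real usage (assumption, requires a running cluster):
#   cibadmin -Q | list_by_type master
```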
[Pacemaker] Action from a different CRMD transition results in restarting services
Hi, I ran into the following issue and couldn't find what it really means: Detected action msgbroker_monitor_1 from a different transition: 16048 vs. 18014
I can see that its impact is to stop/start a service, but I'd like to understand it a bit more. Thank you in advance for any information.
Logs about this issue:
...
Dec 6 22:55:05 Node1 crmd: [5235]: info: process_graph_event: Detected action msgbroker_monitor_1 from a different transition: 16048 vs. 18014
Dec 6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph: process_graph_event:477 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=msgbroker_monitor_1, magic=0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692, cib=0.971.5) : Old event
Dec 6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating failcount for msgbroker on Node0 after failed monitor: rc=7 (update=value++, time=1354852505)
Dec 6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec 6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Dec 6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28069: Requesting the current CIB: S_POLICY_ENGINE
Dec 6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph: te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair, id=status-Node0-fail-count-msgbroker, magic=NA, cib=0.971.6) : Transient attribute: update
Dec 6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28070: Requesting the current CIB: S_POLICY_ENGINE
Dec 6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph: te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair, id=status-Node0-last-failure-msgbroker, magic=NA, cib=0.971.7) : Transient attribute: update
Dec 6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28071: Requesting the current CIB: S_POLICY_ENGINE
Dec 6 22:55:05 Node1 attrd: [5232]: info: find_hash_entry: Creating hash entry for last-failure-msgbroker
Dec 6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke_callback: Invoking the PE: query=28071, ref=pe_calc-dc-1354852505-39407, seq=12, quorate=1
Dec 6 22:55:05 Node1 pengine: [5233]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 6 22:55:05 Node1 pengine: [5233]: notice: unpack_rsc_op: Operation txpublisher_monitor_0 found resource txpublisher active on Node1
Dec 6 22:55:05 Node1 pengine: [5233]: WARN: unpack_rsc_op: Processing failed op msgbroker_monitor_1 on Node0: not running (7)
...
Dec 6 22:55:05 Node1 pengine: [5233]: notice: common_apply_stickiness: msgbroker can fail 99 more times on Node0 before being forced off
...
Dec 6 22:55:05 Node1 pengine: [5233]: notice: RecurringOp: Start recurring monitor (10s) for msgbroker on Node0
...
Dec 6 22:55:05 Node1 pengine: [5233]: notice: LogActions: Recover msgbroker (Started Node0)
...
Dec 6 22:55:05 Node1 crmd: [5235]: info: te_rsc_command: Initiating action 37: stop msgbroker_stop_0 on Node0
Transition 18014 details:
Dec 6 22:52:18 Node1 pengine: [5233]: notice: process_pe_message: Transition 18014: PEngine Input stored in: /var/lib/pengine/pe-input-3270.bz2
Dec 6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Dec 6 22:52:18 Node1 crmd: [5235]: info: unpack_graph: Unpacked transition 18014: 0 actions in 0 synapses
Dec 6 22:52:18 Node1 crmd: [5235]: info: do_te_invoke: Processing graph 18014 (ref=pe_calc-dc-1354852338-39406) derived from /var/lib/pengine/pe-input-3270.bz2
Dec 6 22:52:18 Node1 crmd: [5235]: info: run_graph:
Dec 6 22:52:18 Node1 crmd: [5235]: notice: run_graph: Transition 18014 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-3270.bz2): Complete
Dec 6 22:52:18 Node1 crmd: [5235]: info: te_graph_trigger: Transition 18014 is now complete
Dec 6 22:52:18 Node1 crmd: [5235]: info: notify_crmd: Transition 18014 status: done - null
Dec 6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec 6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: Starting PEngine Recheck Timer
Youssef
Re: [Pacemaker] pacemaker processes RSS growth
On Wed, Dec 12, 2012 at 11:17 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: Ok, the main conclusion I can make is that pacemaker does not have any memory leaks in code paths used by a static cluster. Huzah! :)
Re: [Pacemaker] node status does not change even if pacemakerd dies
On Wed, Dec 12, 2012 at 8:02 PM, Kazunori INOUE inouek...@intellilink.co.jp wrote: Hi, I recognize that pacemakerd is much less likely to crash. However, the possibility of its being killed by the OOM killer etc. is not 0%.
True. Although we just established in another thread that we don't have any leaks :)
So I think that a user gets confused, since behavior at the time of a process death differs depending on whether pacemakerd is running.
case A) When pacemakerd and the other processes (crmd etc.) are in a parent-child relation. [snip] For example, crmd died. However, since it is relaunched, the state of the cluster is not affected.
Right.
[snip]
case B) When pacemakerd and the other processes are NOT in a parent-child relation. Although pacemakerd was killed, this shows the state after it was respawned by Upstart.
$ service corosync start ; service pacemaker start
$ pkill -9 pacemakerd
$ ps -ef|egrep 'corosync|pacemaker|UID'
UID PID PPID C STIME TTY TIME CMD
root 21091 1 1 14:52 ? 00:00:00 corosync
496 21099 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/cib
root 21100 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/stonithd
root 21101 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/lrmd
496 21102 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/attrd
496 21103 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/pengine
496 21104 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/crmd
root 21128 1 1 14:53 ? 00:00:00 /usr/sbin/pacemakerd
Yep, looks right.
In this case, the node will be set to UNCLEAN if crmd dies. That is, the node will be fenced if there is a stonith resource.
Which is exactly what happens if only pacemakerd is killed with your proposal. Except now you have time to do a graceful pacemaker restart to re-establish the parent-child relationship. If you want to compare B with something, it needs to be with the old "children terminate if pacemakerd dies" strategy. Which is:
$ service corosync start ; service pacemaker start
$ pkill -9 pacemakerd
...
the node will be set to UNCLEAN
Old way: always downtime, because the children terminate, which triggers fencing.
Our way: no downtime unless there is an additional failure (to the cib or crmd).
Given that we're trying for HA, the second seems preferable.
$ pkill -9 crmd
$ crm_mon -1
Last updated: Wed Dec 12 14:53:48 2012
Last change: Wed Dec 12 14:53:10 2012 via crmd on dev2
Stack: corosync
Current DC: dev2 (2472913088) - partition with quorum
Version: 1.1.8-3035414
2 Nodes configured, unknown expected votes
0 Resources configured.
Node dev1 (2506467520): UNCLEAN (online)
Online: [ dev2 ]
How about making the behavior selectable with an option?
MORE_DOWNTIME_PLEASE=(true|false) ?
When pacemakerd dies: mode A) behaves the existing way (default); mode B) makes the node UNCLEAN.
Best Regards, Kazunori INOUE
Making stop work when there is no pacemakerd process is a different matter. We can make that work.
Though the best solution is to relaunch pacemakerd, if that is difficult, I think a shortcut is to make the node unclean. And now, I tried Upstart a little bit.
1) started corosync and pacemaker.
$ cat /etc/init/pacemaker.conf
respawn
script
  [ -f /etc/sysconfig/pacemaker ] && {
    . /etc/sysconfig/pacemaker
  }
  exec /usr/sbin/pacemakerd
end script
$ service co start
Starting Corosync Cluster Engine (corosync): [ OK ]
$ initctl start pacemaker
pacemaker start/running, process 4702
$ ps -ef|egrep 'corosync|pacemaker'
root 4695 1 0 17:21 ? 00:00:00 corosync
root 4702 1 0 17:21 ? 00:00:00 /usr/sbin/pacemakerd
496 4703 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/cib
root 4704 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/stonithd
root 4705 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/lrmd
496 4706 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/attrd
496 4707 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/pengine
496 4708 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/crmd
2) killed pacemakerd.
$ pkill -9 pacemakerd
$ ps -ef|egrep 'corosync|pacemaker'
root 4695 1 0 17:21 ? 00:00:01 corosync
496 4703 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/cib
root 4704 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/stonithd
root 4705 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/lrmd
496 4706 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/attrd
496 4707 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/pengine
496 4708 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/crmd
root 4760 1 1 17:24 ? 00:00:00 /usr/sbin/pacemakerd
3) then I stopped pacemakerd. however, some processes did not stop.
$
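For comparison with the Upstart job above: the first message in this thread notes that the systemd service relaunches pacemakerd on failure. A minimal illustrative unit using the standard systemd Restart= option (a sketch, not the unit file actually shipped with pacemaker) would look like:

```ini
# Illustrative sketch only; paths and dependencies are assumptions based
# on this thread, not the distribution's real pacemaker.service.
[Unit]
Description=Pacemaker High Availability Cluster Manager
After=corosync.service

[Service]
ExecStart=/usr/sbin/pacemakerd
# Respawn pacemakerd if it dies, mirroring the Upstart "respawn" stanza.
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

With this in place, killing pacemakerd triggers a respawn, and the respawned pacemakerd can re-attach to the still-running child daemons, which is the case B scenario discussed above.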
Re: [Pacemaker] Listing resources by attributes
On Thu, Dec 13, 2012 at 1:09 AM, pavan tc pavan...@gmail.com wrote: Hi, Is there a way in which resources can be listed based on some attributes? For example, listing resource running on a certain node, or listing ms resources. The crm_resource manpage talks about the -N and -t options that seem to address the requirements above. Not really. They're not designed to work with --list or --locate But they do not provide the expected result. crm_resource --list or crm_resource --list-raw give the same output immaterial of whether it was provided with -N or -t. I had to do the following to pull out 'ms' resources, for example: crm configure show | grep -w ^ms | awk '{print $2}' Is there a cleaner way to list resources? Not really. Thanks, Pavan
Re: [Pacemaker] Action from a different CRMD transition results in restarting services
On Thu, Dec 13, 2012 at 6:31 AM, Latrous, Youssef ylatr...@broadviewnet.com wrote:
> Hi,
>
> I ran into the following issue and I couldn't find what it really means:
>
>   Detected action msgbroker_monitor_1 from a different transition: 16048 vs. 18014

18014 is where we're up to now; 16048 is the (old) transition that scheduled the recurring monitor operation. I suspect you'll find the action failed earlier in the logs, and that's why it needed to be restarted. Not the best log message though :(

> I can see that its impact is to stop/start a service, but I'd like to understand it a bit more. Thank you in advance for any information.
>
> Logs about this issue:
>
> ...
> Dec 6 22:55:05 Node1 crmd: [5235]: info: process_graph_event: Detected action msgbroker_monitor_1 from a different transition: 16048 vs. 18014
> Dec 6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph: process_graph_event:477 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=msgbroker_monitor_1, magic=0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692, cib=0.971.5) : Old event
> Dec 6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating failcount for msgbroker on Node0 after failed monitor: rc=7 (update=value++, time=1354852505)
> Dec 6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Dec 6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> Dec 6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28069: Requesting the current CIB: S_POLICY_ENGINE
> Dec 6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph: te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair, id=status-Node0-fail-count-msgbroker, magic=NA, cib=0.971.6) : Transient attribute: update
> Dec 6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28070: Requesting the current CIB: S_POLICY_ENGINE
> Dec 6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph: te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair, id=status-Node0-last-failure-msgbroker, magic=NA, cib=0.971.7) : Transient attribute: update
> Dec 6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28071: Requesting the current CIB: S_POLICY_ENGINE
> Dec 6 22:55:05 Node1 attrd: [5232]: info: find_hash_entry: Creating hash entry for last-failure-msgbroker
> Dec 6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke_callback: Invoking the PE: query=28071, ref=pe_calc-dc-1354852505-39407, seq=12, quorate=1
> Dec 6 22:55:05 Node1 pengine: [5233]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Dec 6 22:55:05 Node1 pengine: [5233]: notice: unpack_rsc_op: Operation txpublisher_monitor_0 found resource txpublisher active on Node1
> Dec 6 22:55:05 Node1 pengine: [5233]: WARN: unpack_rsc_op: Processing failed op msgbroker_monitor_1 on Node0: not running (7)
> ...
> Dec 6 22:55:05 Node1 pengine: [5233]: notice: common_apply_stickiness: msgbroker can fail 99 more times on Node0 before being forced off
> ...
> Dec 6 22:55:05 Node1 pengine: [5233]: notice: RecurringOp: Start recurring monitor (10s) for msgbroker on Node0
> ...
> Dec 6 22:55:05 Node1 pengine: [5233]: notice: LogActions: Recover msgbroker (Started Node0)
> ...
> Dec 6 22:55:05 Node1 crmd: [5235]: info: te_rsc_command: Initiating action 37: stop msgbroker_stop_0 on Node0
>
> Transition 18014 details:
>
> Dec 6 22:52:18 Node1 pengine: [5233]: notice: process_pe_message: Transition 18014: PEngine Input stored in: /var/lib/pengine/pe-input-3270.bz2
> Dec 6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Dec 6 22:52:18 Node1 crmd: [5235]: info: unpack_graph: Unpacked transition 18014: 0 actions in 0 synapses
> Dec 6 22:52:18 Node1 crmd: [5235]: info: do_te_invoke: Processing graph 18014 (ref=pe_calc-dc-1354852338-39406) derived from /var/lib/pengine/pe-input-3270.bz2
> Dec 6 22:52:18 Node1 crmd: [5235]: info: run_graph:
> Dec 6 22:52:18 Node1 crmd: [5235]: notice: run_graph: Transition 18014 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-3270.bz2): Complete
> Dec 6 22:52:18 Node1 crmd: [5235]: info: te_graph_trigger: Transition 18014 is now complete
> Dec 6 22:52:18 Node1 crmd: [5235]: info: notify_crmd: Transition 18014 status: done - null
> Dec 6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Dec 6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: Starting PEngine Recheck Timer
>
> Youssef
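For what it's worth, the transition that scheduled an operation can be read straight out of the magic= field in log lines like the one above. A small sketch; the field layout status:rc;action:transition:target-rc:uuid is my reading of this particular log line, not an official format reference:

```shell
#!/bin/sh
# Pull the scheduling transition number out of an lrm_rsc_op "magic" value
# (taken from the abort_transition_graph log line above).
magic='0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692'

# Assumed layout: op-status:rc;action:transition:target-rc:uuid.
# After the ';' the second ':'-separated field is the transition number.
transition=$(printf '%s' "$magic" | cut -d';' -f2 | cut -d: -f2)
echo "$transition"   # prints 16048
```

That matches Andrew's reading: the monitor was scheduled by (old) transition 16048, while the cluster is now on 18014.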
Re: [Pacemaker] gfs2 / dlm on centos 6.2
I see, thanks very much for pointing me in the right direction!

Xavier Lashmar
Université d'Ottawa / University of Ottawa
Analyste de Systèmes | Systems Analyst
Service étudiants, service de l'informatique et des communications | Student services, computing and communications services.
1 Nicholas Street (810) Ottawa ON K1N 7B7
Tél. | Tel. 613-562-5800 (2120)

From: Andrew Beekhof [and...@beekhof.net]
Sent: Tuesday, December 11, 2012 9:30 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] gfs2 / dlm on centos 6.2

On Wed, Dec 12, 2012 at 1:29 AM, Xavier Lashmar xlash...@uottawa.ca wrote:
> Hello,
>
> We are attempting to mount gfs2 partitions on CentOS using DRBD + COROSYNC + PACEMAKER. Unfortunately we consistently get the following error:

You'll need to configure pacemaker to use cman for this. See:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/ch08s02.html

> # mount /dev/vg_data/lv_data /webdata/ -t gfs2 -v
> mount /dev/dm-2 /webdata
> parse_opts: opts = rw
> clear flag 1 for rw, flags = 0
> parse_opts: flags = 0
> parse_opts: extra =
> parse_opts: hostdata =
> parse_opts: lockproto =
> parse_opts: locktable =
> gfs_controld join connect error: Connection refused
> error mounting lockproto lock_dlm
>
> We are trying to find out where to get the lock_dlm libraries and packages for CentOS 6.2 and 6.3.
>
> Also, I found the Fedora 17 version of the document "Pacemaker 1.1 - Clusters from Scratch" a bit problematic. I'm also running a Fedora 17 system and found no package "dlm" as per the instructions in section 8.1.1:
>
>   yum install -y gfs2-utils dlm kernel-modules-extra
>
> Any idea if an external repository is needed? If so, which one? And which package do we need to install for CentOS 6+?
> Thanks very much
>
> Xavier Lashmar
> Analyste de Systèmes | Systems Analyst
> Service étudiants, service de l'informatique et des communications / Student services, computing and communications services.
> 1 Nicholas Street (810) Ottawa ON K1N 7B7
> Tél. | Tel. 613-562-5800 (2120)

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] booth is the state of started on pacemaker before booth write ticket info in cib.
Hi Jiaju,

2012/12/12 Jiaju Zhang jjzh...@suse.de:
> On Tue, 2012-12-11 at 20:15 +0900, Yuichi SEINO wrote:
>> Hi Jiaju,
>>
>> Currently, booth reaches the "Started" state in pacemaker before it has written the ticket information into the CIB. So, if stale ticket information is already in the CIB, a resource depending on that ticket may start before booth resets the ticket. I think the problem is the point at which booth daemonizes.
>
> The resource should not be started before the booth daemon is ready. We suggest configuring an ordering constraint between the booth daemon and the resources managed by that ticket. That being said, if the ticket is in the CIB but the booth daemon has not been started, the resources will not be started.

The booth RA finishes booth_start as soon as booth has daemonized from the foreground process (to be exact, a "sleep 1" is included). The current booth daemonizes before catchup, whereas the previous booth daemonized after catchup, and catchup is what writes the ticket into the CIB. So even if an ordering constraint is set, the dependent resource can start as soon as booth reports "Started" to pacemaker, as shown below, and at that point booth may not yet have finished catchup.

crm_mon paste:
...
 booth  (ocf::pacemaker:booth-site):  Started multi-site-a-1
...

Perhaps this problem didn't happen before the following commit:
https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f

> Currently, when all of the initialization (including loading the new ticket information) has finished, booth should be regarded as ready. So if you encounter a problem here, I guess we should improve the RA to better reflect the booth startup status rather than moving the initialization order, since that may introduce other regressions, as we have encountered before ;)

I am still not sure which we should fix, the RA or booth.
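For reference, the ordering constraint Jiaju suggests might be sketched in crm shell syntax roughly as follows; the resource and ticket names (booth, web, ticketA) are hypothetical, and this is a sketch rather than a verified configuration:

```
# Hypothetical names throughout -- adapt to the real configuration.
primitive booth ocf:pacemaker:booth-site
primitive web ocf:heartbeat:apache
# web may only run while ticketA is granted to this site ...
rsc_ticket web-ticketA ticketA: web
# ... and must start only after the booth daemon itself is up.
order booth-before-web inf: booth web
```

Note that, per the discussion above, this ordering only helps if booth_start does not return until catchup has completed.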
> Thanks,
> Jiaju

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail: seino.clust...@gmail.com

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org