[ClusterLabs] Antw: Re: Antw: Delayed first monitoring
Miloš Kozák milos.ko...@lejmr.com wrote on 13.08.2015 at 09:56 in message 55cc4daa.4020...@lejmr.com: On 13.8.2015 at 09:26, Andrei Borzenkov wrote: On Thu, Aug 13, 2015 at 10:01 AM, Miloš Kozák milos.ko...@lejmr.com wrote: However, this does not make sense at all. Presumably, Pacemaker should get along with LSB scripts which come from the system repository, right? Let's forget about Pacemaker for a moment. You have system startup where service B needs service A. The initscript for service A completes and the script for service B is started, but service A is not yet ready to be used. This is a bug in the startup script, irrespective of whether you use it with Pacemaker or not. I am sorry, but I didn't get the point. If service A is not ready then service B should not be started. As you seem to be ignorant of advice: Yes, you are right: Service B should check whether service A is up before starting itself. The easy change for the start script of B is to find out what command was run before it, and to check whether that command did everything OK by checking again itself. [...] ___ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Antw: Ordering constraint restart second resource group
On Thu, Aug 13, 2015 at 11:25 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: And what exactly is your problem? Real-life example: a database resource depends on storage resource(s). There are multiple filesystems/volumes with database files. The database admin needs to increase available space. You add new storage, configure it in the cluster ... pooh, your database is restarted. There is zero need to restart the database because it does not even use the new resource yet. I do the above routinely with another cluster implementation without any visible impact. If you change a resource, it will be restarted, and if a resource is restarted, constraints will be followed... Apart from that: If I understand your configuration correctly, it's very much the same as resource_group ip1 ip2 apache1 Regards, Ulrich John Gogu ionut.g...@gmail.com wrote on 12.08.2015 at 18:35 in message CAMESV9DUj3owj16oT5DSYjxZWeZX1f5wV63=muyta3vv0kk...@mail.gmail.com: Hello, in my cluster configuration I have the following situation: resource_group_A ip1 ip2 resource_group_B apache1 ordering constraint resource_group_A then resource_group_B symmetrical=true When I add a new resource to group_A, resources from group_B are restarted. If I remove the constraint all is OK, but I need to keep this ordering constraint. John
[ClusterLabs] Antw: Re: Antw: Ordering constraint restart second resource group
Andrei Borzenkov arvidj...@gmail.com wrote on 13.08.2015 at 11:33 in message CAA91j0WaYcPPNCtMZnwDz4_QDFWgxPrO6DbB=ga3bv+_ooo...@mail.gmail.com: On Thu, Aug 13, 2015 at 11:25 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: And what exactly is your problem? Real-life example: a database resource depends on storage resource(s). There are multiple filesystems/volumes with database files. The database admin needs to increase available space. You add new storage, configure it in the cluster ... pooh, your database is restarted. There is zero need to restart the database because it does not even use the new resource yet. I do the above routinely with another cluster implementation without any visible impact. Hi! So maybe that's why planning is usually done before implementation. We had similar problems with our NFS exports, and we redesigned to speed up the resource start and stop times, as well as to allow changes without restarting the NFS server and (most importantly) local (dependent) client resources... What you could do: Add your new IP resource with a location constraint only (it will then be started on the right node). Then put the IP resource into the group, and the cluster will see that every resource is in the desired state, and nothing will be restarted. Regards, Ulrich If you change a resource, it will be restarted, and if a resource is restarted, constraints will be followed... Apart from that: If I understand your configuration correctly, it's very much the same as resource_group ip1 ip2 apache1 Regards, Ulrich John Gogu ionut.g...@gmail.com wrote on 12.08.2015 at 18:35 in message CAMESV9DUj3owj16oT5DSYjxZWeZX1f5wV63=muyta3vv0kk...@mail.gmail.com: Hello, in my cluster configuration I have the following situation: resource_group_A ip1 ip2 resource_group_B apache1 ordering constraint resource_group_A then resource_group_B symmetrical=true When I add a new resource to group_A, resources from group_B are restarted. If I remove the constraint all is OK, but I need to keep this ordering constraint. John
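Ulrich's suggestion in this thread (create the new resource with only a location constraint, then move it into the group once it is running) might look roughly like the following with crmsh. This is only a sketch: the resource name ip3, the address 192.0.2.13 and the node name node-1 are invented for illustration, and the `modgroup` subcommand is assumed to be available (it exists in crmsh 2.x; older versions may need `crm configure edit` instead). The script prints the commands rather than running them.

```shell
#!/bin/sh
# Dry run: print the crm commands instead of executing them.
# Review the output before running anything against a real cluster.
run() { echo "+ $*"; }

NEW_IP=ip3                 # hypothetical name of the new resource
GROUP=resource_group_A     # group name taken from the original post
NODE=node-1                # hypothetical node where the group is active

# 1. Create the new resource outside the group.
run crm configure primitive "$NEW_IP" ocf:heartbeat:IPaddr2 params ip=192.0.2.13
# 2. Pin it to the node where the group already runs.
#    (On a live cluster, steps 1 and 2 are best committed together.)
run crm configure location "loc-$NEW_IP" "$NEW_IP" inf: "$NODE"
# 3. Once the resource is started there, add it to the group.
run crm configure modgroup "$GROUP" add "$NEW_IP"
```

Whether the dependent group really stays untouched depends on the Pacemaker version and on the new resource already being active on the right node, so it is worth trying on a test cluster first.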
[ClusterLabs] stonithd: stonith_choose_peer: Couldn't find anyone to fence node with any
Hi,

Brief description of the STONITH problem:

I see two different behaviors with two different STONITH configurations. If Pacemaker cannot find a device that can STONITH a problematic node, the node remains up and running, which is bad, because it must be STONITHed. In contrast, if Pacemaker finds a device that it thinks can STONITH a problematic node, even if the device actually cannot, Pacemaker goes down after STONITH returns a false positive. Pacemaker shuts itself down right after STONITH. Is this the expected behavior? Do I need to configure two more STONITH agents just for rebooting the nodes on which they are running (e.g. with # reboot -f)?

Set-up:
- two-node cluster (node-0 and node-1);
- two fencing (STONITH) agents are configured (STONITH_node-0 and STONITH_node-1);
- STONITH_node-0 runs only on node-1 // this fencing agent can only fence node-0
- STONITH_node-1 runs only on node-0 // this fencing agent can only fence node-1

Environment:
- one node - node-0 - is up and running;
- one STONITH agent - STONITH_node-1 - is up and running

Test case:
Simulate an error when stopping a resource.
1. start the cluster
2. change an RA's script to return $OCF_ERR_GENERIC from the Stop function.
3. stop the resource with # crm resource stop resource

Actual behavior:

CASE 1: STONITH is configured with:
# crm configure primitive STONITH_node-1 stonith:fence_sbb_hw \
    params pcmk_host_list=node-1 pcmk_host_check=static-list
After issuing a stop command:
- the resource changes its state to FAILED
- Pacemaker remains working
See the LOG_snippet_1 section below.

CASE 2: STONITH is configured with:
# crm configure primitive STONITH_node-1 stonith:fence_sbb_hw
After issuing a stop command:
- the resource changes its state to FAILED
- Pacemaker stops working
See the LOG_snippet_2 section below.

LOG_snippet_1:
Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: handle_request: Client crmd.39210.fa40430f wants to fence (reboot) 'node-0' with device '(any)'
Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node-0: 18cc29db-b7e4-4994-85f1-df891f091a0d (0)
Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: can_fence_host_with_device: STONITH_node-1 can not fence (reboot) node-0: static-list
Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: stonith_choose_peer:Couldn't find anyone to fence node-0 with any
Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: info: call_remote_stonith:Total remote op timeout set to 60 for fencing of node node-0 for crmd.39210.18cc29db
Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: info: call_remote_stonith:None of the 1 peers have devices capable of terminating node-0 for crmd.39210 (0)
Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: warning: get_xpath_object: No match for //@st_delegate in /st-reply
Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd:error: remote_op_done: Operation reboot of node-0 by node-0 for crmd.39210@node-0.18cc29db: No such device
Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: notice: tengine_stonith_callback: Stonith operation 3/23:16:0:0856a484-6b69-4280-b93f-1af9a6a542ee: No such device (-19)
Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: notice: tengine_stonith_callback: Stonith operation 3 for node-0 failed (No such device): aborting transition.
Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: info: abort_transition_graph: Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: notice: tengine_stonith_notify: Peer node-0 was not terminated (reboot) by node-0 for node-0: No such device

LOG_snippet_2:
Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: handle_request: Client crmd.9009.cabd2154 wants to fence (reboot) 'node-0' with device '(any)'
Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node-0: 3b06d3ce-b100-46d7-874e-96f10348d9e4 (0)
Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: can_fence_host_with_device: STONITH_node-1 can fence (reboot) node-0: none
Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: info: call_remote_stonith: Total remote op timeout set to 60 for fencing of node node-0 for crmd.9009.3b06d3ce
Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: info: call_remote_stonith: Requesting that node-0 perform op reboot node-0 for crmd.9009 (72s)
Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd:
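The difference between the two cases shows up in the can_fence_host_with_device log lines: with static-list the device is rejected because node-0 is not in pcmk_host_list, while in the second case the check is reported as "none" and the device is accepted. The following is a simplified model of that eligibility decision, written only to illustrate the logs above. It is not Pacemaker's code, the function name can_fence is made up, and only the two modes seen in the logs are modelled.

```shell
#!/bin/sh
# Simplified model (NOT Pacemaker's implementation) of the eligibility
# decision logged by can_fence_host_with_device.
# Usage: can_fence CHECK_MODE HOST_LIST TARGET
can_fence() {
    check_mode=$1
    host_list=$2
    target=$3
    case "$check_mode" in
        none)
            # No check at all: the device is assumed able to fence any node.
            return 0
            ;;
        static-list)
            # Only nodes named in pcmk_host_list are eligible targets.
            for host in $host_list; do
                [ "$host" = "$target" ] && return 0
            done
            return 1
            ;;
        *)
            echo "mode not modelled: $check_mode" >&2
            return 2
            ;;
    esac
}

# CASE 1: pcmk_host_list=node-1 pcmk_host_check=static-list, target node-0
can_fence static-list "node-1" node-0 || echo "CASE 1: no device can fence node-0"
# CASE 2: no host list or check configured; the log reports the check as "none"
can_fence none "" node-0 && echo "CASE 2: STONITH_node-1 is asked to fence node-0"
```

In this model CASE 1 ends with no eligible device (matching "No such device" in LOG_snippet_1), while CASE 2 hands the request to an agent that cannot actually reach node-0, which is where the false positive comes from.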
Re: [ClusterLabs] Antw: Re: Antw: Delayed first monitoring
On 13/08/15 10:38 +0200, Ulrich Windl wrote: Miloš Kozák milos.ko...@lejmr.com wrote on 13.08.2015 at 09:56 in message 55cc4daa.4020...@lejmr.com: On 13.8.2015 at 09:26, Andrei Borzenkov wrote: On Thu, Aug 13, 2015 at 10:01 AM, Miloš Kozák milos.ko...@lejmr.com wrote: However, this does not make sense at all. Presumably, Pacemaker should get along with LSB scripts which come from the system repository, right? Let's forget about Pacemaker for a moment. You have system startup where service B needs service A. The initscript for service A completes and the script for service B is started, but service A is not yet ready to be used. This is a bug in the startup script, irrespective of whether you use it with Pacemaker or not. I am sorry, but I didn't get the point. If service A is not ready then service B should not be started. As you seem to be ignorant of advice: Yes, you are right: Service B should check whether service A is up before starting itself. The easy change for the start script of B is to find out what command was run before it, and to check whether that command did everything OK by checking again itself. [...] The harder task in the sketched, relaxed (not strictly serialized, at least per prerequisite-ordering) environment is for service B, aware of its prerequisite-ordered predecessor A, to (also) decide whether A is not by any chance still proceeding with its startup sequence -- something requiring a very detailed knowledge of A's internals and being prone to race conditions anyway. Hence reasonable, high-level init systems require such startup sequences to be completely finished by the time they acknowledge the service at hand as started and allow its prerequisite-ordered successor to join the game too. Consequently, the responsibility for answering "is the startup finished (successfully or not)?" is deferred to the lower-level dedicated startup recipes, which should then credibly signal this back to the init system (e.g., by finishing only when the startup is over) to prevent mess-ups. Going full circle, if such an assumption is broken in the httpd initscript, it should be fixed. -- Jan (Poki)
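The requirement described above, that a start script returns only once the service is actually usable, can be sketched as a small polling helper that an initscript's start action calls before exiting. This is a minimal illustration, not taken from any real httpd initscript: the name wait_until_ready and the one-second polling interval are assumptions, and the readiness command is whatever check suits the service.

```shell
#!/bin/sh
# Hypothetical helper: poll a readiness check until it succeeds or a
# timeout (in seconds) expires.
# Usage: wait_until_ready TIMEOUT_SECONDS COMMAND [ARGS...]
wait_until_ready() {
    timeout=$1
    shift
    elapsed=0
    # Run the readiness command; its output is not of interest here.
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1        # still not ready: report a failed start
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0                # the service answered: the start is complete
}
```

A start action would launch the daemon and then call something like `wait_until_ready 30 some_check_command || exit 1`, so that neither the init system nor Pacemaker treats the service as started before it can serve requests.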
Re: [ClusterLabs] implementation of fence and stonith agents for pacemaker
Digimer, Thank you. I will try this out. One more question. What about directories for those agents, what are the rules here? Thank you, Kostya On Tue, Aug 11, 2015 at 6:21 PM, Digimer li...@alteeve.ca wrote: On 11/08/15 11:17 AM, Kostiantyn Ponomarenko wrote: Hi guys, Is there any documentation which describes the implementation of fence and STONITH agents, like these for Resource Agents?: http://www.linux-ha.org/wiki/OCF_Resource_Agents http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html I am particularly interested in the arguments which are passed to a stonith resource by stonithd. Are there any guidelines on what arguments it has to handle and where it must be put (which directories are allowed)? So far I found this http://linux.die.net/man/7/stonithd . But, for example, it is not clear to me how pcmk_host_check=dynamic-list (query the device) works. Do I need to handle some action in my stonith agent for that parameter? Thank you, Kostya This is the API: https://fedorahosted.org/cluster/wiki/FenceAgentAPI It needs to be updated to reflect the need for agents to output the XML metadata. For now, you should be able to see the format needed by looking at the metadata output of existing FAs. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
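As a rough illustration of the argument handling asked about above, here is a hypothetical skeleton of a shell fence agent. It assumes, in line with the FenceAgentAPI page linked in this thread, that options arrive on standard input as name=value lines and that the requested operation is given by an action option; the agent name fence_example, the port option used for the target, and every output string are placeholders rather than the real metadata schema. As far as I understand, pcmk_host_check=dynamic-list makes the cluster call the agent's list action.

```shell
#!/bin/sh
# Hypothetical skeleton of a fence agent's request handling.
# Assumption: options are read from stdin as name=value lines.
handle_request() {
    action=reboot       # assumed default when no action is supplied
    port=""             # assumed option naming the node to fence
    while IFS='=' read -r name value; do
        case "$name" in
            action) action=$value ;;
            port)   port=$value ;;
            ''|'#'*) ;;     # ignore blank lines and comments
        esac
    done
    case "$action" in
        metadata)
            # Placeholder only; a real agent prints its full XML metadata.
            echo '<resource-agent name="fence_example"/>'
            ;;
        list)
            # Placeholder: nodes this device is able to fence.
            echo "node-1"
            ;;
        monitor|status)
            return 0
            ;;
        reboot|on|off)
            echo "would $action $port"
            ;;
        *)
            echo "unknown action: $action" >&2
            return 1
            ;;
    esac
}
```

For example, `printf 'action=reboot\nport=node-0\n' | handle_request` prints `would reboot node-0`. Comparing against the metadata output of an existing agent, as Digimer suggests, remains the reliable way to get the XML right.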
Re: [ClusterLabs] stonithd: stonith_choose_peer: Couldn't find anyone to fence node with any
Then make sure it can be stonithd. Add an additional stonith agent using an independent communication channel. Not possible. Only one node is up and running in the cluster and I am wondering: can it STONITH itself? Because most likely, after a reboot, the problem will be gone. I have no idea what fence_sbb_hw is or does That just reboots the peer. It is our specific STONITH agent. What this node does by itself really does not matter. What if at some point there is only one node in the cluster? In the solution I am working on, two nodes form the cluster, and it is possible to use this solution even with only one node. I am satisfied with CASE 2, where Pacemaker shuts itself down after calling STONITH, even though the stonith agent didn't reboot the needed node but returned a false positive. The only question is why this doesn't happen in CASE 1? Thank you, Kostya
[ClusterLabs] question on www.gossamer-threads.com/lists/linuxha/users/
Hi, I noticed that after moving to the new mailing list there are no more updates here: http://www.gossamer-threads.com/lists/linuxha/users/ Can it be fixed or am I missing something? It was a convenient way of searching/reading/tracking issues. Thank you, Kostya
Re: [ClusterLabs] question on www.gossamer-threads.com/lists/linuxha/users/
DOH! Please ignore my mail - I live in the past ;) Last mail in archive is from Jul. Stefan
[ClusterLabs] Antw: stonithd: stonith_choose_peer: Couldn't find anyone to fence node with any
Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote on 13.08.2015 at 13:39 in message caenth0fxlzwzw4jmoyk_go0w9o6e2gdd-zfdfohzrahwcgv...@mail.gmail.com: Hi, Brief description of the STONITH problem: I see two different behaviors with two different STONITH configurations. If Pacemaker cannot find a device that can STONITH a problematic node, the node remains up and running. Which is bad, because it must be STONITHed. Correct observation. I wonder whether cloning a STONITH resource would help; for a symmetric STONITH like SBD any node can fence any other node at the same time. Still, Pacemaker waits until the stonith resource (which is something different from SBD) is confirmed running on one node (hard to get if the one node with the STONITH resource in a two-node cluster went down unexpectedly). In contrast, if Pacemaker finds a device that it thinks can STONITH a problematic node, even if the device actually cannot, Pacemaker goes down after STONITH returns a false positive. Pacemaker shuts itself down right after STONITH. Is this the expected behavior? I'd be surprised if it were. Do I need to configure two more STONITH agents just for rebooting the nodes on which they are running (e.g. with # reboot -f)? Good question ;-) [...]
Re: [ClusterLabs] Antw: Re: Antw: Delayed first monitoring
On 13/08/15 04:38 AM, Ulrich Windl wrote: Miloš Kozák milos.ko...@lejmr.com wrote on 13.08.2015 at 09:56 in message 55cc4daa.4020...@lejmr.com: On 13.8.2015 at 09:26, Andrei Borzenkov wrote: On Thu, Aug 13, 2015 at 10:01 AM, Miloš Kozák milos.ko...@lejmr.com wrote: However, this does not make sense at all. Presumably, Pacemaker should get along with LSB scripts which come from the system repository, right? Let's forget about Pacemaker for a moment. You have system startup where service B needs service A. The initscript for service A completes and the script for service B is started, but service A is not yet ready to be used. This is a bug in the startup script, irrespective of whether you use it with Pacemaker or not. I am sorry, but I didn't get the point. If service A is not ready then service B should not be started. As you seem to be ignorant of advice: Ok, I'm starting to get annoyed now. You need to be more polite and respectful on this list. Yes, you are right: Service B should check whether service A is up before starting itself. The easy change for the start script of B is to find out what command was run before it, and to check whether that command did everything OK by checking again itself. [...] -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [ClusterLabs] implementation of fence and stonith agents for pacemaker
On 13/08/15 07:54 AM, Kostiantyn Ponomarenko wrote: Digimer, Thank you. I will try this out. One more question. What about directories for those agents, what are the rules here? Thank you, Kostya I'm not entirely sure I understand the question, sorry. What do you mean by directories for those agents? If you're asking about implementation details like which language to use, etc., there are no rules. Python and bash are the most common languages, I think, but I write my fence agents in Perl just fine. I think a couple are even in C. I suspect that Python is the language the upstream maintainers are happier with, but as beekhof said in the RA script; the person doing the work gets to make the decisions. :) If I didn't answer your question, please clarify. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [ClusterLabs] implementation of fence and stonith agents for pacemaker
Sorry, I should have been clearer. I mean the place where I must put my agent so that it is visible to the cluster. For example, I know that you need to put your agent into /usr/sbin/ and start its name with fence_ in order to make it visible to the cluster. So I want to know the rules: are there other places where I can also put my agent and have it visible to the cluster? Thank you, Kostya On Thu, Aug 13, 2015 at 5:34 PM, Digimer li...@alteeve.ca wrote: On 13/08/15 07:54 AM, Kostiantyn Ponomarenko wrote: Digimer, Thank you. I will try this out. One more question. What about directories for those agents, what are the rules here? Thank you, Kostya I'm not entirely sure I understand the question, sorry. What do you mean by directories for those agents? If you're asking about implementation details like which language to use, etc., there are no rules. Python and bash are the most common languages, I think, but I write my fence agents in Perl just fine. I think a couple are even in C. I suspect that Python is the language the upstream maintainers are happier with, but as beekhof said in the RA script; the person doing the work gets to make the decisions. :) If I didn't answer your question, please clarify. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [ClusterLabs] implementation of fence and stonith agents for pacemaker
Thank you for the help :-) On Aug 13, 2015 20:19, Digimer li...@alteeve.ca wrote: Ah, yes. If it's a RHEL/CentOS machine, put it in /usr/sbin/. If it's another OS, locate fence_ipmilan and put your agent in the same directory. digimer On 13/08/15 01:03 PM, Kostiantyn Ponomarenko wrote: Sorry, I should have been clearer. I mean the place where I must put my agent so that it is visible to the cluster. For example, I know that you need to put your agent into /usr/sbin/ and start its name with fence_ in order to make it visible to the cluster. So I want to know the rules: are there other places where I can also put my agent and have it visible to the cluster? Thank you, Kostya On Thu, Aug 13, 2015 at 5:34 PM, Digimer li...@alteeve.ca wrote: On 13/08/15 07:54 AM, Kostiantyn Ponomarenko wrote: Digimer, Thank you. I will try this out. One more question. What about directories for those agents, what are the rules here? Thank you, Kostya I'm not entirely sure I understand the question, sorry. What do you mean by directories for those agents? If you're asking about implementation details like which language to use, etc., there are no rules. Python and bash are the most common languages, I think, but I write my fence agents in Perl just fine. I think a couple are even in C. I suspect that Python is the language the upstream maintainers are happier with, but as beekhof said in the RA script; the person doing the work gets to make the decisions. :) If I didn't answer your question, please clarify. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
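Digimer's advice to locate fence_ipmilan and install next to it can be scripted along these lines. This is a small sketch: the helper name agent_dir_of is made up, and it only finds agents that are on the current PATH, so on systems where /usr/sbin is not on an ordinary user's PATH it may need to be run as root.

```shell
#!/bin/sh
# Print the directory that holds an installed agent, looked up via PATH.
# Usage: agent_dir_of AGENT_NAME
agent_dir_of() {
    path=$(command -v "$1") || return 1
    dirname "$path"
}

if dir=$(agent_dir_of fence_ipmilan); then
    echo "install custom fence_* agents into: $dir"
else
    echo "fence_ipmilan not found in PATH" >&2
fi
```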
Re: [ClusterLabs] Antw: stonithd: stonith_choose_peer: Couldn't find anyone to fence node with any
On 13 Aug 2015, at 11:36 pm, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote on 13.08.2015 at 13:39 in message caenth0fxlzwzw4jmoyk_go0w9o6e2gdd-zfdfohzrahwcgv...@mail.gmail.com: Hi, Brief description of the STONITH problem: I see two different behaviors with two different STONITH configurations. If Pacemaker cannot find a device that can STONITH a problematic node, the node remains up and running. Which is bad, because it must be STONITHed. Correct observation. I wonder whether cloning a STONITH resource would help;
no
for a symmetric STONITH like SBD any node can fence any other node at the same time. Still, Pacemaker waits until the stonith resource (which is something different from SBD) is confirmed running on one node (hard to get if the one node with the STONITH resource in a two-node cluster went down unexpectedly). In contrast, if Pacemaker finds a device that it thinks can STONITH a problematic node, even if the device actually cannot, Pacemaker goes down after STONITH returns a false positive. Pacemaker shuts itself down right after STONITH. Is this the expected behavior? I'd be surprised if it were. Do I need to configure two more STONITH agents just for rebooting the nodes on which they are running (e.g. with # reboot -f)? Good question ;-) [...]
Re: [ClusterLabs] Memory leak in crm_mon ?
-Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, August 11, 2015 2:49 AM To: Cluster Labs - All topics related to open-source clustering welcomed users@clusterlabs.org Subject: Re: [ClusterLabs] Memory leak in crm_mon ?

On 10 Aug 2015, at 5:33 pm, Attila Megyeri amegy...@minerva-soft.com wrote:

Hi! We are building a new cluster on top of pacemaker/corosync, and several times during the past days we noticed that "crm_mon -Af" used up all the memory+swap and caused high CPU usage. Killing the process solves the issue. We are using the binary package versions available in the latest Ubuntu trusty, namely:

crmsh 1.2.5+hg1034-1ubuntu4
pacemaker 1.1.10+git20130802-1ubuntu2.3
pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.3
corosync 2.3.3-1ubuntu1

Kernel is 3.13.0-46-generic

Looking back at some "atop" data, the CPU went to 100% many times during the last couple of days, at various times, most often at exactly midnight (strange): 08.05 14:00, 08.06 21:41, 08.07 00:00, 08.07 00:00, 08.08 00:00, 08.09 06:27. I checked the corosync log and syslog, but did not find any correlation between the entries in the logs around those specific times. For most of the time, the node running crm_mon was the DC as well, not running any resources (e.g. a pairless node for quorum). We have another running system where everything works perfectly, although it is almost the same:

crmsh 1.2.5+hg1034-1ubuntu4
pacemaker 1.1.10+git20130802-1ubuntu2.1
pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.1
corosync 2.3.3-1ubuntu1

Kernel is 3.13.0-8-generic

Is this perhaps a known issue?

Possibly, that version is over 2 years old.

Any hints?

Getting something a little more recent would be the best place to start

Thanks Andrew, I tried to upgrade to 1.1.12 using the packages available at https://launchpad.net/~syseleven-platform . In the first attempt I upgraded a single node to see how it works out, but I ended up with errors like

Could not establish cib_rw connection: Connection refused (111)

I have disabled the firewall, no changes. The node appears to be running but does not see any of the other nodes. On the other nodes I see this node as an UNCLEAN one. (I assume corosync is fine, but pacemaker is not.) I use udpu for the transport. Am I doing something wrong? I tried to look for some howtos on upgrading, but the only thing I found was the rather outdated http://clusterlabs.org/wiki/Upgrade Could you please direct me to some howto/guide on how to perform the upgrade? Or am I facing some compatibility issue, so that I should extract the whole CIB, upgrade all nodes and reconfigure the cluster from scratch? (The cluster is meant to go live in 2 days... :) ) Thanks a lot in advance

Thanks!