Re: [Pacemaker] Pacemaker Corosync Issue
On 16 Oct 2014, at 6:33 pm, Sahil Aggarwal sahilaggarw...@gmail.com wrote:

> Hello,
>
> Yes, that log might be due to that reason, but it should not ignore the
> resource, as it is not taking any action for that resource, i.e. not
> starting the resource.

It doesn't know that at the time.

> Second, the "ignoring expired failure" log generally comes as
>
>   notice: unpack_rsc_op: Ignoring expired failure Server_last_failure_0
>
> but in the case where the service is ignored, the log comes as
>
>   notice: unpack_rsc_op: Ignoring expired failure (calculated) Server_last_failure_0
>
> This might be some other case.

Possibly in the old code, but the latest has them combined.

> Please suggest.
>
> On Thu, Oct 16, 2014 at 2:38 AM, Andrew Beekhof and...@beekhof.net wrote:
>
>> You don't think that might be a little short?
>> Any failure that happened more than 10s ago is going to be ignored,
>> leading to the pengine message you saw.
>>
>> On 16 Oct 2014, at 12:21 am, Sahil Aggarwal sahilaggarw...@gmail.com wrote:
>>
>>> The failure timeout for the resource is 10s.
>>>
>>> On Wed, Oct 15, 2014 at 2:51 AM, Andrew Beekhof and...@beekhof.net wrote:
>>>
>>>> On 15 Oct 2014, at 4:23 am, Sahil Aggarwal sahilaggarw...@gmail.com wrote:
>>>>
>>>>> Hello Team Pacemaker,
>>>>>
>>>>> I am facing a constant issue with Pacemaker: it does not restart the
>>>>> service even when it knows that the service is down. It generates a
>>>>> message saying "Ignoring expired failure" for the service.
>>>>
>>>> What is the failure timeout set to?
>>>>
>>>>> Pacemaker and Corosync versions are given below.
>>>>>
>>>>> OS: CentOS 6.2
>>>>> corosync-1.4.1-4.el6_2.2.x86_64
>>>>> pacemaker-1.1.9-2.el6.x86_64
>>>>>
>>>>> The log which pengine provides is:
>>>>>
>>>>> pengine[45232]: notice: unpack_rsc_op: Ignoring expired failure (calculated) Server_last_failure_0 (rc=7, magic=0:7;14:5699:0:459093cc-f3a1-483b-b853-53a1d9791361)
>>>>>
>>>>> Some more info:
>>>>>
>>>>> 1. This is a two-node cluster. There is a time difference of 10 min
>>>>>    between the two nodes.
--
Regards, Sahil
Mobile - 09467607999
fbAddress-www.facebook.com/SahilAggarwalg

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Pacemaker Corosync Issue
On 16 Oct 2014, at 7:56 pm, Sahil Aggarwal sahilaggarw...@gmail.com wrote:

> Sorry, I didn't get your point, so I am restating the problem:
>
> Two-node cluster: Node A and Node B. Service X is running on Node A;
> Node B is the DC. We are using the corosync stack with Pacemaker.
> Failure timeout is 10 sec. Target-role is Started.
>
> Events happen like this:
>
> • Node A sends an event to Node B: Service X is down
> • Node B prints "Ignoring expired failure" for Service X
> • After this, Service X is never restarted by the cluster.
>
> Now the questions are:
>
> • Why is Node B (DC) ignoring the expired failure?

Because you told it to.

> • Even if the DC ignored it this time, Service X is still down, so Node A
>   should monitor the service and send the failure status to Node B again,
>   and at that time Node B should restart the service. Why is this not
>   happening?
>
> For FAILURE TIMEOUT, my understanding is:
>
> • Node A sends the failure event of Service X to Node B (DC) at time T,
>   the failcount of Service X on Node A reaches infinity, and Node A is
>   the only node where Service X can run
> • Node B (DC) will, after T + failure-timeout seconds, set the failcount
>   of Service X on Node A to zero and restart Service X on Node A.
>
> According to you, Node B will ignore the Service X failure on Node A
> after failure-timeout seconds. From which point does Node B start
> counting those seconds?
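The question of when the counting starts can be illustrated with a toy model. This is a sketch under stated assumptions, not Pacemaker's code: it assumes the failure is stamped by the clock of the node that ran the operation and that the DC compares that stamp against its own clock. The names failure_expired, FAILURE_TIMEOUT_S and CLOCK_SKEW_S are hypothetical. With the 10 s failure timeout and the 10 min clock difference reported in this thread, such a model would treat a fresh failure as already expired.

```python
# Toy model only; NOT Pacemaker source. All names are hypothetical.
FAILURE_TIMEOUT_S = 10      # failure timeout reported in the thread
CLOCK_SKEW_S = 10 * 60      # "time difference of 10 min" between the nodes


def failure_expired(failure_stamp: float, dc_now: float, timeout_s: float) -> bool:
    """Return True if the failure looks older than timeout_s to the DC.

    Assumption: failure_stamp comes from the clock of the node that ran the
    operation, while dc_now comes from the DC's own clock.
    """
    if timeout_s <= 0:      # model a disabled timeout: failures never expire
        return False
    return (dc_now - failure_stamp) > timeout_s


if __name__ == "__main__":
    true_time = 1_000_000.0                    # arbitrary instant, in seconds
    stamp_on_a = true_time                     # Node A stamps the failure
    dc_in_sync = true_time + 1                 # DC evaluates 1 s later
    dc_ahead = true_time + 1 + CLOCK_SKEW_S    # same instant, DC clock 10 min ahead

    print("clocks in sync:", failure_expired(stamp_on_a, dc_in_sync, FAILURE_TIMEOUT_S))
    print("DC clock ahead:", failure_expired(stamp_on_a, dc_ahead, FAILURE_TIMEOUT_S))
```

If the real behaviour resembles this model, either a longer failure timeout or synchronised clocks would change the outcome; the thread itself only confirms that 10 s was considered short.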
Re: [Pacemaker] Linux HA setup for CentOS 6.5
Thanks!

OK, so I've followed the DRBD steps in the guide all the way to cib commit fs in Section 7.4, right before Testing Migration. However, when I run crm_mon, I get the following failed actions.

    Last updated: Thu Oct 16 17:28:34 2014
    Last change: Thu Oct 16 17:26:04 2014 via crm_shadow on node01
    Stack: cman
    Current DC: node02 - partition with quorum
    Version: 1.1.10-14.el6_5.3-368c726
    2 Nodes configured
    5 Resources configured

    Online: [ node01 node02 ]

    ClusterIP (ocf::heartbeat:IPaddr2): Started node02
    Master/Slave Set: WebDataClone [WebData]
        Masters: [ node02 ]
        Slaves: [ node01 ]
    WebFS (ocf::heartbeat:Filesystem): Started node02

    Failed actions:
        WebSite_start_0 on node02 'unknown error' (1): call=278, status=Timed Out,
        last-rc-change='Thu Oct 16 17:26:28 2014', queued=2ms, exec=0ms
        WebSite_start_0 on node01 'unknown error' (1): call=203, status=Timed Out,
        last-rc-change='Thu Oct 16 17:26:09 2014', queued=2ms, exec=0ms

It seems the Apache WebSite resource isn't starting up. Apache was working just fine before I configured DRBD. What did I do wrong?

On Thu, Oct 16, 2014 at 1:49 PM, Digimer li...@alteeve.ca wrote:

> On 16/10/14 12:14 AM, Sihan Goi wrote:
>
>> After following the guide, I've successfully managed to get Apache
>> server up and running in the cluster as an active/passive setup, but
>> with some differences. My cluster stack is stated as being cman while
>> the guide's is openais. Not sure if that's a problem. Also, some
>> commands in the guide don't seem to work.
>
> If you can provide examples of what issues you're having, I will be
> happy to try and help.
>
>> I'm moving on to DRBD installation now, but when I do a
>> yum install drbd-pacemaker drbd-udev, these packages are not available.
>> After some googling, it seems that drbd83-utils/kmod-drbd83 or
>> drbd84-utils/kmod-drbd84 is available via another repo. Does this work
>> with the guide?
>
> You need to get them from a 3rd party repo (or install from source). I
> personally still use 8.3.16 (consistency during Anvil! generations), but
> I know that 8.4 is fine on EL6 (and EL7, to address an earlier comment).
>
> I have my own repos with these packages, but you would likely be better
> served using the ELRepo ones.
>
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Installing_DRBD
>
> The only real difference is to s/83/84/:
>
> + yum install drbd84-utils kmod-drbd84
> - yum install drbd83-utils kmod-drbd83
>
> If you run into any troubles, please share details and I am sure we'll
> get you sorted out in no time.
>
> Cheers
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?

--
- Goi Sihan
gois...@gmail.com
[Pacemaker] Stopping/restarting pacemaker without stopping resources?
The primary goal is to transparently update software in the cluster. I just did an HA suite update using a simple RPM and observed that the RPM attempts to restart the stack (rcopenais try-restart). So:

a) if it had worked, resources would have been migrated away from this node, which means an interruption;
b) it did not work: apparently the new versions of the installed utilities were incompatible with the running pacemaker, so the request to shut down crm failed and openais hung forever.

The usual workflow with the cluster products I worked with before was: stop cluster processes without stopping resources; update; restart cluster processes. They would detect that resources are started and return to the same state as before stopping.

Is something like this possible with pacemaker?

TIA

-andrei
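The workflow described above (stop cluster processes, update, restart) can be sketched as an ordered list of commands. This is an assumption-laden sketch, not an answer given in this thread: it assumes Pacemaker's maintenance-mode cluster property, set through crm_attribute, leaves resources running while the stack is stopped, which should be checked against the documentation for the installed version. The rcopenais stop/start calls mirror the init script named in the post, the package file name is a placeholder, and the helper names maintenance_mode_cmd and upgrade_plan are hypothetical. The code only builds and prints the commands; it does not run them.

```python
# Sketch only: builds command lines, it does not execute anything.
import shlex
from typing import List


def maintenance_mode_cmd(enabled: bool) -> List[str]:
    """crm_attribute call that sets the maintenance-mode cluster property."""
    value = "true" if enabled else "false"
    return ["crm_attribute", "--type", "crm_config",
            "--name", "maintenance-mode", "--update", value]


def upgrade_plan(stop_stack: List[str], update: List[str],
                 start_stack: List[str]) -> List[List[str]]:
    """Order the steps: unmanage, stop stack, update, start stack, re-manage."""
    return [
        maintenance_mode_cmd(True),
        stop_stack,
        update,
        start_stack,
        maintenance_mode_cmd(False),
    ]


if __name__ == "__main__":
    plan = upgrade_plan(
        stop_stack=["rcopenais", "stop"],            # init script named in the post
        update=["rpm", "-Uvh", "PLACEHOLDER.rpm"],   # placeholder package name
        start_stack=["rcopenais", "start"],
    )
    for step in plan:
        print(shlex.join(step))
```

Keeping the steps as data makes it easy to review the order before running anything by hand on a test cluster.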
[Pacemaker] meta failure-timeout: crashed resource is assumed to be Started?
Dear all,

I configured meta failure-timeout=60sec on all of my resources. For the sake of simplicity, assume I have a group of two resources FIRST and SECOND (where SECOND is started after FIRST, surprise!).

If now FIRST crashes, I see a failure, as expected. I also see that SECOND is stopped, as expected.

Sadly, SECOND needs more than 60 seconds to stop. Thus, it can happen that the failure-timeout for FIRST is reached, and its failure is cleaned. This also is expected.

The problem now is that after the 60sec timeout pacemaker assumes that FIRST is in the Started state. There is no indication about that in the log files, and the last monitor operation which ran just a few seconds before also indicated that FIRST is actually not running.

As a consequence of the bug, pacemaker tries to re-start SECOND on the same system, which fails to start (as it depends on FIRST, which actually is not running). Only then the resources are started on the other system.

So, my question is: Why does pacemaker assume that a previously failed resource is Started when the meta failure-timeout is triggered? Why is the monitor operation not invoked to determine the correct state?

The corresponding lines of the log file, about a minute after FIRST crashed and the stop operation for SECOND was triggered:

Oct 16 16:27:20 [2100] HOSTNAME [...] (monitor operation indicating that FIRST is not running)
[...]
Oct 16 16:27:23 [2104] HOSTNAME lrmd: info: log_finished: finished - rsc:SECOND action:stop call_id:123 pid:29314 exit-code:0 exec-time:62827ms queue-time:0ms
Oct 16 16:27:23 [2107] HOSTNAME crmd: notice: process_lrm_event: LRM operation SECOND_stop_0 (call=123, rc=0, cib-update=225, confirmed=true) ok
Oct 16 16:27:23 [2107] HOSTNAME crmd: info: match_graph_event: Action SECOND_stop_0 (74) confirmed on HOSTNAME (rc=0)
Oct 16 16:27:23 [2107] HOSTNAME crmd: notice: run_graph: Transition 40 (Complete=5, Pending=0, Fired=0, Skipped=31, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-2937.bz2): Stopped
Oct 16 16:27:23 [2107] HOSTNAME crmd: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 16 16:27:23 [2100] HOSTNAME cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/225, version=0.1450.89)
Oct 16 16:27:23 [2100] HOSTNAME cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/226, version=0.1450.89)
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: determine_online_status_fencing: Node HOSTNAME is active
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: determine_online_status: Node HOSTNAME is online
[...]
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Re-initiated expired calculated failure FIRST_last_failure_0 (rc=7, magic=0:7;68:31:0:28c68203-6990-48fd-96cc-09f86e2b21f9) on HOSTNAME
[...]
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: group_print: Resource Group: GROUP
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: native_print: FIRST (ocf::heartbeat:xxx): Started HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: native_print: SECOND (ocf::heartbeat:yyy): Stopped

Thank you,
Carsten

--
andrena objects ag
Frankfurt office
Clemensstr. 8
60487 Frankfurt
Tel: +49 (0) 69 977 860 38
Fax: +49 (0) 69 977 860 39
http://www.andrena.de

Executive Board: Hagen Buchwald, Matthias Grund, Dr. Dieter Kuhn
Chairman of the Supervisory Board: Rolf Hetzelberger
Registered office: Karlsruhe
Mannheim District Court, HRB 109694
VAT ID: DE174314824

Please also note our upcoming events: http://www.andrena.de/events
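The timing described in this report can be checked with simple arithmetic. The following is a simplified model, not Pacemaker's scheduler: it assumes the failure-timeout clock starts when FIRST fails and that the stop of SECOND begins shortly afterwards (delay_before_stop_s, defaulting to zero, is a hypothetical parameter). The 60 s timeout and the 62827 ms stop time are taken from the post and its lrmd log line; the function names are made up for illustration.

```python
# Timeline arithmetic only; a simplified model, not Pacemaker's scheduler.
FAILURE_TIMEOUT_S = 60.0      # meta failure-timeout=60sec from the post
SECOND_STOP_EXEC_S = 62.827   # exec-time:62827ms from the lrmd log line


def expired_before_stop_done(stop_exec_s: float, timeout_s: float,
                             delay_before_stop_s: float = 0.0) -> bool:
    """True if the failure ages past timeout_s before the stop completes.

    Assumption: the failure-timeout clock starts when FIRST fails, and the
    stop of SECOND begins delay_before_stop_s seconds after that.
    """
    return delay_before_stop_s + stop_exec_s > timeout_s


def overshoot_s(stop_exec_s: float, timeout_s: float,
                delay_before_stop_s: float = 0.0) -> float:
    """Seconds by which the stop outlasts the failure timeout (may be negative)."""
    return round(delay_before_stop_s + stop_exec_s - timeout_s, 3)


if __name__ == "__main__":
    print("failure expired before stop finished:",
          expired_before_stop_done(SECOND_STOP_EXEC_S, FAILURE_TIMEOUT_S))
    print("stop outlasted the timeout by (s):",
          overshoot_s(SECOND_STOP_EXEC_S, FAILURE_TIMEOUT_S))
```

Under this model, any failure-timeout shorter than the slowest stop in the group can expire in the middle of recovery, which matches the sequence reported in the log excerpt.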
Re: [Pacemaker] Raid RA Changes to Enable ms configuration -- need some assistance plz.
Andrew Beekhof andrew@... writes:

> Yes. If you want the cluster to start things in a particular order,
> then you need to specify it.

Andrew, but my issue isn't getting the resources to start in a specific order. My issue is that I can't get the slave resource promoted when the previous master goes offline.