Re: [ClusterLabs] Pacemaker fatal shutdown
FSA action flags 0x0020 (A_INTEGRATE_TIMER_STOP) for controller set by do_state_transition:559
63835:Jul 17 14:16:55.092 FILE-2 pacemaker-controld [15962] (pcmk__set_flags_as) debug: FSA action flags 0x0080 (A_FINALIZE_TIMER_STOP) for controller set by do_state_transition:565
63836:Jul 17 14:16:55.092 FILE-2 pacemaker-controld [15962] (pcmk__clear_flags_as) debug: FSA action flags 0x0200 (an_action) for controller cleared by do_fsa_action:108
63837:Jul 17 14:16:55.092 FILE-2 pacemaker-controld [15962] (pcmk__clear_flags_as) debug: FSA action flags 0x0020 (an_action) for controller cleared by do_fsa_action:108
63838:Jul 17 14:16:55.092 FILE-2 pacemaker-controld [15962] (pcmk__clear_flags_as) debug: FSA action flags 0x0080 (an_action) for controller cleared by do_fsa_action:108
63863:Jul 17 14:17:25.073 FILE-2 pacemaker-controld [15962] (throttle_cib_load) debug: cib load: 0.000667 (2 ticks in 30s)
63864:Jul 17 14:17:25.073 FILE-2 pacemaker-controld [15962] (throttle_mode) debug: Current load is 0.65 across 10 core(s)
63865:Jul 17 14:17:55.073 FILE-2 pacemaker-controld [15962] (throttle_cib_load) debug: cib load: 0.000333 (1 ticks in 30s)
63866:Jul 17 14:17:55.073 FILE-2 pacemaker-controld [15962] (throttle_mode) debug: Current load is 0.85 across 10 core(s)
63868:Jul 17 14:18:20.085 FILE-2 pacemaker-fenced[15958] (process_remote_stonith_exec) debug: Finalizing action 'reboot' targeting FILE-2 on behalf of pacemaker-controld.19415@FILE-6: OK | rc=0 id=4e523b34
63869:Jul 17 14:18:20.085 FILE-2 pacemaker-fenced[15958] (remote_op_done) notice: Operation 'reboot' targeting FILE-2 by FILE-4 for pacemaker-controld.19415@FILE-6: OK | id=4e523b34
63872:Jul 17 14:18:20.089 FILE-2 pacemaker-controld [15962] (exec_alert_list) info: Sending fencing alert via pf-ha-alert to (null)
63875:Jul 17 14:18:20.089 FILE-2 pacemaker-controld [15962] (tengine_stonith_notify) crit: We were allegedly just fenced by FILE-4 for FILE-6!
63876:Jul 17 14:18:20.089 FILE-2 pacemaker-controld [15962] (crm_xml_cleanup) info: Cleaning up memory from libxml2
63877:Jul 17 14:18:20.089 FILE-2 pacemaker-controld [15962] (crm_exit) info: Exiting pacemaker-controld | with status 100
63900:Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_child_exit) warning: Shutting cluster down because pacemaker-controld[15962] had fatal failure
63902:Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_shutdown_worker) debug: pacemaker-controld confirmed stopped
63956:Jul 17 14:18:20.101 FILE-2 pacemaker-fenced[15958] (process_remote_stonith_exec) debug: Finalizing action 'reboot' targeting FILE-1 on behalf of pacemaker-controld.19415@FILE-6: OK | rc=0 id=446afc42
63957:Jul 17 14:18:20.101 FILE-2 pacemaker-fenced[15958] (remote_op_done) notice: Operation 'reboot' targeting FILE-1 by FILE-5 for pacemaker-controld.19415@FILE-6: OK | id=446afc42

Thanks
Priyanka

On Thu, Jul 20, 2023 at 12:07 AM Ken Gaillot wrote:
> On Wed, 2023-07-19 at 23:49 +0530, Priyanka Balotra wrote:
> > Hi All,
> > I am using SLES 15 SP4. One of the nodes of the cluster was brought
> > down and booted up again after some time. The Pacemaker service came
> > up first, but later it faced a fatal shutdown. Due to that, the crm
> > service is down.
> >
> > The logs from /var/log/pacemaker/pacemaker.log are as follows:
> >
> > Jul 17 14:18:20.093 FILE-2 pacemakerd [15956]
> > (pcmk_child_exit) warning: Shutting cluster down because
> > pacemaker-controld[15962] had fatal failure
>
> The interesting messages will be before this. The ones with
> "pacemaker-controld" will be the most relevant, at least initially.
>
> > Jul 17 14:18:20.093 FILE-2 pacemakerd [15956]
> > (pcmk_shutdown_worker) notice: Shutting down Pacemaker
> > Jul 17 14:18:20.093 FILE-2 pacemakerd [15956]
> > (pcmk_shutdown_worker) debug: pacemaker-controld confirmed stopped
> > Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (stop_child)
> > notice: Stopping pacemaker-schedulerd | sent signal 15 to process 15961
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> > (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler)
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> > (qb_ipcs_us_withdraw) info: withdrawing server sockets
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> > (qb_ipcs_unref) debug: qb_ipcs_unref() - destroying
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> > (crm_xml_cleanup) info: Cleaning up memory from libxml2
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_exit)
> > info: Exiting pacemak
[ClusterLabs] Pacemaker fatal shutdown
Hi All,

I am using SLES 15 SP4. One of the nodes of the cluster was brought down and booted up again after some time. The Pacemaker service came up first, but later it faced a fatal shutdown. Due to that, the crm service is down.

The logs from /var/log/pacemaker/pacemaker.log are as follows:

Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_child_exit) warning: Shutting cluster down because pacemaker-controld[15962] had fatal failure
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_shutdown_worker) notice: Shutting down Pacemaker
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_shutdown_worker) debug: pacemaker-controld confirmed stopped
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (stop_child) notice: Stopping pacemaker-schedulerd | sent signal 15 to process 15961
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler)
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (qb_ipcs_us_withdraw) info: withdrawing server sockets
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (qb_ipcs_unref) debug: qb_ipcs_unref() - destroying
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_xml_cleanup) info: Cleaning up memory from libxml2
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_exit) info: Exiting pacemaker-schedulerd | with status 0
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957] (qb_ipcs_event_sendv) debug: new_event_notification (/dev/shm/qb-15957-15962-12-RDPw6O/qb): Broken pipe (32)
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957] (cib_notify_send_one) warning: Could not notify client crmd: Broken pipe | id=e29d175e-7e91-4b6a-bffb-fabfdd7a33bf
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957] (cib_process_request) info: Completed cib_delete operation for section //node_state[@uname='FILE-2']/*: OK (rc=0, origin=FILE-6/crmd/74, version=0.24.75)
Jul 17 14:18:20.093 FILE-2 pacemaker-fenced[15958] (xml_patch_version_check) debug: Can apply patch 0.24.75 to 0.24.74
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_child_exit) info: pacemaker-schedulerd[15961] exited with status 0 (OK)
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=FILE-6/crmd/75, version=0.24.75)
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_shutdown_worker) debug: pacemaker-schedulerd confirmed stopped
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (stop_child) notice: Stopping pacemaker-attrd | sent signal 15 to process 15960
Jul 17 14:18:20.093 FILE-2 pacemaker-attrd [15960] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler)

Could you please help me understand the issue here.

Regards
Priyanka

___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
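Ken's suggestion in the reply above — look at the pacemaker-controld messages leading up to the fatal exit — can be done with a plain grep over the log. A minimal sketch against a hypothetical trimmed excerpt (on a real node the file is /var/log/pacemaker/pacemaker.log, and the temp-file dance below is only there to keep the example self-contained):

```shell
# Build a hypothetical log excerpt; replace with the real pacemaker.log path.
log=$(mktemp)
cat > "$log" <<'EOF'
Jul 17 14:18:20.089 FILE-2 pacemaker-controld [15962] (tengine_stonith_notify) crit: We were allegedly just fenced by FILE-4 for FILE-6!
Jul 17 14:18:20.089 FILE-2 pacemaker-controld [15962] (crm_exit) info: Exiting pacemaker-controld | with status 100
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_child_exit) warning: Shutting cluster down because pacemaker-controld[15962] had fatal failure
EOF
# Pull only the controld messages; the last ones before crm_exit explain the exit.
matches=$(grep 'pacemaker-controld ' "$log")
echo "$matches"
rm -f "$log"
```

Here the "allegedly just fenced" crit message immediately before the exit-with-status-100 line is what points at the cause: another node fenced FILE-2 while its controller was still running.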
Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution
I am using SLES 15 SP4. Is the no-quorum-policy still supported?

Thanks
Priyanka

On Wed, 28 Jun 2023 at 12:46 AM, Ken Gaillot wrote:
> On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:
> > In this case stonith has been configured as a resource,
> > primitive stonith-sbd stonith:external/sbd
> >
> > For it to function properly, the resource needs to be up, which
> > is only possible if the system is quorate.
>
> Pacemaker can use a fence device even if its resource is not active.
> The resource being active just allows Pacemaker to monitor the device
> regularly.
>
> > Hence our requirement is to make the system quorate even if one node
> > of the cluster is up.
> > Stonith will then take care of any split-brain scenarios.
>
> In that case it sounds like no-quorum-policy=ignore is actually what
> you want.
>
> > Thanks
> > Priyanka
> >
> > On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger wrote:
> > >
> > > On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov <arvidj...@gmail.com> wrote:
> > > > On 27.06.2023 07:21, Priyanka Balotra wrote:
> > > > > Hi Andrei,
> > > > > After this state the system went through some more fencings and
> > > > > we saw the following state:
> > > > >
> > > > > :~ # crm status
> > > > > Cluster Summary:
> > > > >   * Stack: corosync
> > > > >   * Current DC: FILE-2 (version 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
> > > >
> > > > It says "partition with quorum" so what exactly is the problem?
> > >
> > > I guess the problem is that resources aren't being recovered on
> > > the nodes in the quorate partition.
> > > Reason for that is probably that - as Ken was already suggesting -
> > > fencing isn't working properly or fencing-devices used are simply
> > > inappropriate for the purpose (e.g. onboard IPMI).
> > > The fact that a node is rebooting isn't enough. The node that
> > > initiated fencing has to know that it did actually work. But we're
> > > just guessing here. Logs should show what is actually going on.
> > >
> > > Klaus
> > >
> > > > >   * Last updated: Mon Jun 26 12:44:15 2023
> > > > >   * Last change: Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
> > > > >   * 4 nodes configured
> > > > >   * 11 resource instances configured
> > > > >
> > > > > Node List:
> > > > >   * Node FILE-1: UNCLEAN (offline)
> > > > >   * Node FILE-4: UNCLEAN (offline)
> > > > >   * Online: [ FILE-2 ]
> > > > >   * Online: [ FILE-3 ]
> > > > >
> > > > > At this stage FILE-1 and FILE-4 were continuously getting fenced
> > > > > (we have device-based stonith configured but the resource was not up).
> > > > > Two nodes were online and two were offline, so quorum wasn't
> > > > > attained again.
> > > > > 1) For such a scenario we need help to be able to have one cluster live.
> > > > > 2) And in cases where only one node of the cluster is up and
> > > > > others are down, we need the resources and cluster to be up.
> > > > >
> > > > > Thanks
> > > > > Priyanka
> > > > >
> > > > > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <arvidj...@gmail.com> wrote:
> > > > >
> > > > >> On 26.06.2023 21:14, Priyanka Balotra wrote:
> > > > >>> Hi All,
> > > > >>> We are seeing an issue where we replaced no-quorum-policy=ignore
> > > > >>> with other options in corosync.conf in order to simulate the
> > > > >>> same behaviour:
> > > > >>>
> > > > >>> wait_for_all: 0
> > > > >>> last_man_standing: 1
> > > > >>> last_man_standing_window: 2
> > > > >>>
> > > > >>> There was another property (aut
Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution
In this case stonith has been configured as a resource:

primitive stonith-sbd stonith:external/sbd

For it to function properly, the resource needs to be up, which is only possible if the system is quorate. Hence our requirement is to make the system quorate even if only one node of the cluster is up. Stonith will then take care of any split-brain scenarios.

Thanks
Priyanka

On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger wrote:
>
> On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov wrote:
>
>> On 27.06.2023 07:21, Priyanka Balotra wrote:
>> > Hi Andrei,
>> > After this state the system went through some more fencings and we saw the
>> > following state:
>> >
>> > :~ # crm status
>> > Cluster Summary:
>> >   * Stack: corosync
>> >   * Current DC: FILE-2 (version 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
>>
>> It says "partition with quorum" so what exactly is the problem?
>>
> I guess the problem is that resources aren't being recovered on
> the nodes in the quorate partition.
> Reason for that is probably that - as Ken was already suggesting - fencing isn't
> working properly or fencing-devices used are simply inappropriate for the
> purpose (e.g. onboard IPMI).
> The fact that a node is rebooting isn't enough. The node that initiated fencing
> has to know that it did actually work. But we're just guessing here. Logs should
> show what is actually going on.
>
> Klaus
>
>> >   * Last updated: Mon Jun 26 12:44:15 2023
>> >   * Last change: Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
>> >   * 4 nodes configured
>> >   * 11 resource instances configured
>> >
>> > Node List:
>> >   * Node FILE-1: UNCLEAN (offline)
>> >   * Node FILE-4: UNCLEAN (offline)
>> >   * Online: [ FILE-2 ]
>> >   * Online: [ FILE-3 ]
>> >
>> > At this stage FILE-1 and FILE-4 were continuously getting fenced (we have
>> > device-based stonith configured but the resource was not up).
>> > Two nodes were online and two were offline, so quorum wasn't attained again.
>> > 1) For such a scenario we need help to be able to have one cluster live.
>> > 2) And in cases where only one node of the cluster is up and others are
>> > down, we need the resources and cluster to be up.
>> >
>> > Thanks
>> > Priyanka
>> >
>> > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov wrote:
>> >
>> >> On 26.06.2023 21:14, Priyanka Balotra wrote:
>> >>> Hi All,
>> >>> We are seeing an issue where we replaced no-quorum-policy=ignore with
>> >>> other options in corosync.conf in order to simulate the same behaviour:
>> >>>
>> >>> wait_for_all: 0
>> >>> last_man_standing: 1
>> >>> last_man_standing_window: 2
>> >>>
>> >>> There was another property (auto-tie-breaker) tried but couldn't
>> >>> configure it as crm did not recognise this property.
>> >>>
>> >>> But even after using these options, we are seeing that the system is
>> >>> not quorate if at least half of the nodes are not up.
>> >>>
>> >>> Some properties from crm config are as follows:
>> >>>
>> >>> primitive stonith-sbd stonith:external/sbd \
>> >>>     params pcmk_delay_base=5s
>> >>> .
>> >>> .
>> >>> property cib-bootstrap-options: \
>> >>>     have-watchdog=true \
>> >>>     dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
>> >>>     cluster-infrastructure=corosync \
>> >>>     cluster-name=FILE \
>> >>>     stonith-enabled=true \
>> >>>     stonith-timeout=172 \
>> >>>     stonith-action=reboot \
>> >>>     stop-all-resources=false \
>> >>>     no-quorum-po
Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution
Hi Andrei,

After this state the system went through some more fencings and we saw the following state:

:~ # crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: FILE-2 (version 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
  * Last updated: Mon Jun 26 12:44:15 2023
  * Last change: Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
  * 4 nodes configured
  * 11 resource instances configured

Node List:
  * Node FILE-1: UNCLEAN (offline)
  * Node FILE-4: UNCLEAN (offline)
  * Online: [ FILE-2 ]
  * Online: [ FILE-3 ]

At this stage FILE-1 and FILE-4 were continuously getting fenced (we have device-based stonith configured but the resource was not up). Two nodes were online and two were offline, so quorum wasn't attained again.
1) For such a scenario we need help to be able to have one cluster live.
2) And in cases where only one node of the cluster is up and others are down, we need the resources and cluster to be up.

Thanks
Priyanka

On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov wrote:
> On 26.06.2023 21:14, Priyanka Balotra wrote:
> > Hi All,
> > We are seeing an issue where we replaced no-quorum-policy=ignore with
> > other options in corosync.conf in order to simulate the same behaviour:
> >
> > wait_for_all: 0
> > last_man_standing: 1
> > last_man_standing_window: 2
> >
> > There was another property (auto-tie-breaker) tried but couldn't
> > configure it as crm did not recognise this property.
> >
> > But even after using these options, we are seeing that the system is
> > not quorate if at least half of the nodes are not up.
> >
> > Some properties from crm config are as follows:
> >
> > primitive stonith-sbd stonith:external/sbd \
> >     params pcmk_delay_base=5s
> > .
> > .
> > property cib-bootstrap-options: \
> >     have-watchdog=true \
> >     dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
> >     cluster-infrastructure=corosync \
> >     cluster-name=FILE \
> >     stonith-enabled=true \
> >     stonith-timeout=172 \
> >     stonith-action=reboot \
> >     stop-all-resources=false \
> >     no-quorum-policy=ignore
> > rsc_defaults build-resource-defaults: \
> >     resource-stickiness=1
> > rsc_defaults rsc-options: \
> >     resource-stickiness=100 \
> >     migration-threshold=3 \
> >     failure-timeout=1m \
> >     cluster-recheck-interval=10min
> > op_defaults op-options: \
> >     timeout=600 \
> >     record-pending=true
> >
> > On a 4-node setup when the whole cluster is brought up together we see
> > error logs like:
> >
> > 2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Fencing and resource management disabled due to lack of quorum
> > 2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Ignoring malformed node_state entry without uname
> > 2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-2 is unclean!
> > 2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-3 is unclean!
> > 2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-4 is unclean!
>
> According to this output FILE-1 lost connection to three other nodes, in
> which case it cannot be quorate.
>
> > Kindly help correct the configuration to make the system function
> > normally with all resources up, even if there is just one node up.
> >
> > Please let me know if any more info is needed.
> >
> > Thanks
> > Priyanka
[ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution
Hi All,

We are seeing an issue where we replaced no-quorum-policy=ignore with other options in corosync.conf in order to simulate the same behaviour:

wait_for_all: 0
last_man_standing: 1
last_man_standing_window: 2

There was another property (auto-tie-breaker) tried but we couldn't configure it as crm did not recognise this property.

But even after using these options, we are seeing that the system is not quorate if at least half of the nodes are not up.

Some properties from crm config are as follows:

primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_base=5s
.
.
property cib-bootstrap-options: \
    have-watchdog=true \
    dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
    cluster-infrastructure=corosync \
    cluster-name=FILE \
    stonith-enabled=true \
    stonith-timeout=172 \
    stonith-action=reboot \
    stop-all-resources=false \
    no-quorum-policy=ignore
rsc_defaults build-resource-defaults: \
    resource-stickiness=1
rsc_defaults rsc-options: \
    resource-stickiness=100 \
    migration-threshold=3 \
    failure-timeout=1m \
    cluster-recheck-interval=10min
op_defaults op-options: \
    timeout=600 \
    record-pending=true

On a 4-node setup when the whole cluster is brought up together we see error logs like:

2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Fencing and resource management disabled due to lack of quorum
2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Ignoring malformed node_state entry without uname
2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-2 is unclean!
2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-3 is unclean!
2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-4 is unclean!

Kindly help correct the configuration to make the system function normally with all resources up, even if there is just one node up.

Please let me know if any more info is needed.

Thanks
Priyanka
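[Editor's note on the options discussed in this thread: wait_for_all, last_man_standing, last_man_standing_window, and auto_tie_breaker are all corosync votequorum settings, so they belong in the quorum section of /etc/corosync/corosync.conf rather than in the Pacemaker configuration — which is likely why crm did not recognise "auto-tie-breaker" (the votequorum option name uses underscores). A hypothetical fragment for a four-node cluster; the values here are illustrative only, not a recommendation:]

```
# /etc/corosync/corosync.conf (fragment)
quorum {
    provider: corosync_votequorum
    expected_votes: 4

    # Do not require all nodes to be seen once before granting quorum.
    wait_for_all: 0

    # Recalculate quorum as nodes leave, down to a smaller partition.
    last_man_standing: 1
    last_man_standing_window: 20000    # milliseconds

    # Break 50/50 ties; "lowest" favours the partition containing the
    # lowest node ID.
    auto_tie_breaker: 1
    auto_tie_breaker_node: lowest
}
```

Changes to this file require a corosync restart or `corosync-cfgtool -R` style reload depending on the option; consult the votequorum(5) man page, since several of these options interact (e.g. last_man_standing with auto_tie_breaker).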
Re: [ClusterLabs] crm node stays online after issuing node standby command
+Ayush

Thanks

On Wed, 15 Mar 2023 at 8:17 PM, Ken Gaillot wrote:
> Hi,
>
> If you can reproduce the problem, the following info would be helpful:
>
> * "cibadmin -Q | grep standby": to show whether it was successfully
> recorded in the CIB (will show info for any node with standby, but the
> XML ID likely has the node name or ID in it)
>
> * "attrd_updater -Q -n standby -N FILE-2": to show whether the
> attribute manager has the right value in memory for the affected node
>
> On Wed, 2023-03-15 at 15:51 +0530, Ayush Siddarath wrote:
> > Hi All,
> >
> > We are seeing an issue as part of crm maintenance operations. As part
> > of the upgrade process, the crm nodes are put into standby mode.
> > But it's observed that one of the nodes fails to go into standby mode
> > despite the "crm node standby" returning success.
> >
> > Commands issued to put nodes into maintenance:
> >
> > > [2023-03-15 06:07:08 +] [468] [INFO] changed: [FILE-1] => {"changed": true, "cmd": "/usr/sbin/crm node standby FILE-1", "delta": "0:00:00.442615", "end": "2023-03-15 06:07:08.150375", "rc": 0, "start": "2023-03-15 06:07:07.707760", "stderr": "", "stderr_lines": [], "stdout": "\u001b[32mINFO\u001b[0m: standby node FILE-1", "stdout_lines": ["\u001b[32mINFO\u001b[0m: standby node FILE-1"]}
> > > .
> > > [2023-03-15 06:07:08 +] [468] [INFO] changed: [FILE-2] => {"changed": true, "cmd": "/usr/sbin/crm node standby FILE-2", "delta": "0:00:00.459407", "end": "2023-03-15 06:07:08.223749", "rc": 0, "start": "2023-03-15 06:07:07.764342", "stderr": "", "stderr_lines": [], "stdout": "\u001b[32mINFO\u001b[0m: standby node FILE-2", "stdout_lines": ["\u001b[32mINFO\u001b[0m: standby node FILE-2"]}
> >
> > Crm status o/p after above command execution:
> >
> > > FILE-2:/var/log # crm status
> > > Cluster Summary:
> > >   * Stack: corosync
> > >   * Current DC: FILE-1 (version 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
> > >   * Last updated: Wed Mar 15 08:32:27 2023
> > >   * Last change: Wed Mar 15 06:07:08 2023 by root via cibadmin on FILE-4
> > >   * 4 nodes configured
> > >   * 11 resource instances configured (5 DISABLED)
> > > Node List:
> > >   * Node FILE-1: standby (with active resources)
> > >   * Node FILE-3: standby (with active resources)
> > >   * Node FILE-4: standby (with active resources)
> > >   * Online: [ FILE-2 ]
> >
> > pacemaker logs indicate that FILE-2 received the commands to put it
> > into standby.
> >
> > > FILE-2:/var/log # grep standby /var/log/pacemaker/pacemaker.log
> > > Mar 15 06:07:08.098 FILE-2 pacemaker-based [8635] (cib_perform_op) info: ++ value="on"/>
> > > Mar 15 06:07:08.166 FILE-2 pacemaker-based [8635] (cib_perform_op) info: ++ value="on"/>
> > > Mar 15 06:07:08.170 FILE-2 pacemaker-based [8635] (cib_perform_op) info: ++ value="on"/>
> > > Mar 15 06:07:08.230 FILE-2 pacemaker-based [8635] (cib_perform_op) info: ++ value="on"/>
> >
> > Issue is quite intermittent and observed on other nodes as well.
> > We have seen a similar issue when we try to remove the node from
> > standby mode (using "crm node online"). One or more nodes fail to
> > be removed from standby mode.
> >
> > We suspect it could be an issue with parallel execution of the node
> > standby/online command for all nodes, but this issue wasn't observed
> > with the pacemaker packaged with SLES15 SP2.
> >
> > I'm attaching the pacemaker.log from FILE-2 for analysis. Let us know
> > if any additional information is required.
> >
> > OS: SLES15 SP4
> > Pacemaker version:
> > crmadmin --version
> > Pacemaker 2.1.2+20211124.ada5c3b36-150400.2.43
> >
> > Thanks,
> > Ayush
> --
> Ken Gaillot
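[Editor's note: Ken's first check above (`cibadmin -Q | grep standby`) can also be run offline against a saved CIB dump. A minimal sketch, assuming a hypothetical, heavily trimmed CIB fragment whose nvpair layout matches the `value="on"` lines visible in the pacemaker.log excerpt; in practice you would feed it the output of `cibadmin -Q`:]

```python
# Sketch: list nodes whose "standby" node attribute is recorded as "on"
# in the CIB. The XML below is a hypothetical, trimmed CIB fragment.
import xml.etree.ElementTree as ET

CIB_SNIPPET = """
<cib>
  <configuration>
    <nodes>
      <node id="1" uname="FILE-1">
        <instance_attributes id="nodes-1">
          <nvpair id="nodes-1-standby" name="standby" value="on"/>
        </instance_attributes>
      </node>
      <node id="2" uname="FILE-2"/>
    </nodes>
  </configuration>
</cib>
"""

def standby_nodes(cib_xml: str) -> list[str]:
    """Return unames of nodes whose standby attribute is 'on'."""
    root = ET.fromstring(cib_xml)
    result = []
    for node in root.iter("node"):
        for nv in node.iter("nvpair"):
            if nv.get("name") == "standby" and nv.get("value") == "on":
                result.append(node.get("uname"))
    return result

print(standby_nodes(CIB_SNIPPET))  # FILE-2 missing here would match the symptom
```

A node that `crm node standby` reported rc=0 for but that does not appear in this list would confirm the attribute never made it into the CIB, as opposed to the attribute manager holding a stale in-memory value (Ken's second check, `attrd_updater -Q`).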
Re: [ClusterLabs] pacemaker-fenced[11637]: warning: Can't create a sane reply
Hi Klaus,

The config is as follows: there are 2 nodes in the setup and some resources configured (stonith, IP, and systemd-service related). Sorry, I can share only high-level details for this.

Pacemaker version:
# rpm -qa pacemaker
pacemaker-2.0.3+20200511.2b248d828-1.10.x86_64
# rpm -qa corosync
corosync-2.4.5-10.14.6.1.x86_64
# rpm -qa crmsh
crmsh-4.2.0+git.1585096577.f3257c89-3.4.noarch

On Wed, Jun 22, 2022 at 5:45 PM Klaus Wenninger wrote:
> On Wed, Jun 22, 2022 at 1:46 PM Priyanka Balotra wrote:
> >
> > Hi All,
> >
> > We are seeing an issue where we performed a cluster shutdown followed by a
> > cluster boot operation. All the nodes joined the cluster except one (the
> > first node). Here are some pacemaker logs around that timestamp:
> >
> > 2022-06-19T07:02:08.690213+00:00 FILE-1 pacemaker-fenced[11637]: notice: Operation 'off' targeting FILE-1 on FILE-2 for pacemaker-controld.11523@FILE-2.0b09e949: OK
> > 2022-06-19T07:02:08.690604+00:00 FILE-1 pacemaker-fenced[11637]: error: stonith_construct_reply: Triggered assert at fenced_commands.c:2363 : request != NULL
> > 2022-06-19T07:02:08.690781+00:00 FILE-1 pacemaker-fenced[11637]: warning: Can't create a sane reply
> > 2022-06-19T07:02:08.691872+00:00 FILE-1 pacemaker-controld[11643]: crit: We were allegedly just fenced by FILE-2 for FILE-2!
> > 2022-06-19T07:02:08.693994+00:00 FILE-1 pacemakerd[11622]: warning: Shutting cluster down because pacemaker-controld[11643] had fatal failure
> > 2022-06-19T07:02:08.694209+00:00 FILE-1 pacemakerd[11622]: notice: Shutting down Pacemaker
> > 2022-06-19T07:02:08.694381+00:00 FILE-1 pacemakerd[11622]: notice: Stopping pacemaker-schedulerd
> >
> > Let us know if you need any more logs to find an RCA for this.
>
> A little bit more info about your configuration and the pacemaker version
> (cib?) used would definitely be helpful.
>
> Klaus
>
> > Thanks
> > Priyanka
[ClusterLabs] pacemaker-fenced[11637]: warning: Can't create a sane reply
Hi All,

We are seeing an issue where we performed a cluster shutdown followed by a cluster boot operation. All the nodes joined the cluster except one (the first node). Here are some pacemaker logs around that timestamp:

2022-06-19T07:02:08.690213+00:00 FILE-1 pacemaker-fenced[11637]: notice: Operation 'off' targeting FILE-1 on FILE-2 for pacemaker-controld.11523@FILE-2.0b09e949: OK
2022-06-19T07:02:08.690604+00:00 FILE-1 pacemaker-fenced[11637]: error: stonith_construct_reply: Triggered assert at fenced_commands.c:2363 : request != NULL
2022-06-19T07:02:08.690781+00:00 FILE-1 pacemaker-fenced[11637]: warning: Can't create a sane reply
2022-06-19T07:02:08.691872+00:00 FILE-1 pacemaker-controld[11643]: crit: We were allegedly just fenced by FILE-2 for FILE-2!
2022-06-19T07:02:08.693994+00:00 FILE-1 pacemakerd[11622]: warning: Shutting cluster down because pacemaker-controld[11643] had fatal failure
2022-06-19T07:02:08.694209+00:00 FILE-1 pacemakerd[11622]: notice: Shutting down Pacemaker
2022-06-19T07:02:08.694381+00:00 FILE-1 pacemakerd[11622]: notice: Stopping pacemaker-schedulerd

Let us know if you need any more logs to find an RCA for this.

Thanks
Priyanka
[ClusterLabs] crm status shows CURRENT DC as None
Hi Folks,

crm status shows the Current DC as NONE. Please check and let us know why the current DC is not pointing to any of the nodes.

CRM status:

Cluster Summary:
  * Stack: corosync
  * Current DC: NONE
  * Last updated: Tue Jun  7 06:14:59 2022
  * Last change: Tue Jun  7 05:29:40 2022 by root via cibadmin on FILE-2
  * 2 nodes configured
  * 9 resource instances configured

- How will the current DC be set to a node once we see it as NONE?
- Is there any impact on cluster functionality?

Thanks
Priyanka
[ClusterLabs] Resources too_active (active on all nodes of the cluster, instead of only 1 node)
Hi All, We have a scenario on SLES 12 SP3 cluster. The scenario is explained as follows in the order of events: - There is a 2-node cluster (FILE-1, FILE-2) - The cluster and the resources were up and running fine initially . - Then fencing request from pacemaker got issued on both nodes simultaneously Logs from 1st node: 2022-02-22T03:26:36.737075+00:00 FILE-1 corosync[12304]: [TOTEM ] Failed to receive the leave message. failed: 2 . . 2022-02-22T03:26:36.977888+00:00 FILE-1 pacemaker-fenced[12331]: notice: Requesting that FILE-1 perform 'off' action targeting FILE-2 Logs from 2nd node: 2022-02-22T03:26:36.738080+00:00 FILE-2 corosync[4989]: [TOTEM ] Failed to receive the leave message. failed: 1 . . Feb 22 03:26:38 FILE-2 pacemaker-fenced [5015] (call_remote_stonith) notice: Requesting that FILE-2 perform 'off' action targeting FILE-1 - When the nodes came up after unfencing, the DC got set after election - After that the resources which were expected to run on only one node became active on both (all) nodes of the cluster. 
2022-02-22T04:16:31.699186+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource stonith-sbd is active on 2 nodes (attempting recovery)
2022-02-22T04:16:31.699397+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
2022-02-22T04:16:31.699590+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource FILE_Filesystem is active on 2 nodes (attempting recovery)
2022-02-22T04:16:31.699731+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
2022-02-22T04:16:31.699878+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource IP_Floating is active on 2 nodes (attempting recovery)
2022-02-22T04:16:31.700027+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
2022-02-22T04:16:31.700203+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgresql is active on 2 nodes (attempting recovery)
2022-02-22T04:16:31.700354+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
2022-02-22T04:16:31.700501+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgrest is active on 2 nodes (attempting recovery)
2022-02-22T04:16:31.700648+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
2022-02-22T04:16:31.700792+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_esm_primary is active on 2 nodes (attempting recovery)
2022-02-22T04:16:31.700939+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
2022-02-22T04:16:31.701086+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Shared_Cluster_Backup is active on 2 nodes (attempting recovery)

Can you please help us understand whether this is indeed a split-brain scenario? Under what circumstances can such a scenario be observed? A recurrence could have a very serious impact in spite of stonith already being configured; hence the ask. In case this situation is reproduced, how can it be handled?

Note: We have stonith configured and it has been working fine so far. In this case too, the initial fencing happened via stonith.

Thanks in advance!
Priyanka

___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
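For context, a mutual-fencing race like the one above on a two-node cluster is commonly mitigated with corosync's two-node quorum settings plus a random fencing delay, so that both nodes cannot shoot each other at the same instant. The sketch below reuses the stonith-sbd resource name from the logs; the parameter values are illustrative, not taken from this cluster's actual configuration:

```
# corosync.conf -- votequorum settings for a 2-node cluster (illustrative)
quorum {
    provider: corosync_votequorum
    two_node: 1        # allow quorum to persist with one node
    wait_for_all: 1    # implied by two_node; require both nodes at first startup
}

# crmsh -- add a random delay before fencing to break fence races
# (pcmk_delay_max is a standard Pacemaker fencing-device parameter)
primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_max=15
```

With a random delay, one node usually fences first and the survivor keeps running, instead of both nodes issuing 'off' actions within the same second as seen in the logs above.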
[ClusterLabs] Recall: Resources too_active (active on all nodes of the cluster, instead of only 1 node)
Balotra, Priyanka would like to recall the message, "Resources too_active (active on all nodes of the cluster, instead of only 1 node)".
Re: [ClusterLabs] Corosync+Pacemaker error during failover
On 2015-10-08 21:20, Ken Gaillot wrote:
> On 10/08/2015 10:16 AM, priyanka wrote:
>> Hi,
>>
>> We are trying to build an HA setup for our servers using the DRBD + Corosync + Pacemaker stack. Attached are the configuration files for corosync/pacemaker and drbd.
>
> A few things I noticed:
>
> * Don't set become-primary-on in the DRBD configuration in a Pacemaker cluster; Pacemaker should handle all promotions to primary.
>
> * I'm no NFS expert, but why is res_exportfs_root cloned? Can both servers export it at the same time? I would expect it to be in the group before res_exportfs_export1.

We followed this configuration guide for our setup:
https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha_techguides/book_sleha_techguides.html
which suggests creating a clone of this resource. This resource does not export the actual data; the data is exported by the res_exportfs_export1 resource in our setup. I did try the previous failover scenario without cloning this resource, but the same error appeared.

> * Your constraints need some adjustment. Partly it depends on the answer to the previous question, but currently res_fs (via the group) is ordered after res_exportfs_root, and I don't see how that could work.

>> We are getting errors while testing this setup.
>>
>> 1. When we stop corosync on the Master machine, say server1 (lock), it is Stonith'ed. In this case the slave, server2 (sher), is promoted to master. But when server1 (lock) reboots, res_exportfs_export1 is started on both servers, and that resource goes into a failed state, followed by the servers going into an unclean state. Then server1 (lock) reboots, and server2 (sher) is master but in an unclean state. After server1 (lock) comes up, server2 (sher) is Stonith'ed and server1 (lock) is slave (the only online node). When server2 (sher) comes up, both servers are slaves and the resource group (rg_export) is stopped. Then server2 (sher) becomes Master, server1 (lock) is slave, and the resource group is started. At this point the configuration becomes stable.
>> PFA logs (syslog) of server2 (sher) from after it is promoted to master until it is first rebooted, when the exportfs resource goes into the failed state. Please let us know whether the configuration is appropriate. From the logs we could not figure out the exact reason for the resource failure. Your comments on this scenario will be very helpful.
>>
>> Thanks,
>> Priyanka

--
Regards,
Priyanka
MTech3 Sysad
IIT Powai
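Ken's constraint comment can be illustrated with a crmsh sketch. The resource names below (rg_export, ms_drbd_export, res_exportfs_root) follow the thread, but the clone name cl_exportfs_root, the constraint ids, and the exact structure are assumptions for illustration, not the poster's actual configuration:

```
# crmsh sketch (DRBD 8.x / Pacemaker 1.1 era syntax) -- illustrative only.
# Promote DRBD first, and run the export group only on the Master node:
colocation col_export_on_drbd inf: rg_export ms_drbd_export:Master
order o_drbd_before_export inf: ms_drbd_export:promote rg_export:start
# If res_exportfs_root remains a clone, the group must be ordered after
# the local clone instance rather than the other way around:
order o_root_before_export inf: cl_exportfs_root rg_export
```

The point of Ken's remark is the direction of the last ordering: if the group (containing res_fs) is instead ordered *before* res_exportfs_root, the root export can never come up in a sane sequence.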
Re: [ClusterLabs] Corosync+Pacemaker error during failover
On 2015-10-08 21:05, Digimer wrote:
> On 08/10/15 11:16 AM, priyanka wrote:
>> fencing resource-only;
>
> This needs to be 'fencing resource-and-stonith;'.

I did set the suggested parameter, but the error persists. Apparently the node which comes back after failover is not able to detect res_exportfs_root on the current master. Following is the log trace:

Jan 14 16:37:18 sher pengine[1383]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 14 16:37:18 sher pengine[1383]: warning: unpack_rsc_op: Processing failed op monitor for res_exportfs_root:0 on sher: not running (7)
Jan 14 16:37:18 sher pengine[1383]: warning: unpack_rsc_op: Processing failed op monitor for fence_lock on sher: unknown error (1)
Jan 14 16:37:18 sher pengine[1383]: error: native_create_actions: Resource res_exportfs_export1 (ocf::exportfs) is active on 2 nodes attempting recovery
Jan 14 16:37:18 sher pengine[1383]: warning: native_create_actions: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Jan 14 16:37:18 sher pengine[1383]: notice: LogActions: Start fence_sher#011(lock)
Jan 14 16:37:18 sher pengine[1383]: notice: LogActions: Start res_drbd_export:1#011(lock)
Jan 14 16:37:18 sher pengine[1383]: notice: LogActions: Restart res_exportfs_export1#011(Started sher)
Jan 14 16:37:18 sher pengine[1383]: notice: LogActions: Start res_nfsserver:1#011(lock)
Jan 14 16:37:18 sher pengine[1383]: error: process_pe_message: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-error-352.bz2
Jan 14 16:37:18 sher crmd[1384]: notice: te_rsc_command: Initiating action 11: start fence_sher_start_0 on lock
Jan 14 16:37:18 sher crmd[1384]: notice: te_rsc_command: Initiating action 50: stop res_exportfs_export1_stop_0 on lock
Jan 14 16:37:18 sher crmd[1384]: notice: te_rsc_command: Initiating action 49: stop res_exportfs_export1_stop_0 on sher (local)
Jan 14 16:37:18 sher crmd[1384]: notice: te_rsc_command: Initiating action 68: monitor res_exportfs_root_monitor_3 on lock
Jan 14 16:37:18 sher crmd[1384]: notice: te_rsc_command: Initiating action 76: notify res_drbd_export_pre_notify_start_0 on sher (local)
Jan 14 16:37:18 sher crmd[1384]: notice: te_rsc_command: Initiating action 58: start res_nfsserver_start_0 on lock

I have pacemaker 1.1.10 installed in my setup; should I try an upgrade?

--
Regards,
Priyanka
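For completeness, the DRBD-level fencing Digimer refers to is configured in drbd.conf together with handler scripts that ship with DRBD. This is a sketch for DRBD 8.x; the resource name "export" is an assumption, not taken from the poster's attached configuration:

```
# drbd.conf (DRBD 8.x) -- resource-level fencing tied to Pacemaker.
# "export" is an assumed resource name for illustration.
resource export {
  disk {
    fencing resource-and-stonith;   # freeze I/O and ask Pacemaker to fence the peer
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

With resource-and-stonith, DRBD suspends I/O until the fence-peer handler confirms the peer is fenced, which is what prevents both nodes from writing as Primary during a membership split.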
Re: [ClusterLabs] Corosync+Pacemaker error during failover
On 2015-10-08 20:52, emmanuel segura wrote:
> please check if your drbd is configured to call the fence-handler
> https://drbd.linbit.com/users-guide/s-pacemaker-fencing.html

Yes.

> 2015-10-08 17:16 GMT+02:00 priyanka:
>> Hi,
>>
>> We are trying to build an HA setup for our servers using the DRBD + Corosync + Pacemaker stack. Attached are the configuration files for corosync/pacemaker and drbd.
>>
>> We are getting errors while testing this setup.
>>
>> 1. When we stop corosync on the Master machine, say server1 (lock), it is Stonith'ed. In this case the slave, server2 (sher), is promoted to master. But when server1 (lock) reboots, res_exportfs_export1 is started on both servers, and that resource goes into a failed state, followed by the servers going into an unclean state. Then server1 (lock) reboots, and server2 (sher) is master but in an unclean state. After server1 (lock) comes up, server2 (sher) is Stonith'ed and server1 (lock) is slave (the only online node). When server2 (sher) comes up, both servers are slaves and the resource group (rg_export) is stopped. Then server2 (sher) becomes Master, server1 (lock) is slave, and the resource group is started. At this point the configuration becomes stable.
>>
>> PFA logs (syslog) of server2 (sher) from after it is promoted to master until it is first rebooted, when the exportfs resource goes into the failed state. Please let us know whether the configuration is appropriate. From the logs we could not figure out the exact reason for the resource failure. Your comments on this scenario will be very helpful.
>>
>> Thanks,
>> Priyanka

--
Regards,
Priyanka
MTech3 Sysad
IIT Powai
[ClusterLabs] Corosync+Pacemaker error during failover
Hi,

We are trying to build an HA setup for our servers using the DRBD + Corosync + Pacemaker stack. Attached are the configuration files for corosync/pacemaker and drbd.

We are getting errors while testing this setup.

1. When we stop corosync on the Master machine, say server1 (lock), it is Stonith'ed. In this case the slave, server2 (sher), is promoted to master. But when server1 (lock) reboots, res_exportfs_export1 is started on both servers, and that resource goes into a failed state, followed by the servers going into an unclean state. Then server1 (lock) reboots, and server2 (sher) is master but in an unclean state. After server1 (lock) comes up, server2 (sher) is Stonith'ed and server1 (lock) is slave (the only online node). When server2 (sher) comes up, both servers are slaves and the resource group (rg_export) is stopped. Then server2 (sher) becomes Master, server1 (lock) is slave, and the resource group is started. At this point the configuration becomes stable.

PFA logs (syslog) of server2 (sher) from after it is promoted to master until it is first rebooted, when the exportfs resource goes into the failed state. Please let us know whether the configuration is appropriate. From the logs we could not figure out the exact reason for the resource failure. Your comments on this scenario will be very helpful.

Thanks,
Priyanka

sher (new master) =>

Oct 8 18:01:20 sher kernel: [ 886.867496] e1000e: eth0 NIC Link is Down
Oct 8 18:01:22 sher exportfs(res_exportfs_root)[5566]: INFO: Directory /mnt/vms is exported to 10.105.0.0/255.255.0.0 (started).
Oct 8 18:01:27 sher exportfs(res_exportfs_export1)[5580]: INFO: Directory /mnt/vms/export1 is exported to 10.105.0.0/255.255.0.0 (started).
Oct 8 18:01:30 sher kernel: [ 896.771854] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Oct 8 18:01:30 sher corosync[1320]: [TOTEM ] A new membership (192.168.0.21:3444) was formed. Members joined: 102
Oct 8 18:01:30 sher crmd[1465]: error: pcmk_cpg_membership: Node lock[102] appears to be online even though we think it is dead
Oct 8 18:01:30 sher crmd[1465]: notice: crm_update_peer_state: pcmk_cpg_membership: Node lock[102] - state is now member (was lost)
Oct 8 18:01:30 sher corosync[1320]: [QUORUM] Members[2]: 101 102
Oct 8 18:01:30 sher corosync[1320]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 8 18:01:30 sher pacemakerd[1458]: notice: crm_update_peer_state: pcmk_quorum_notification: Node lock[102] - state is now member (was lost)
Oct 8 18:01:32 sher exportfs(res_exportfs_root)[5652]: INFO: Directory /mnt/vms is exported to 10.105.0.0/255.255.0.0 (started).
Oct 8 18:01:37 sher exportfs(res_exportfs_export1)[5666]: INFO: Directory /mnt/vms/export1 is exported to 10.105.0.0/255.255.0.0 (started).
Oct 8 18:01:43 sher exportfs(res_exportfs_root)[5738]: INFO: Directory /mnt/vms is exported to 10.105.0.0/255.255.0.0 (started).
Oct 8 18:01:47 sher exportfs(res_exportfs_export1)[5779]: INFO: Directory /mnt/vms/export1 is exported to 10.105.0.0/255.255.0.0 (started).
Oct 8 18:01:49 sher crmd[1465]: notice: do_state_transition: State transition S_IDLE -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Oct 8 18:01:49 sher crmd[1465]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-res_exportfs_root (2)
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-res_drbd_export (1)
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-res_exportfs_root (1444306659)
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-fence_lock (INFINITY)
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-fence_lock (1444306771)
Oct 8 18:01:50 sher pengine[1464]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 8 18:01:50 sher pengine[1464]: warning: unpack_rsc_op: Processing failed op monitor for res_exportfs_root:0 on sher: not running (7)
Oct 8 18:01:50 sher pengine[1464]: warning: unpack_rsc_op: Processing failed op start for fence_lock on sher: unknown error (1)
Oct 8 18:01:50 sher pengine[1464]: warning: common_apply_stickiness: Forcing fence_lock away from sher after 100 failures (max=100)
Oct 8 18:01:50 sher pengine[1464]: notice: LogActions: Start fence_sher#011(lock)
Oct 8 18:01:50 sher pengine[1464]: notice: LogActions: Start
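The "fail-count-fence_lock (INFINITY)" and "Forcing fence_lock away from sher" entries above come from Pacemaker's fail-count mechanism: a failed start sets the fail count to INFINITY, which bans the resource from that node until the count is cleared. As a hedged sketch (illustrative values, not this cluster's configuration), the relevant knobs in crmsh are:

```
# crmsh sketch -- illustrative values.
# migration-threshold: failures allowed before a resource is banned from a node;
# failure-timeout: how long before fail counts expire automatically.
rsc_defaults migration-threshold=3 failure-timeout=10min
```

A stuck fail count such as the INFINITY above can also be cleared manually with `crm resource cleanup fence_lock`, after which the scheduler will consider the node again.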