Re: [ClusterLabs] Pacemaker fatal shutdown
FSA action flags 0x0020 (A_INTEGRATE_TIMER_STOP) for controller set by do_state_transition:559
63835:Jul 17 14:16:55.092 FILE-2 pacemaker-controld [15962] (pcmk__set_flags_as) debug: FSA action flags 0x0080 (A_FINALIZE_TIMER_STOP) for controller set by do_state_transition:565
63836:Jul 17 14:16:55.092 FILE-2 pacemaker-controld [15962] (pcmk__clear_flags_as) debug: FSA action flags 0x0200 (an_action) for controller cleared by do_fsa_action:108
63837:Jul 17 14:16:55.092 FILE-2 pacemaker-controld [15962] (pcmk__clear_flags_as) debug: FSA action flags 0x0020 (an_action) for controller cleared by do_fsa_action:108
63838:Jul 17 14:16:55.092 FILE-2 pacemaker-controld [15962] (pcmk__clear_flags_as) debug: FSA action flags 0x0080 (an_action) for controller cleared by do_fsa_action:108
63863:Jul 17 14:17:25.073 FILE-2 pacemaker-controld [15962] (throttle_cib_load) debug: cib load: 0.000667 (2 ticks in 30s)
63864:Jul 17 14:17:25.073 FILE-2 pacemaker-controld [15962] (throttle_mode) debug: Current load is 0.65 across 10 core(s)
63865:Jul 17 14:17:55.073 FILE-2 pacemaker-controld [15962] (throttle_cib_load) debug: cib load: 0.000333 (1 ticks in 30s)
63866:Jul 17 14:17:55.073 FILE-2 pacemaker-controld [15962] (throttle_mode) debug: Current load is 0.85 across 10 core(s)
63868:Jul 17 14:18:20.085 FILE-2 pacemaker-fenced[15958] (process_remote_stonith_exec) debug: Finalizing action 'reboot' targeting FILE-2 on behalf of pacemaker-controld.19415@FILE-6: OK | rc=0 id=4e523b34
63869:Jul 17 14:18:20.085 FILE-2 pacemaker-fenced[15958] (remote_op_done) notice: Operation 'reboot' targeting FILE-2 by FILE-4 for pacemaker-controld.19415@FILE-6: OK | id=4e523b34
63872:Jul 17 14:18:20.089 FILE-2 pacemaker-controld [15962] (exec_alert_list) info: Sending fencing alert via pf-ha-alert to (null)
63875:Jul 17 14:18:20.089 FILE-2 pacemaker-controld [15962] (tengine_stonith_notify) crit: We were allegedly just fenced by FILE-4 for FILE-6!
63876:Jul 17 14:18:20.089 FILE-2 pacemaker-controld [15962] (crm_xml_cleanup) info: Cleaning up memory from libxml2
63877:Jul 17 14:18:20.089 FILE-2 pacemaker-controld [15962] (crm_exit) info: Exiting pacemaker-controld | with status 100
63900:Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_child_exit) warning: Shutting cluster down because pacemaker-controld[15962] had fatal failure
63902:Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_shutdown_worker) debug: pacemaker-controld confirmed stopped
63956:Jul 17 14:18:20.101 FILE-2 pacemaker-fenced[15958] (process_remote_stonith_exec) debug: Finalizing action 'reboot' targeting FILE-1 on behalf of pacemaker-controld.19415@FILE-6: OK | rc=0 id=446afc42
63957:Jul 17 14:18:20.101 FILE-2 pacemaker-fenced[15958] (remote_op_done) notice: Operation 'reboot' targeting FILE-1 by FILE-5 for pacemaker-controld.19415@FILE-6: OK | id=446afc42

Thanks
Priyanka

On Thu, Jul 20, 2023 at 12:07 AM Ken Gaillot wrote:
> On Wed, 2023-07-19 at 23:49 +0530, Priyanka Balotra wrote:
> > Hi All,
> > I am using SLES 15 SP4. One of the nodes of the cluster was brought
> > down and booted up again after some time. The Pacemaker service came
> > up first, but later it faced a fatal shutdown. Due to that, the crm
> > service is down.
> >
> > The logs from /var/log/pacemaker/pacemaker.log are as follows:
> >
> > Jul 17 14:18:20.093 FILE-2 pacemakerd [15956]
> > (pcmk_child_exit) warning: Shutting cluster down because
> > pacemaker-controld[15962] had fatal failure
>
> The interesting messages will be before this. The ones with
> "pacemaker-controld" will be the most relevant, at least initially.
>
> > Jul 17 14:18:20.093 FILE-2 pacemakerd [15956]
> > (pcmk_shutdown_worker) notice: Shutting down Pacemaker
> > Jul 17 14:18:20.093 FILE-2 pacemakerd [15956]
> > (pcmk_shutdown_worker) debug: pacemaker-controld confirmed stopped
> > Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (stop_child)
> > notice: Stopping pacemaker-schedulerd | sent signal 15 to process 15961
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> > (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler)
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> > (qb_ipcs_us_withdraw) info: withdrawing server sockets
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> > (qb_ipcs_unref) debug: qb_ipcs_unref() - destroying
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> > (crm_xml_cleanup) info: Cleaning up memory from libxml2
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_exit)
> > info: Exiting pacemak
[ClusterLabs] Pacemaker fatal shutdown
Hi All,

I am using SLES 15 SP4. One of the nodes of the cluster was brought down and booted up again after some time. The Pacemaker service came up first, but later it faced a fatal shutdown. Due to that, the crm service is down.

The logs from /var/log/pacemaker/pacemaker.log are as follows:

Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_child_exit) warning: Shutting cluster down because pacemaker-controld[15962] had fatal failure
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_shutdown_worker) notice: Shutting down Pacemaker
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_shutdown_worker) debug: pacemaker-controld confirmed stopped
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (stop_child) notice: Stopping pacemaker-schedulerd | sent signal 15 to process 15961
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler)
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (qb_ipcs_us_withdraw) info: withdrawing server sockets
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (qb_ipcs_unref) debug: qb_ipcs_unref() - destroying
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_xml_cleanup) info: Cleaning up memory from libxml2
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_exit) info: Exiting pacemaker-schedulerd | with status 0
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957] (qb_ipcs_event_sendv) debug: new_event_notification (/dev/shm/qb-15957-15962-12-RDPw6O/qb): Broken pipe (32)
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957] (cib_notify_send_one) warning: Could not notify client crmd: Broken pipe | id=e29d175e-7e91-4b6a-bffb-fabfdd7a33bf
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957] (cib_process_request) info: Completed cib_delete operation for section //node_state[@uname='FILE-2']/*: OK (rc=0, origin=FILE-6/crmd/74, version=0.24.75)
Jul 17 14:18:20.093 FILE-2 pacemaker-fenced[15958] (xml_patch_version_check) debug: Can apply patch 0.24.75 to 0.24.74
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_child_exit) info: pacemaker-schedulerd[15961] exited with status 0 (OK)
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=FILE-6/crmd/75, version=0.24.75)
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_shutdown_worker) debug: pacemaker-schedulerd confirmed stopped
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (stop_child) notice: Stopping pacemaker-attrd | sent signal 15 to process 15960
Jul 17 14:18:20.093 FILE-2 pacemaker-attrd [15960] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler)

Could you please help me understand the issue here.

Regards
Priyanka

___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
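Ken's suggestion in the reply above — look at the pacemaker-controld messages leading up to the fatal exit — can be done with a plain grep over the log. A minimal sketch against a hypothetical trimmed excerpt (on a real node the file is /var/log/pacemaker/pacemaker.log, and the temp-file dance below is only there to keep the example self-contained):

```shell
# Build a hypothetical log excerpt; replace with the real pacemaker.log path.
log=$(mktemp)
cat > "$log" <<'EOF'
Jul 17 14:18:20.089 FILE-2 pacemaker-controld [15962] (tengine_stonith_notify) crit: We were allegedly just fenced by FILE-4 for FILE-6!
Jul 17 14:18:20.089 FILE-2 pacemaker-controld [15962] (crm_exit) info: Exiting pacemaker-controld | with status 100
Jul 17 14:18:20.093 FILE-2 pacemakerd [15956] (pcmk_child_exit) warning: Shutting cluster down because pacemaker-controld[15962] had fatal failure
EOF
# Pull only the controld messages; the last ones before crm_exit explain the exit.
matches=$(grep 'pacemaker-controld ' "$log")
echo "$matches"
rm -f "$log"
```

Here the "allegedly just fenced" crit message immediately before the exit-with-status-100 line is what points at the cause: another node fenced FILE-2 while its controller was still running.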
Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution
I am using SLES 15 SP4. Is the no-quorum-policy still supported?

Thanks
Priyanka

On Wed, 28 Jun 2023 at 12:46 AM, Ken Gaillot wrote:
> On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:
> > In this case stonith has been configured as a resource,
> > primitive stonith-sbd stonith:external/sbd
> >
> > For it to function properly, the resource needs to be up, which
> > is only possible if the system is quorate.
>
> Pacemaker can use a fence device even if its resource is not active.
> The resource being active just allows Pacemaker to monitor the device
> regularly.
>
> > Hence our requirement is to make the system quorate even if one node
> > of the cluster is up.
> > Stonith will then take care of any split-brain scenarios.
>
> In that case it sounds like no-quorum-policy=ignore is actually what
> you want.
>
> > Thanks
> > Priyanka
> >
> > On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger wrote:
> > >
> > > On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov <arvidj...@gmail.com> wrote:
> > > > On 27.06.2023 07:21, Priyanka Balotra wrote:
> > > > > Hi Andrei,
> > > > > After this state the system went through some more fencings and
> > > > > we saw the following state:
> > > > >
> > > > > :~ # crm status
> > > > > Cluster Summary:
> > > > >   * Stack: corosync
> > > > >   * Current DC: FILE-2 (version 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
> > > >
> > > > It says "partition with quorum" so what exactly is the problem?
> > >
> > > I guess the problem is that resources aren't being recovered on
> > > the nodes in the quorate partition.
> > > Reason for that is probably that - as Ken was already suggesting -
> > > fencing isn't working properly or fencing-devices used are simply
> > > inappropriate for the purpose (e.g. onboard IPMI).
> > > The fact that a node is rebooting isn't enough. The node that
> > > initiated fencing has to know that it did actually work. But we're
> > > just guessing here. Logs should show what is actually going on.
> > >
> > > Klaus
> > >
> > > > >   * Last updated: Mon Jun 26 12:44:15 2023
> > > > >   * Last change: Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
> > > > >   * 4 nodes configured
> > > > >   * 11 resource instances configured
> > > > >
> > > > > Node List:
> > > > >   * Node FILE-1: UNCLEAN (offline)
> > > > >   * Node FILE-4: UNCLEAN (offline)
> > > > >   * Online: [ FILE-2 ]
> > > > >   * Online: [ FILE-3 ]
> > > > >
> > > > > At this stage FILE-1 and FILE-4 were continuously getting fenced
> > > > > (we have device-based stonith configured but the resource was not up).
> > > > > Two nodes were online and two were offline, so quorum wasn't
> > > > > attained again.
> > > > > 1) For such a scenario we need help to be able to have one cluster live.
> > > > > 2) And in cases where only one node of the cluster is up and
> > > > > others are down, we need the resources and cluster to be up.
> > > > >
> > > > > Thanks
> > > > > Priyanka
> > > > >
> > > > > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <arvidj...@gmail.com> wrote:
> > > > >
> > > > >> On 26.06.2023 21:14, Priyanka Balotra wrote:
> > > > >>> Hi All,
> > > > >>> We are seeing an issue where we replaced no-quorum-policy=ignore
> > > > >>> with other options in corosync.conf in order to simulate the
> > > > >>> same behaviour:
> > > > >>>
> > > > >>> wait_for_all: 0
> > > > >>> last_man_standing: 1
> > > > >>> last_man_standing_window: 2
> > > > >>>
> > > > >>> There was another property (aut
Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution
In this case stonith has been configured as a resource:

primitive stonith-sbd stonith:external/sbd

For it to function properly, the resource needs to be up, which is only possible if the system is quorate. Hence our requirement is to make the system quorate even if only one node of the cluster is up. Stonith will then take care of any split-brain scenarios.

Thanks
Priyanka

On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger wrote:
>
> On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov wrote:
>
>> On 27.06.2023 07:21, Priyanka Balotra wrote:
>> > Hi Andrei,
>> > After this state the system went through some more fencings and we saw the
>> > following state:
>> >
>> > :~ # crm status
>> > Cluster Summary:
>> >   * Stack: corosync
>> >   * Current DC: FILE-2 (version 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
>>
>> It says "partition with quorum" so what exactly is the problem?
>>
> I guess the problem is that resources aren't being recovered on
> the nodes in the quorate partition.
> Reason for that is probably that - as Ken was already suggesting - fencing isn't
> working properly or fencing-devices used are simply inappropriate for the
> purpose (e.g. onboard IPMI).
> The fact that a node is rebooting isn't enough. The node that initiated fencing
> has to know that it did actually work. But we're just guessing here. Logs should
> show what is actually going on.
>
> Klaus
>
>> >   * Last updated: Mon Jun 26 12:44:15 2023
>> >   * Last change: Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
>> >   * 4 nodes configured
>> >   * 11 resource instances configured
>> >
>> > Node List:
>> >   * Node FILE-1: UNCLEAN (offline)
>> >   * Node FILE-4: UNCLEAN (offline)
>> >   * Online: [ FILE-2 ]
>> >   * Online: [ FILE-3 ]
>> >
>> > At this stage FILE-1 and FILE-4 were continuously getting fenced (we have
>> > device-based stonith configured but the resource was not up).
>> > Two nodes were online and two were offline, so quorum wasn't attained again.
>> > 1) For such a scenario we need help to be able to have one cluster live.
>> > 2) And in cases where only one node of the cluster is up and others are
>> > down, we need the resources and cluster to be up.
>> >
>> > Thanks
>> > Priyanka
>> >
>> > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov wrote:
>> >
>> >> On 26.06.2023 21:14, Priyanka Balotra wrote:
>> >>> Hi All,
>> >>> We are seeing an issue where we replaced no-quorum-policy=ignore with
>> >>> other options in corosync.conf in order to simulate the same behaviour:
>> >>>
>> >>> wait_for_all: 0
>> >>> last_man_standing: 1
>> >>> last_man_standing_window: 2
>> >>>
>> >>> There was another property (auto-tie-breaker) tried but couldn't
>> >>> configure it as crm did not recognise this property.
>> >>>
>> >>> But even after using these options, we are seeing that the system is
>> >>> not quorate if at least half of the nodes are not up.
>> >>>
>> >>> Some properties from crm config are as follows:
>> >>>
>> >>> primitive stonith-sbd stonith:external/sbd \
>> >>>     params pcmk_delay_base=5s
>> >>> .
>> >>> .
>> >>> property cib-bootstrap-options: \
>> >>>     have-watchdog=true \
>> >>>     dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
>> >>>     cluster-infrastructure=corosync \
>> >>>     cluster-name=FILE \
>> >>>     stonith-enabled=true \
>> >>>     stonith-timeout=172 \
>> >>>     stonith-action=reboot \
>> >>>     stop-all-resources=false \
>> >>>     no-quorum-po
Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution
Hi Andrei,

After this state the system went through some more fencings and we saw the following state:

:~ # crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: FILE-2 (version 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
  * Last updated: Mon Jun 26 12:44:15 2023
  * Last change: Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
  * 4 nodes configured
  * 11 resource instances configured

Node List:
  * Node FILE-1: UNCLEAN (offline)
  * Node FILE-4: UNCLEAN (offline)
  * Online: [ FILE-2 ]
  * Online: [ FILE-3 ]

At this stage FILE-1 and FILE-4 were continuously getting fenced (we have device-based stonith configured but the resource was not up). Two nodes were online and two were offline, so quorum wasn't attained again.
1) For such a scenario we need help to be able to have one cluster live.
2) And in cases where only one node of the cluster is up and others are down, we need the resources and cluster to be up.

Thanks
Priyanka

On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov wrote:
> On 26.06.2023 21:14, Priyanka Balotra wrote:
> > Hi All,
> > We are seeing an issue where we replaced no-quorum-policy=ignore with
> > other options in corosync.conf in order to simulate the same behaviour:
> >
> > wait_for_all: 0
> > last_man_standing: 1
> > last_man_standing_window: 2
> >
> > There was another property (auto-tie-breaker) tried but couldn't
> > configure it as crm did not recognise this property.
> >
> > But even after using these options, we are seeing that the system is
> > not quorate if at least half of the nodes are not up.
> >
> > Some properties from crm config are as follows:
> >
> > primitive stonith-sbd stonith:external/sbd \
> >     params pcmk_delay_base=5s
> > .
> > .
> > property cib-bootstrap-options: \
> >     have-watchdog=true \
> >     dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
> >     cluster-infrastructure=corosync \
> >     cluster-name=FILE \
> >     stonith-enabled=true \
> >     stonith-timeout=172 \
> >     stonith-action=reboot \
> >     stop-all-resources=false \
> >     no-quorum-policy=ignore
> > rsc_defaults build-resource-defaults: \
> >     resource-stickiness=1
> > rsc_defaults rsc-options: \
> >     resource-stickiness=100 \
> >     migration-threshold=3 \
> >     failure-timeout=1m \
> >     cluster-recheck-interval=10min
> > op_defaults op-options: \
> >     timeout=600 \
> >     record-pending=true
> >
> > On a 4-node setup when the whole cluster is brought up together we see
> > error logs like:
> >
> > 2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Fencing and resource management disabled due to lack of quorum
> > 2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Ignoring malformed node_state entry without uname
> > 2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-2 is unclean!
> > 2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-3 is unclean!
> > 2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-4 is unclean!
>
> According to this output FILE-1 lost connection to three other nodes, in
> which case it cannot be quorate.
>
> > Kindly help correct the configuration to make the system function
> > normally with all resources up, even if there is just one node up.
> >
> > Please let me know if any more info is needed.
> >
> > Thanks
> > Priyanka
[ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution
Hi All,

We are seeing an issue where we replaced no-quorum-policy=ignore with other options in corosync.conf in order to simulate the same behaviour:

wait_for_all: 0
last_man_standing: 1
last_man_standing_window: 2

There was another property (auto-tie-breaker) tried but we couldn't configure it as crm did not recognise this property.

But even after using these options, we are seeing that the system is not quorate if at least half of the nodes are not up.

Some properties from crm config are as follows:

primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_base=5s
.
.
property cib-bootstrap-options: \
    have-watchdog=true \
    dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
    cluster-infrastructure=corosync \
    cluster-name=FILE \
    stonith-enabled=true \
    stonith-timeout=172 \
    stonith-action=reboot \
    stop-all-resources=false \
    no-quorum-policy=ignore
rsc_defaults build-resource-defaults: \
    resource-stickiness=1
rsc_defaults rsc-options: \
    resource-stickiness=100 \
    migration-threshold=3 \
    failure-timeout=1m \
    cluster-recheck-interval=10min
op_defaults op-options: \
    timeout=600 \
    record-pending=true

On a 4-node setup when the whole cluster is brought up together we see error logs like:

2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Fencing and resource management disabled due to lack of quorum
2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Ignoring malformed node_state entry without uname
2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-2 is unclean!
2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-3 is unclean!
2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-4 is unclean!

Kindly help correct the configuration to make the system function normally with all resources up, even if there is just one node up.

Please let me know if any more info is needed.

Thanks
Priyanka
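[Editor's note on the options discussed in this thread: wait_for_all, last_man_standing, last_man_standing_window, and auto_tie_breaker are all corosync votequorum settings, so they belong in the quorum section of /etc/corosync/corosync.conf rather than in the Pacemaker configuration — which is likely why crm did not recognise "auto-tie-breaker" (the votequorum option name uses underscores). A hypothetical fragment for a four-node cluster; the values here are illustrative only, not a recommendation:]

```
# /etc/corosync/corosync.conf (fragment)
quorum {
    provider: corosync_votequorum
    expected_votes: 4

    # Do not require all nodes to be seen once before granting quorum.
    wait_for_all: 0

    # Recalculate quorum as nodes leave, down to a smaller partition.
    last_man_standing: 1
    last_man_standing_window: 20000    # milliseconds

    # Break 50/50 ties; "lowest" favours the partition containing the
    # lowest node ID.
    auto_tie_breaker: 1
    auto_tie_breaker_node: lowest
}
```

Changes to this file require a corosync restart or `corosync-cfgtool -R` style reload depending on the option; consult the votequorum(5) man page, since several of these options interact (e.g. last_man_standing with auto_tie_breaker).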
Re: [ClusterLabs] crm node stays online after issuing node standby command
+Ayush

Thanks

On Wed, 15 Mar 2023 at 8:17 PM, Ken Gaillot wrote:
> Hi,
>
> If you can reproduce the problem, the following info would be helpful:
>
> * "cibadmin -Q | grep standby": to show whether it was successfully
> recorded in the CIB (will show info for any node with standby, but the
> XML ID likely has the node name or ID in it)
>
> * "attrd_updater -Q -n standby -N FILE-2": to show whether the
> attribute manager has the right value in memory for the affected node
>
> On Wed, 2023-03-15 at 15:51 +0530, Ayush Siddarath wrote:
> > Hi All,
> >
> > We are seeing an issue as part of crm maintenance operations. As part
> > of the upgrade process, the crm nodes are put into standby mode.
> > But it's observed that one of the nodes fails to go into standby mode
> > despite the "crm node standby" returning success.
> >
> > Commands issued to put nodes into maintenance:
> >
> > > [2023-03-15 06:07:08 +] [468] [INFO] changed: [FILE-1] => {"changed": true, "cmd": "/usr/sbin/crm node standby FILE-1", "delta": "0:00:00.442615", "end": "2023-03-15 06:07:08.150375", "rc": 0, "start": "2023-03-15 06:07:07.707760", "stderr": "", "stderr_lines": [], "stdout": "\u001b[32mINFO\u001b[0m: standby node FILE-1", "stdout_lines": ["\u001b[32mINFO\u001b[0m: standby node FILE-1"]}
> > > .
> > > [2023-03-15 06:07:08 +] [468] [INFO] changed: [FILE-2] => {"changed": true, "cmd": "/usr/sbin/crm node standby FILE-2", "delta": "0:00:00.459407", "end": "2023-03-15 06:07:08.223749", "rc": 0, "start": "2023-03-15 06:07:07.764342", "stderr": "", "stderr_lines": [], "stdout": "\u001b[32mINFO\u001b[0m: standby node FILE-2", "stdout_lines": ["\u001b[32mINFO\u001b[0m: standby node FILE-2"]}
> >
> > Crm status o/p after above command execution:
> >
> > > FILE-2:/var/log # crm status
> > > Cluster Summary:
> > >   * Stack: corosync
> > >   * Current DC: FILE-1 (version 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
> > >   * Last updated: Wed Mar 15 08:32:27 2023
> > >   * Last change: Wed Mar 15 06:07:08 2023 by root via cibadmin on FILE-4
> > >   * 4 nodes configured
> > >   * 11 resource instances configured (5 DISABLED)
> > > Node List:
> > >   * Node FILE-1: standby (with active resources)
> > >   * Node FILE-3: standby (with active resources)
> > >   * Node FILE-4: standby (with active resources)
> > >   * Online: [ FILE-2 ]
> >
> > pacemaker logs indicate that FILE-2 received the commands to put it
> > into standby.
> >
> > > FILE-2:/var/log # grep standby /var/log/pacemaker/pacemaker.log
> > > Mar 15 06:07:08.098 FILE-2 pacemaker-based [8635] (cib_perform_op) info: ++ value="on"/>
> > > Mar 15 06:07:08.166 FILE-2 pacemaker-based [8635] (cib_perform_op) info: ++ value="on"/>
> > > Mar 15 06:07:08.170 FILE-2 pacemaker-based [8635] (cib_perform_op) info: ++ value="on"/>
> > > Mar 15 06:07:08.230 FILE-2 pacemaker-based [8635] (cib_perform_op) info: ++ value="on"/>
> >
> > Issue is quite intermittent and observed on other nodes as well.
> > We have seen a similar issue when we try to remove the node from
> > standby mode (using "crm node online"). One or more nodes fail to
> > be removed from standby mode.
> >
> > We suspect it could be an issue with parallel execution of the node
> > standby/online command for all nodes, but this issue wasn't observed
> > with the pacemaker packaged with SLES15 SP2.
> >
> > I'm attaching the pacemaker.log from FILE-2 for analysis. Let us know
> > if any additional information is required.
> >
> > OS: SLES15 SP4
> > Pacemaker version:
> > crmadmin --version
> > Pacemaker 2.1.2+20211124.ada5c3b36-150400.2.43
> >
> > Thanks,
> > Ayush
> --
> Ken Gaillot
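[Editor's note: Ken's first check above (`cibadmin -Q | grep standby`) can also be run offline against a saved CIB dump. A minimal sketch, assuming a hypothetical, heavily trimmed CIB fragment whose nvpair layout matches the `value="on"` lines visible in the pacemaker.log excerpt; in practice you would feed it the output of `cibadmin -Q`:]

```python
# Sketch: list nodes whose "standby" node attribute is recorded as "on"
# in the CIB. The XML below is a hypothetical, trimmed CIB fragment.
import xml.etree.ElementTree as ET

CIB_SNIPPET = """
<cib>
  <configuration>
    <nodes>
      <node id="1" uname="FILE-1">
        <instance_attributes id="nodes-1">
          <nvpair id="nodes-1-standby" name="standby" value="on"/>
        </instance_attributes>
      </node>
      <node id="2" uname="FILE-2"/>
    </nodes>
  </configuration>
</cib>
"""

def standby_nodes(cib_xml: str) -> list[str]:
    """Return unames of nodes whose standby attribute is 'on'."""
    root = ET.fromstring(cib_xml)
    result = []
    for node in root.iter("node"):
        for nv in node.iter("nvpair"):
            if nv.get("name") == "standby" and nv.get("value") == "on":
                result.append(node.get("uname"))
    return result

print(standby_nodes(CIB_SNIPPET))  # FILE-2 missing here would match the symptom
```

A node that `crm node standby` reported rc=0 for but that does not appear in this list would confirm the attribute never made it into the CIB, as opposed to the attribute manager holding a stale in-memory value (Ken's second check, `attrd_updater -Q`).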
Re: [ClusterLabs] pacemaker-fenced[11637]: warning: Can't create a sane reply
Hi Klaus,

The config is as follows: there are 2 nodes in the setup and some resources configured (stonith, IP, and systemd-service related). Sorry, I can share only high-level details for this.

Pacemaker version:
# rpm -qa pacemaker
pacemaker-2.0.3+20200511.2b248d828-1.10.x86_64
# rpm -qa corosync
corosync-2.4.5-10.14.6.1.x86_64
# rpm -qa crmsh
crmsh-4.2.0+git.1585096577.f3257c89-3.4.noarch

On Wed, Jun 22, 2022 at 5:45 PM Klaus Wenninger wrote:
> On Wed, Jun 22, 2022 at 1:46 PM Priyanka Balotra wrote:
> >
> > Hi All,
> >
> > We are seeing an issue where we performed a cluster shutdown followed by a
> > cluster boot operation. All the nodes joined the cluster except one (the
> > first node). Here are some pacemaker logs around that timestamp:
> >
> > 2022-06-19T07:02:08.690213+00:00 FILE-1 pacemaker-fenced[11637]: notice: Operation 'off' targeting FILE-1 on FILE-2 for pacemaker-controld.11523@FILE-2.0b09e949: OK
> > 2022-06-19T07:02:08.690604+00:00 FILE-1 pacemaker-fenced[11637]: error: stonith_construct_reply: Triggered assert at fenced_commands.c:2363 : request != NULL
> > 2022-06-19T07:02:08.690781+00:00 FILE-1 pacemaker-fenced[11637]: warning: Can't create a sane reply
> > 2022-06-19T07:02:08.691872+00:00 FILE-1 pacemaker-controld[11643]: crit: We were allegedly just fenced by FILE-2 for FILE-2!
> > 2022-06-19T07:02:08.693994+00:00 FILE-1 pacemakerd[11622]: warning: Shutting cluster down because pacemaker-controld[11643] had fatal failure
> > 2022-06-19T07:02:08.694209+00:00 FILE-1 pacemakerd[11622]: notice: Shutting down Pacemaker
> > 2022-06-19T07:02:08.694381+00:00 FILE-1 pacemakerd[11622]: notice: Stopping pacemaker-schedulerd
> >
> > Let us know if you need any more logs to find an RCA for this.
>
> A little bit more info about your configuration and the pacemaker version
> (cib?) used would definitely be helpful.
>
> Klaus
>
> > Thanks
> > Priyanka
[ClusterLabs] pacemaker-fenced[11637]: warning: Can't create a sane reply
Hi All,

We are seeing an issue where we performed a cluster shutdown followed by a cluster boot operation. All the nodes joined the cluster except one (the first node). Here are some pacemaker logs around that timestamp:

2022-06-19T07:02:08.690213+00:00 FILE-1 pacemaker-fenced[11637]: notice: Operation 'off' targeting FILE-1 on FILE-2 for pacemaker-controld.11523@FILE-2.0b09e949: OK
2022-06-19T07:02:08.690604+00:00 FILE-1 pacemaker-fenced[11637]: error: stonith_construct_reply: Triggered assert at fenced_commands.c:2363 : request != NULL
2022-06-19T07:02:08.690781+00:00 FILE-1 pacemaker-fenced[11637]: warning: Can't create a sane reply
2022-06-19T07:02:08.691872+00:00 FILE-1 pacemaker-controld[11643]: crit: We were allegedly just fenced by FILE-2 for FILE-2!
2022-06-19T07:02:08.693994+00:00 FILE-1 pacemakerd[11622]: warning: Shutting cluster down because pacemaker-controld[11643] had fatal failure
2022-06-19T07:02:08.694209+00:00 FILE-1 pacemakerd[11622]: notice: Shutting down Pacemaker
2022-06-19T07:02:08.694381+00:00 FILE-1 pacemakerd[11622]: notice: Stopping pacemaker-schedulerd

Let us know if you need any more logs to find an RCA for this.

Thanks
Priyanka
[ClusterLabs] crm status shows CURRENT DC as None
Hi Folks,

crm status shows the Current DC as NONE. Please check and let us know why the current DC is not pointing to any of the nodes.

CRM status:

Cluster Summary:
  * Stack: corosync
  * Current DC: NONE
  * Last updated: Tue Jun  7 06:14:59 2022
  * Last change: Tue Jun  7 05:29:40 2022 by root via cibadmin on FILE-2
  * 2 nodes configured
  * 9 resource instances configured

- How will the current DC be set to a node once we see it as NONE?
- Is there any impact on cluster functionality?

Thanks
Priyanka
[ClusterLabs] Resources too_active (active on all nodes of the cluster, instead of only 1 node)
Hi All, We have a scenario on SLES 12 SP3 cluster. The scenario is explained as follows in the order of events: - There is a 2-node cluster (FILE-1, FILE-2) - The cluster and the resources were up and running fine initially . - Then fencing request from pacemaker got issued on both nodes simultaneously Logs from 1st node: 2022-02-22T03:26:36.737075+00:00 FILE-1 corosync[12304]: [TOTEM ] Failed to receive the leave message. failed: 2 . . 2022-02-22T03:26:36.977888+00:00 FILE-1 pacemaker-fenced[12331]: notice: Requesting that FILE-1 perform 'off' action targeting FILE-2 Logs from 2nd node: 2022-02-22T03:26:36.738080+00:00 FILE-2 corosync[4989]: [TOTEM ] Failed to receive the leave message. failed: 1 . . Feb 22 03:26:38 FILE-2 pacemaker-fenced [5015] (call_remote_stonith) notice: Requesting that FILE-2 perform 'off' action targeting FILE-1 - When the nodes came up after unfencing, the DC got set after election - After that the resources which were expected to run on only one node became active on both (all) nodes of the cluster. 
2022-02-22T04:16:31.699186+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource stonith-sbd is active on 2 nodes (attempting recovery)
2022-02-22T04:16:31.699397+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
2022-02-22T04:16:31.699590+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource FILE_Filesystem is active on 2 nodes (attempting recovery)
2022-02-22T04:16:31.699731+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
2022-02-22T04:16:31.699878+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource IP_Floating is active on 2 nodes (attempting recovery)
2022-02-22T04:16:31.700027+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
2022-02-22T04:16:31.700203+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgresql is active on 2 nodes (attempting recovery)
2022-02-22T04:16:31.700354+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
2022-02-22T04:16:31.700501+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgrest is active on 2 nodes (attempting recovery)
2022-02-22T04:16:31.700648+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
2022-02-22T04:16:31.700792+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_esm_primary is active on 2 nodes (attempting recovery)
2022-02-22T04:16:31.700939+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
2022-02-22T04:16:31.701086+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Shared_Cluster_Backup is active on 2 nodes (attempting recovery)

Can you please help us understand whether this is indeed a split-brain scenario? Under what circumstances can such a scenario be observed? A recurrence could have a very serious impact in spite of stonith already being configured; hence the ask. In case this situation is reproduced, how can it be handled?

Note: We have stonith configured and it has been working fine so far. In this case too, the initial fencing happened via stonith.

Thanks in advance!
Priyanka

___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
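For context, a mutual-fencing race like the one above on a two-node cluster is commonly mitigated with corosync's two-node quorum settings plus a random fencing delay, so that both nodes cannot shoot each other at the same instant. The sketch below reuses the stonith-sbd resource name from the logs; the parameter values are illustrative, not taken from this cluster's actual configuration:

```
# corosync.conf -- votequorum settings for a 2-node cluster (illustrative)
quorum {
    provider: corosync_votequorum
    two_node: 1        # allow quorum to persist with one node
    wait_for_all: 1    # implied by two_node; require both nodes at first startup
}

# crmsh -- add a random delay before fencing to break fence races
# (pcmk_delay_max is a standard Pacemaker fencing-device parameter)
primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_max=15
```

With a random delay, one node usually fences first and the survivor keeps running, instead of both nodes issuing 'off' actions within the same second as seen in the logs above.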
[ClusterLabs] Recall: Resources too_active (active on all nodes of the cluster, instead of only 1 node)
Balotra, Priyanka would like to recall the message, "Resources too_active (active on all nodes of the cluster, instead of only 1 node)".
Re: [ClusterLabs] Corosync+Pacemaker error during failover
On 2015-10-08 21:20, Ken Gaillot wrote:
> On 10/08/2015 10:16 AM, priyanka wrote:
>> Hi,
>>
>> We are trying to build an HA setup for our servers using the DRBD + Corosync + Pacemaker stack. Attached are the configuration files for corosync/pacemaker and drbd.
>
> A few things I noticed:
>
> * Don't set become-primary-on in the DRBD configuration in a Pacemaker cluster; Pacemaker should handle all promotions to primary.
>
> * I'm no NFS expert, but why is res_exportfs_root cloned? Can both servers export it at the same time? I would expect it to be in the group before res_exportfs_export1.

We followed this configuration guide for our setup:
https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha_techguides/book_sleha_techguides.html
which suggests creating a clone of this resource. This resource does not export the actual data; the data is exported by the res_exportfs_export1 resource in our setup. I did try the previous failover scenario without cloning this resource, but the same error appeared.

> * Your constraints need some adjustment. Partly it depends on the answer to the previous question, but currently res_fs (via the group) is ordered after res_exportfs_root, and I don't see how that could work.

>> We are getting errors while testing this setup.
>>
>> 1. When we stop corosync on the Master machine, say server1 (lock), it is Stonith'ed. In this case the slave, server2 (sher), is promoted to master. But when server1 (lock) reboots, res_exportfs_export1 is started on both servers, and that resource goes into a failed state, followed by the servers going into an unclean state. Then server1 (lock) reboots, and server2 (sher) is master but in an unclean state. After server1 (lock) comes up, server2 (sher) is Stonith'ed and server1 (lock) is slave (the only online node). When server2 (sher) comes up, both servers are slaves and the resource group (rg_export) is stopped. Then server2 (sher) becomes Master, server1 (lock) is slave, and the resource group is started. At this point the configuration becomes stable.
>> PFA logs (syslog) of server2 (sher) from after it is promoted to master until it is first rebooted, when the exportfs resource goes into the failed state. Please let us know whether the configuration is appropriate. From the logs we could not figure out the exact reason for the resource failure. Your comments on this scenario will be very helpful.
>>
>> Thanks,
>> Priyanka

--
Regards,
Priyanka
MTech3 Sysad
IIT Powai
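Ken's constraint comment can be illustrated with a crmsh sketch. The resource names below (rg_export, ms_drbd_export, res_exportfs_root) follow the thread, but the clone name cl_exportfs_root, the constraint ids, and the exact structure are assumptions for illustration, not the poster's actual configuration:

```
# crmsh sketch (DRBD 8.x / Pacemaker 1.1 era syntax) -- illustrative only.
# Promote DRBD first, and run the export group only on the Master node:
colocation col_export_on_drbd inf: rg_export ms_drbd_export:Master
order o_drbd_before_export inf: ms_drbd_export:promote rg_export:start
# If res_exportfs_root remains a clone, the group must be ordered after
# the local clone instance rather than the other way around:
order o_root_before_export inf: cl_exportfs_root rg_export
```

The point of Ken's remark is the direction of the last ordering: if the group (containing res_fs) is instead ordered *before* res_exportfs_root, the root export can never come up in a sane sequence.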
Re: [ClusterLabs] Corosync+Pacemaker error during failover
On 2015-10-08 21:05, Digimer wrote:
> On 08/10/15 11:16 AM, priyanka wrote:
>> fencing resource-only;
>
> This needs to be 'fencing resource-and-stonith;'.

I did set the suggested parameter, but the error persists. Apparently the node which comes back after failover is not able to detect res_exportfs_root on the current master. Following is the log trace:

Jan 14 16:37:18 sher pengine[1383]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 14 16:37:18 sher pengine[1383]: warning: unpack_rsc_op: Processing failed op monitor for res_exportfs_root:0 on sher: not running (7)
Jan 14 16:37:18 sher pengine[1383]: warning: unpack_rsc_op: Processing failed op monitor for fence_lock on sher: unknown error (1)
Jan 14 16:37:18 sher pengine[1383]: error: native_create_actions: Resource res_exportfs_export1 (ocf::exportfs) is active on 2 nodes attempting recovery
Jan 14 16:37:18 sher pengine[1383]: warning: native_create_actions: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Jan 14 16:37:18 sher pengine[1383]: notice: LogActions: Start fence_sher#011(lock)
Jan 14 16:37:18 sher pengine[1383]: notice: LogActions: Start res_drbd_export:1#011(lock)
Jan 14 16:37:18 sher pengine[1383]: notice: LogActions: Restart res_exportfs_export1#011(Started sher)
Jan 14 16:37:18 sher pengine[1383]: notice: LogActions: Start res_nfsserver:1#011(lock)
Jan 14 16:37:18 sher pengine[1383]: error: process_pe_message: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-error-352.bz2
Jan 14 16:37:18 sher crmd[1384]: notice: te_rsc_command: Initiating action 11: start fence_sher_start_0 on lock
Jan 14 16:37:18 sher crmd[1384]: notice: te_rsc_command: Initiating action 50: stop res_exportfs_export1_stop_0 on lock
Jan 14 16:37:18 sher crmd[1384]: notice: te_rsc_command: Initiating action 49: stop res_exportfs_export1_stop_0 on sher (local)
Jan 14 16:37:18 sher crmd[1384]: notice: te_rsc_command: Initiating action 68: monitor res_exportfs_root_monitor_3 on lock
Jan 14 16:37:18 sher crmd[1384]: notice: te_rsc_command: Initiating action 76: notify res_drbd_export_pre_notify_start_0 on sher (local)
Jan 14 16:37:18 sher crmd[1384]: notice: te_rsc_command: Initiating action 58: start res_nfsserver_start_0 on lock

I have pacemaker 1.1.10 installed in my setup; should I try an upgrade?

--
Regards,
Priyanka
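For completeness, the DRBD-level fencing Digimer refers to is configured in drbd.conf together with handler scripts that ship with DRBD. This is a sketch for DRBD 8.x; the resource name "export" is an assumption, not taken from the poster's attached configuration:

```
# drbd.conf (DRBD 8.x) -- resource-level fencing tied to Pacemaker.
# "export" is an assumed resource name for illustration.
resource export {
  disk {
    fencing resource-and-stonith;   # freeze I/O and ask Pacemaker to fence the peer
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

With resource-and-stonith, DRBD suspends I/O until the fence-peer handler confirms the peer is fenced, which is what prevents both nodes from writing as Primary during a membership split.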
Re: [ClusterLabs] Corosync+Pacemaker error during failover
On 2015-10-08 20:52, emmanuel segura wrote:
> please check if your drbd is configured to call the fence-handler
> https://drbd.linbit.com/users-guide/s-pacemaker-fencing.html

Yes.

> 2015-10-08 17:16 GMT+02:00 priyanka:
>> Hi,
>>
>> We are trying to build an HA setup for our servers using the DRBD + Corosync + Pacemaker stack. Attached are the configuration files for corosync/pacemaker and drbd.
>>
>> We are getting errors while testing this setup.
>>
>> 1. When we stop corosync on the Master machine, say server1 (lock), it is Stonith'ed. In this case the slave, server2 (sher), is promoted to master. But when server1 (lock) reboots, res_exportfs_export1 is started on both servers, and that resource goes into a failed state, followed by the servers going into an unclean state. Then server1 (lock) reboots, and server2 (sher) is master but in an unclean state. After server1 (lock) comes up, server2 (sher) is Stonith'ed and server1 (lock) is slave (the only online node). When server2 (sher) comes up, both servers are slaves and the resource group (rg_export) is stopped. Then server2 (sher) becomes Master, server1 (lock) is slave, and the resource group is started. At this point the configuration becomes stable.
>>
>> PFA logs (syslog) of server2 (sher) from after it is promoted to master until it is first rebooted, when the exportfs resource goes into the failed state. Please let us know whether the configuration is appropriate. From the logs we could not figure out the exact reason for the resource failure. Your comments on this scenario will be very helpful.
>>
>> Thanks,
>> Priyanka

--
Regards,
Priyanka
MTech3 Sysad
IIT Powai
[ClusterLabs] Corosync+Pacemaker error during failover
Hi,

We are trying to build an HA setup for our servers using the DRBD + Corosync + Pacemaker stack. Attached are the configuration files for corosync/pacemaker and drbd.

We are getting errors while testing this setup.

1. When we stop corosync on the Master machine, say server1 (lock), it is Stonith'ed. In this case the slave, server2 (sher), is promoted to master. But when server1 (lock) reboots, res_exportfs_export1 is started on both servers, and that resource goes into a failed state, followed by the servers going into an unclean state. Then server1 (lock) reboots, and server2 (sher) is master but in an unclean state. After server1 (lock) comes up, server2 (sher) is Stonith'ed and server1 (lock) is slave (the only online node). When server2 (sher) comes up, both servers are slaves and the resource group (rg_export) is stopped. Then server2 (sher) becomes Master, server1 (lock) is slave, and the resource group is started. At this point the configuration becomes stable.

PFA logs (syslog) of server2 (sher) from after it is promoted to master until it is first rebooted, when the exportfs resource goes into the failed state. Please let us know whether the configuration is appropriate. From the logs we could not figure out the exact reason for the resource failure. Your comments on this scenario will be very helpful.

Thanks,
Priyanka

sher (new master) =>

Oct 8 18:01:20 sher kernel: [ 886.867496] e1000e: eth0 NIC Link is Down
Oct 8 18:01:22 sher exportfs(res_exportfs_root)[5566]: INFO: Directory /mnt/vms is exported to 10.105.0.0/255.255.0.0 (started).
Oct 8 18:01:27 sher exportfs(res_exportfs_export1)[5580]: INFO: Directory /mnt/vms/export1 is exported to 10.105.0.0/255.255.0.0 (started).
Oct 8 18:01:30 sher kernel: [ 896.771854] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Oct 8 18:01:30 sher corosync[1320]: [TOTEM ] A new membership (192.168.0.21:3444) was formed. Members joined: 102
Oct 8 18:01:30 sher crmd[1465]: error: pcmk_cpg_membership: Node lock[102] appears to be online even though we think it is dead
Oct 8 18:01:30 sher crmd[1465]: notice: crm_update_peer_state: pcmk_cpg_membership: Node lock[102] - state is now member (was lost)
Oct 8 18:01:30 sher corosync[1320]: [QUORUM] Members[2]: 101 102
Oct 8 18:01:30 sher corosync[1320]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 8 18:01:30 sher pacemakerd[1458]: notice: crm_update_peer_state: pcmk_quorum_notification: Node lock[102] - state is now member (was lost)
Oct 8 18:01:32 sher exportfs(res_exportfs_root)[5652]: INFO: Directory /mnt/vms is exported to 10.105.0.0/255.255.0.0 (started).
Oct 8 18:01:37 sher exportfs(res_exportfs_export1)[5666]: INFO: Directory /mnt/vms/export1 is exported to 10.105.0.0/255.255.0.0 (started).
Oct 8 18:01:43 sher exportfs(res_exportfs_root)[5738]: INFO: Directory /mnt/vms is exported to 10.105.0.0/255.255.0.0 (started).
Oct 8 18:01:47 sher exportfs(res_exportfs_export1)[5779]: INFO: Directory /mnt/vms/export1 is exported to 10.105.0.0/255.255.0.0 (started).
Oct 8 18:01:49 sher crmd[1465]: notice: do_state_transition: State transition S_IDLE -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Oct 8 18:01:49 sher crmd[1465]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-res_exportfs_root (2)
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-res_drbd_export (1)
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-res_exportfs_root (1444306659)
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-fence_lock (INFINITY)
Oct 8 18:01:49 sher attrd[1463]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-fence_lock (1444306771)
Oct 8 18:01:50 sher pengine[1464]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 8 18:01:50 sher pengine[1464]: warning: unpack_rsc_op: Processing failed op monitor for res_exportfs_root:0 on sher: not running (7)
Oct 8 18:01:50 sher pengine[1464]: warning: unpack_rsc_op: Processing failed op start for fence_lock on sher: unknown error (1)
Oct 8 18:01:50 sher pengine[1464]: warning: common_apply_stickiness: Forcing fence_lock away from sher after 100 failures (max=100)
Oct 8 18:01:50 sher pengine[1464]: notice: LogActions: Start fence_sher#011(lock)
Oct 8 18:01:50 sher pengine[1464]: notice: LogActions: Start
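The "fail-count-fence_lock (INFINITY)" and "Forcing fence_lock away from sher" entries above come from Pacemaker's fail-count mechanism: a failed start sets the fail count to INFINITY, which bans the resource from that node until the count is cleared. As a hedged sketch (illustrative values, not this cluster's configuration), the relevant knobs in crmsh are:

```
# crmsh sketch -- illustrative values.
# migration-threshold: failures allowed before a resource is banned from a node;
# failure-timeout: how long before fail counts expire automatically.
rsc_defaults migration-threshold=3 failure-timeout=10min
```

A stuck fail count such as the INFINITY above can also be cleared manually with `crm resource cleanup fence_lock`, after which the scheduler will consider the node again.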