Re: [ClusterLabs] Resources restart when a node joins in

2020-08-28 Thread Reid Wahl
No problem! That's what we're here for. I'm glad it's sorted out :)

On Fri, Aug 28, 2020 at 12:27 AM Citron Vert 
wrote:

> Hi,
>
> You are right, the problems seem to come from some services that are
> being started at boot.
>
> My installation script disables automatic startup for all the services we
> use, which is why I didn't focus on this possibility.
>
> But after a quick investigation, it turned out a colleague had the good
> idea of adding a "security" script that monitors and starts certain services.
>
>
> Sorry to have contacted you for this little mistake,
>
> Thank you for the help, it was effective
>
> Quentin
>
>
>
> Le 27/08/2020 à 09:56, Reid Wahl a écrit :
>
> Hi, Quentin. Thanks for the logs!
>
> I see you highlighted the fact that SERVICE1 was in "Stopping" state on
> both node 1 and node 2 when node 1 was rejoining the cluster. I also noted
> the following later in the logs, as well as some similar messages earlier:
>
> Aug 27 08:47:02 [1330] NODE2pengine: info: determine_op_status:   
> Operation monitor found resource SERVICE1 active on NODE1
> Aug 27 08:47:02 [1330] NODE2pengine: info: determine_op_status:   
> Operation monitor found resource SERVICE1 active on NODE1
> Aug 27 08:47:02 [1330] NODE2pengine: info: determine_op_status:   
> Operation monitor found resource SERVICE4 active on NODE2
> Aug 27 08:47:02 [1330] NODE2pengine: info: determine_op_status:   
> Operation monitor found resource SERVICE1 active on NODE2
> ...
> Aug 27 08:47:02 [1330] NODE2pengine: info: common_print:  
> 1 : NODE1
> Aug 27 08:47:02 [1330] NODE2pengine: info: common_print:  
> 2 : NODE2
> ...
> Aug 27 08:47:02 [1330] NODE2pengine:error: native_create_actions: 
> Resource SERVICE1 is active on 2 nodes (attempting recovery)
> Aug 27 08:47:02 [1330] NODE2pengine:   notice: native_create_actions: 
> See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more 
> information
>
>
> Can you make sure that all the cluster-managed systemd services are disabled 
> from starting at boot (i.e., `systemctl is-enabled service1`, and the same 
> for all the others) on both nodes? If they are enabled, disable them.
>
>
> On Thu, Aug 27, 2020 at 12:46 AM Citron Vert 
> wrote:
>
>> Hi,
>>
>> Sorry for using this email address, my name is Quentin. Thank you for your
>> reply.
>>
>> I have already tried the stickiness solution (with the deprecated
>> value). I tried the one you gave me, and it does not change anything.
>>
>> Resources don't seem to move from node to node (I don't see the changes
>> with crm_mon command).
>>
>>
>> In the logs I found this line *"error: native_create_actions:
>> Resource SERVICE1 is active on 2 nodes*"
>>
>> Which led me to contact you to understand and learn a little more about
>> this cluster, and why there are resources running on the passive node.
>>
>>
>> You will find attached the logs during the reboot of the passive node and
>> my cluster configuration.
>>
>> I think I'm missing something in the configuration / logs that I
>> don't understand.
>>
>>
>> Thank you in advance for your help,
>>
>> Quentin
>>
>>
>> Le 26/08/2020 à 20:16, Reid Wahl a écrit :
>>
>> Hi, Citron.
>>
>> Based on your description, it sounds like some resources **might** be
>> moving from node 1 to node 2, failing on node 2, and then moving back to
>> node 1. If that's what's happening (and even if it's not), then it's
>> probably smart to set some resource stickiness as a resource default. The
>> below command sets a resource stickiness score of 1.
>>
>> # pcs resource defaults resource-stickiness=1
>>
>> Also note that the "default-resource-stickiness" cluster property is
>> deprecated and should not be used.
>>
>> Finally, an explicit default resource stickiness score of 0 can interfere
>> with the placement of cloned resource instances. If you don't want any
>> stickiness, then it's better to leave stickiness unset. That way,
>> primitives will have a stickiness of 0, but clone instances will have a
>> stickiness of 1.
>>
>> If adding stickiness does not resolve the issue, can you share your
>> cluster configuration and some logs that show the issue happening? Off the
>> top of my head I'm not sure why resources would start and stop on node 2
>> without moving away from node1, unless they're clone instances that are
>> starting and then failing a monitor operation on node 2.

Re: [ClusterLabs] Resources restart when a node joins in

2020-08-26 Thread Reid Wahl
Hi, Citron.

Based on your description, it sounds like some resources **might** be
moving from node 1 to node 2, failing on node 2, and then moving back to
node 1. If that's what's happening (and even if it's not), then it's
probably smart to set some resource stickiness as a resource default. The
below command sets a resource stickiness score of 1.

# pcs resource defaults resource-stickiness=1

Also note that the "default-resource-stickiness" cluster property is
deprecated and should not be used.

Finally, an explicit default resource stickiness score of 0 can interfere
with the placement of cloned resource instances. If you don't want any
stickiness, then it's better to leave stickiness unset. That way,
primitives will have a stickiness of 0, but clone instances will have a
stickiness of 1.
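
If the deprecated cluster property is what you currently have set, a rough
sketch of switching over (assuming pcs, and that the property actually
exists in your CIB) would be:

# pcs property unset default-resource-stickiness
# pcs resource defaults resource-stickiness=1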

If adding stickiness does not resolve the issue, can you share your cluster
configuration and some logs that show the issue happening? Off the top of
my head I'm not sure why resources would start and stop on node 2 without
moving away from node1, unless they're clone instances that are starting
and then failing a monitor operation on node 2.

On Wed, Aug 26, 2020 at 8:42 AM Citron Vert  wrote:

> Hello,
> I am contacting you because I have a problem with my cluster and I cannot
> find (nor understand) any information that can help me.
>
> I have a 2-node cluster (pacemaker, corosync, pcs) installed on CentOS 7
> with a set of configuration.
> Everything seems to work fine, but here is what happens:
>
>- Node1 and Node2 are running well with Node1 as primary
>- I reboot Node2 which is passive (no changes on Node1)
>- Node2 comes back in the cluster as passive
>- corosync logs show resources getting started then stopped on Node2
>- "crm_mon" command shows some ressources on Node1 getting restarted
>
> I don't understand how this is supposed to work.
> If a node comes back and becomes passive (since Node1 is running as
> primary), there should be no reason for the resources to be started and
> then stopped on the new passive node, right?
>
> One of my resources becomes unstable because it gets started and then
> stopped too quickly on Node2, which seems to make it restart on Node1 without
> a failover.
>
> I tried several things and solutions proposed by different sites and forums
> but without success.
>
>
> Is there a way so that the node, which joins the cluster as passive, does
> not start its own resources?
>
>
> Thanks in advance
>
>
> Here are some information just in case :
> $ rpm -qa | grep -E "corosync|pacemaker|pcs"
> corosync-2.4.5-4.el7.x86_64
> pacemaker-cli-1.1.21-4.el7.x86_64
> pacemaker-1.1.21-4.el7.x86_64
> pcs-0.9.168-4.el7.centos.x86_64
> corosynclib-2.4.5-4.el7.x86_64
> pacemaker-libs-1.1.21-4.el7.x86_64
> pacemaker-cluster-libs-1.1.21-4.el7.x86_64
>
>
> <nvpair name="stonith-enabled" value="false"/>
> <nvpair name="no-quorum-policy" value="ignore"/>
> <nvpair name="..." value="120s"/>
> <nvpair name="have-watchdog" value="false"/>
> <nvpair name="dc-version" value="1.1.21-4.el7-f14e36fd43"/>
> <nvpair name="cluster-infrastructure" value="corosync"/>
> <nvpair name="cluster-name" value="CLUSTER"/>
> <nvpair name="last-lrm-refresh" value="1598446314"/>
> <nvpair name="default-resource-stickiness" value="0"/>
>
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Resources restart when a node joins in

2020-08-27 Thread Reid Wahl
Hi, Quentin. Thanks for the logs!

I see you highlighted the fact that SERVICE1 was in "Stopping" state on
both node 1 and node 2 when node 1 was rejoining the cluster. I also noted
the following later in the logs, as well as some similar messages earlier:

Aug 27 08:47:02 [1330] NODE2pengine: info:
determine_op_status:   Operation monitor found resource SERVICE1
active on NODE1
Aug 27 08:47:02 [1330] NODE2pengine: info:
determine_op_status:   Operation monitor found resource SERVICE1
active on NODE1
Aug 27 08:47:02 [1330] NODE2pengine: info:
determine_op_status:   Operation monitor found resource SERVICE4
active on NODE2
Aug 27 08:47:02 [1330] NODE2pengine: info:
determine_op_status:   Operation monitor found resource SERVICE1
active on NODE2
...
Aug 27 08:47:02 [1330] NODE2pengine: info: common_print:
   1 : NODE1
Aug 27 08:47:02 [1330] NODE2pengine: info: common_print:
   2 : NODE2
...
Aug 27 08:47:02 [1330] NODE2pengine:error:
native_create_actions: Resource SERVICE1 is active on 2 nodes
(attempting recovery)
Aug 27 08:47:02 [1330] NODE2pengine:   notice:
native_create_actions: See
https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
information

Can you make sure that all the cluster-managed systemd services are
disabled from starting at boot (i.e., `systemctl is-enabled service1`,
and the same for all the others) on both nodes? If they are enabled,
disable them.
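
For example (a sketch; "service1" and "service2" stand in for whatever
systemd units the cluster manages):

  # systemctl is-enabled service1 service2
  # systemctl disable service1 service2

Leave the actual stopping and starting of those units to Pacemaker; only
the boot-time enablement needs to go away.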


On Thu, Aug 27, 2020 at 12:46 AM Citron Vert 
wrote:

> Hi,
>
> Sorry for using this email address, my name is Quentin. Thank you for your
> reply.
>
> I have already tried the stickiness solution (with the deprecated  value).
> I tried the one you gave me, and it does not change anything.
>
> Resources don't seem to move from node to node (I don't see the changes
> with crm_mon command).
>
>
> In the logs I found this line *"error: native_create_actions:
> Resource SERVICE1 is active on 2 nodes*"
>
> Which led me to contact you to understand and learn a little more about
> this cluster, and why there are resources running on the passive node.
>
>
> You will find attached the logs during the reboot of the passive node and
> my cluster configuration.
>
> I think I'm missing something in the configuration / logs that I
> don't understand.
>
>
> Thank you in advance for your help,
>
> Quentin
>
>
> Le 26/08/2020 à 20:16, Reid Wahl a écrit :
>
> Hi, Citron.
>
> Based on your description, it sounds like some resources **might** be
> moving from node 1 to node 2, failing on node 2, and then moving back to
> node 1. If that's what's happening (and even if it's not), then it's
> probably smart to set some resource stickiness as a resource default. The
> below command sets a resource stickiness score of 1.
>
> # pcs resource defaults resource-stickiness=1
>
> Also note that the "default-resource-stickiness" cluster property is
> deprecated and should not be used.
>
> Finally, an explicit default resource stickiness score of 0 can interfere
> with the placement of cloned resource instances. If you don't want any
> stickiness, then it's better to leave stickiness unset. That way,
> primitives will have a stickiness of 0, but clone instances will have a
> stickiness of 1.
>
> If adding stickiness does not resolve the issue, can you share your
> cluster configuration and some logs that show the issue happening? Off the
> top of my head I'm not sure why resources would start and stop on node 2
> without moving away from node1, unless they're clone instances that are
> starting and then failing a monitor operation on node 2.
>
> On Wed, Aug 26, 2020 at 8:42 AM Citron Vert 
> wrote:
>
>> Hello,
>> I am contacting you because I have a problem with my cluster and I cannot
>> find (nor understand) any information that can help me.
>>
>> I have a 2-node cluster (pacemaker, corosync, pcs) installed on CentOS 7
>> with a set of configuration.
>> Everything seems to work fine, but here is what happens:
>>
>>- Node1 and Node2 are running well with Node1 as primary
>>- I reboot Node2 which is passive (no changes on Node1)
>>- Node2 comes back in the cluster as passive
>>- corosync logs show resources getting started then stopped on Node2
>>- "crm_mon" command shows some ressources on Node1 getting restarted
>>
>> I don't understand how this is supposed to work.
>> If a node comes back and becomes passive (since Node1 is running as
>> primary), there should be no reason for the resources to be started and
>> then stopped on the new passive node, right?
>>
>> One of my resources becomes unstable because it gets started and then
>> stopped too quickly on Node2, which seems to make it restart on Node1
>> without a failover.

Re: [ClusterLabs] VirtualDomain stop operation traced - but nothing appears in /var/lib/heartbeat/trace_ra/

2020-09-29 Thread Reid Wahl
If you set trace_ra=1 as an instance attribute of the primitive
(rather than of the stop operation), it should capture a trace of all
operations, including stop. I know this isn't exactly what you're
asking, since tracing every operation will result in more logs.
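
Something like this should do it (a sketch in crm shell syntax, matching the
configuration quoted below), with trace_ra moved from op_params on the stop
op to the resource's params:

    primitive vm_amok VirtualDomain \
            params config="/mnt/share/vm_amok.xml" trace_ra=1 \
            params hypervisor="qemu:///system" \
            ...

The trace files should then show up under /var/lib/heartbeat/trace_ra/ for
every operation, not just stop.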

With that being said, I tested on an ocf:heartbeat:Dummy resource.
When I added trace_ra=1 as an instance attribute of the monitor
operation, it **did** capture a trace of recurring monitors. When I
added trace_ra=1 as an instance attribute of the stop operation, it
**did not** capture a trace of the stop.

I don't know off the top of my head how Pacemaker handles
<instance_attributes> that are nested within an <op>, so I'm not sure
why there's a discrepancy between monitor and stop operations.

On Mon, Sep 28, 2020 at 12:43 PM Lentes, Bernd
 wrote:
>
> Hi,
>
> Currently I have a VirtualDomain resource which sometimes fails to stop.
> To investigate further, I'm tracing the stop operation of this resource.
> But although I have already stopped it several times now, nothing appears in
> /var/lib/heartbeat/trace_ra/.
>
> This is my config:
> primitive vm_amok VirtualDomain \
> params config="/mnt/share/vm_amok.xml" \
> params hypervisor="qemu:///system" \
> params migration_transport=ssh \
> params migrate_options="--p2p --tunnelled" \
> op start interval=0 timeout=120 \
> op monitor interval=30 timeout=25 \
> op migrate_from interval=0 timeout=300 \
> op migrate_to interval=0 timeout=300 \
> op stop interval=0 timeout=180 \
> op_params trace_ra=1 \
> meta allow-migrate=true target-role=Started is-managed=true 
> maintenance=false \
>
> <primitive id="vm_amok" class="ocf" provider="heartbeat" type="VirtualDomain">
>   <instance_attributes id="vm_amok-instance_attributes">
>     <nvpair name="config" value="/mnt/share/vm_amok.xml" id="vm_amok-instance_attributes-config"/>
>   </instance_attributes>
>   <instance_attributes id="vm_amok-instance_attributes-0">
>     <nvpair name="hypervisor" value="qemu:///system" id="vm_amok-instance_attributes-0-hypervisor"/>
>   </instance_attributes>
>   <instance_attributes id="vm_amok-instance_attributes-1">
>     <nvpair name="migration_transport" value="ssh" id="vm_amok-instance_attributes-1-migration_transport"/>
>   </instance_attributes>
>   <instance_attributes id="vm_amok-instance_attributes-2">
>     <nvpair name="migrate_options" value="--p2p --tunnelled" id="vm_amok-instance_attributes-2-migrate_options"/>
>   </instance_attributes>
>   <operations>
>     <op name="start" interval="0" timeout="120" id="vm_amok-start-0"/>
>     <op name="monitor" interval="30" timeout="25" id="vm_amok-monitor-30"/>
>     <op name="migrate_from" interval="0" timeout="300" id="vm_amok-migrate_from-0"/>
>     <op name="migrate_to" interval="0" timeout="300" id="vm_amok-migrate_to-0"/>
>     <op name="stop" interval="0" timeout="180" id="vm_amok-stop-0">
>       <instance_attributes id="vm_amok-stop-0-instance_attributes">
>         <nvpair name="trace_ra" value="1" id="vm_amok-stop-0-instance_attributes-trace_ra"/>
>       </instance_attributes>
>     </op>
>   </operations>
> </primitive>
>
> Any ideas ?
> SLES 12 SP4, pacemaker-1.1.19+20181105.ccd6b5b10-3.13.1.x86_64
>
> Bernd
>
> --
>
> Bernd Lentes
> Systemadministration
> Institute for Metabolism and Cell Death (MCD)
> Building 25 - office 122
> HelmholtzZentrum München
> bernd.len...@helmholtz-muenchen.de
> phone: +49 89 3187 1241
> phone: +49 89 3187 3827
> fax: +49 89 3187 2294
> http://www.helmholtz-muenchen.de/mcd
>
> stay healthy
> Helmholtz Zentrum München
>
> Helmholtz Zentrum München
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker not starting

2020-09-23 Thread Reid Wahl
Please also share /etc/cluster/cluster.conf. Do you have `two_node="1"
expected_votes="1"` in the <cman> element of cluster.conf?

This is technically a cman startup issue. Pacemaker is waiting for
cman to start and form quorum through corosync first.
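
For a two-node cman cluster that element typically looks roughly like this
(a sketch; the rest of cluster.conf is omitted):

  <cman two_node="1" expected_votes="1"/>

Without those attributes, a lone node can never reach quorum on its own,
which would match the "timed-out waiting for cluster" error you're seeing.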

On Wed, Sep 23, 2020 at 9:55 AM Strahil Nikolov  wrote:
>
> What is the output of 'corosync-quorumtool -s' on both nodes ?
> What is your cluster's configuration :
>
> 'crm configure show' or 'pcs config'
>
>
> Best Regards,
> Strahil Nikolov
>
>
>
>
>
>
> On Wednesday, September 23, 2020 at 16:07:16 GMT+3, Ambadas Kawle
> wrote:
>
>
>
>
>
> Hello All
>
> We have a 2-node MySQL cluster and we are not able to start Pacemaker on
> one of the nodes (the slave node).
> We are getting error "waiting for quorum... timed-out waiting for cluster"
>
> Following are the package details:
> pacemaker pacemaker-1.1.15-5.el6.x86_64
> pacemaker-libs-1.1.15-5.el6.x86_64
> pacemaker-cluster-libs-1.1.15-5.el6.x86_64
> pacemaker-cli-1.1.15-5.el6.x86_64
>
> Corosync corosync-1.4.7-6.el6.x86_64
> corosynclib-1.4.7-6.el6.x86_64
>
> Mysql mysql-5.1.73-7.el6.x86_64
> "mysql-connector-python-2.0.4-1.el6.noarch
>
> Your help is appreciated. Thanks, Ambadas Kawle
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
> _______
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Determine a resource's current host in the CIB

2020-09-24 Thread Reid Wahl
**Directly via the CIB**, I don't see a more obvious way than looking
for the most recent (perhaps by last-rc-change) successful
(rc-code="0" or rc-code="8") monitor operation. That might be
error-prone. I haven't looked into exactly how crm_simulate parses
resource status from the CIB XML yet. Others on the list might know.

Is there a particular reason why you need to parse the status directly
from the CIB, as opposed to using other tools? Does your use case
allow you to use crm_simulate with the cib.xml as input? (e.g.,
`crm_simulate --xml-file=<path to cib.xml>`)
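
For example (a sketch; the resource name is the one from your pastebin, and
the CIB path is a placeholder):

  # crm_resource --resource srv07-el6 --locate     # live cluster
  # crm_simulate --xml-file=/path/to/cib.xml       # saved CIB copy

Both report where the resource is currently active, based on the same
status section you've been reading by hand.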

On Wed, Sep 23, 2020 at 11:04 PM Digimer  wrote:
>
> Hi all,
>
>   I'm trying to parse the CIB to determine which node a given resource
> (VM) is currently running on. I notice that the 'monitor' shows in both
> node's status element (from when it last ran when the node previously
> hosted the resource).
>
> https://pastebin.com/6RCMWdgq
>
> Specifically, I see under node 1 (the active host when the CIB was read):
>
>  operation_key="srv07-el6_monitor_6" operation="monitor"
> crm-debug-origin="do_update_resource" crm_feature_set="3.3.0"
> transition-key="23:85:0:829209fd-35f2-4626-a9cd-f8a50a62871e"
> transition-magic="0:0;23:85:0:829209fd-35f2-4626-a9cd-f8a50a62871e"
> exit-reason="" on_node="mk-a02n01" call-id="76" rc-code="0"
> op-status="0" interval="6" last-rc-change="1600925201"
> exec-time="541" queue-time="0"
> op-digest="65d0f0c9227f2593835f5de6c9cb9d0e"/>
>
> And under node 2 (hosted the server in the past):
>
>  operation_key="srv07-el6_monitor_6" operation="monitor"
> crm-debug-origin="do_update_resource" crm_feature_set="3.3.0"
> transition-key="23:83:0:829209fd-35f2-4626-a9cd-f8a50a62871e"
> transition-magic="0:0;23:83:0:829209fd-35f2-4626-a9cd-f8a50a62871e"
> exit-reason="" on_node="mk-a02n02" call-id="61" rc-code="0"
> op-status="0" interval="6" last-rc-change="1600925173"
> exec-time="539" queue-time="0"
> op-digest="65d0f0c9227f2593835f5de6c9cb9d0e"/>
>
> I don't see any specific entry in the CIB saying "resource X is
> currently hosted on node Y", so I assume I should infer which node is
> the current host? If so, should I look at which node's 'exec-time' is
> higher, or which node has the higher 'call-id'?
>
> Or am I missing a more obvious way to tell what resource is running on
> which node?
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ocfs2 + pacemaker

2020-09-22 Thread Reid Wahl
s/oc2b/o2cb/

On Tue, Sep 22, 2020 at 9:30 PM Reid Wahl  wrote:
>
> I'm unable to find where to get the ocf:ocfs2:oc2b resource agent. I
> suspect you have to get it directly from Oracle somehow. Since the
> provider is "ocfs2" rather than "heartbeat' or "pacemaker", it may not
> be part of the ClusterLabs project.
>
> On Tue, Sep 22, 2020 at 2:26 PM Michael Ivanov  wrote:
> >
> > Hallo,
> >
> > I am trying to get ocfs2 running under pacemaker. The description I found at
> > https://wiki.clusterlabs.org/wiki/Dual_Primary_DRBD_+_OCFS2#The_o2cb_Service
> > refers to ocf:ocfs2:o2cb resource. But I cannot find it anywhere.
> >
> > I'm building the cluster using debian/testing (pacemaker 2.0.4, ocfs-tools 
> > 1.8.6)
> >
> > Best regards,
> > --
> >  \   / |   |
> >  (OvO) |  Mikhail Iwanow   |
> >  (^^^) |   |
> >   \^/  |  E-mail:  iv...@logit-ag.de   |
> >   ^ ^  |   |
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> >
>
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA



-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ocfs2 + pacemaker

2020-09-22 Thread Reid Wahl
I'm unable to find where to get the ocf:ocfs2:oc2b resource agent. I
suspect you have to get it directly from Oracle somehow. Since the
provider is "ocfs2" rather than "heartbeat' or "pacemaker", it may not
be part of the ClusterLabs project.
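
One quick way to check what the installed packages actually provide (a
sketch; these crm_resource options are available in pacemaker 2.x):

  # crm_resource --list-ocf-providers
  # crm_resource --list-agents ocf:ocfs2

If the second command fails, no installed package ships an "ocfs2" OCF
provider, which would confirm the agent has to come from elsewhere.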

On Tue, Sep 22, 2020 at 2:26 PM Michael Ivanov  wrote:
>
> Hallo,
>
> I am trying to get ocfs2 running under pacemaker. The description I found at
> https://wiki.clusterlabs.org/wiki/Dual_Primary_DRBD_+_OCFS2#The_o2cb_Service
> refers to ocf:ocfs2:o2cb resource. But I cannot find it anywhere.
>
> I'm building the cluster using debian/testing (pacemaker 2.0.4, ocfs-tools 
> 1.8.6)
>
> Best regards,
> --
>  \   / |   |
>  (OvO) |  Mikhail Iwanow   |
>  (^^^) |   |
>   \^/  |  E-mail:  iv...@logit-ag.de   |
>   ^ ^  |   |
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Maximum cluster size with Pacemaker 2.x and Corosync 3.x, and scaling to hundreds of nodes

2020-07-29 Thread Reid Wahl
Addressing only the first paragraph of your message, inline below. I'll
have to defer to others to answer the remainder.

On Wed, Jul 29, 2020 at 4:12 PM Toby Haynes  wrote:

> In Corosync 1.x there was a limit on the maximum number of active nodes in
> a corosync cluster - browsing the mailing list suggests 64 hosts. The Pacemaker
> 1.1 documentation says scalability goes up to 16 nodes. The Pacemaker 2.0
> documentation says the same, although I can't find a maximum number of
> nodes in Corosync 3.
>

I'm assuming that you're referring to the Pacemaker Remote document, as I
can't find any reference to 16 nodes in the other ClusterLabs docs.

Red Hat supports clusters with up to 32 full nodes as of RHEL 8.1. That
didn't require any change to corosync; it already worked and simply had to
be verified. So the Pacemaker Remote doc may need an update to say 32 nodes.

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Remote/
> discusses deployments up to 64 hosts but it appears to reference Pacemaker
> 1.16.
>
> With the arrival of Corosync 3.x (and Pacemaker 2.x), how large a cluster
> can be supported? If we want to get to a cluster with 100+ nodes, what are
> the best design approaches, especially if there is no clear hierarchy to
> the nodes in use (i.e. all of the hosts are important!).
>
> Are there performance implications when comparing the operation of a
> pacemaker remote node to a full stack pacemaker node?
>
> Thanks,
>
> Toby Haynes
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Reid Wahl
I don't know of a stonith method that acts upon a filesystem directly.
You'd generally want to act upon the power state of the node or upon the
underlying shared storage.

What kind of hardware or virtualization platform are these systems running
on? If there is a hardware watchdog timer, then sbd is possible. The
fence_sbd agent (poison-pill fencing via block device) requires shared
block storage, but sbd itself only requires a hardware watchdog timer.

Additionally, there may be an existing fence agent that can connect to the
controller you mentioned. What kind of controller is it?

On Wed, Jul 29, 2020 at 5:24 AM Gabriele Bulfon  wrote:

> Thanks a lot for the extensive explanation!
> Any idea about a ZFS stonith?
>
> Gabriele
>
>
>
> *Sonicle S.r.l. *: http://www.sonicle.com
> *Music: *http://www.gabrielebulfon.com
> *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
>
> ------
>
>
> *Da:* Reid Wahl 
> *A:* Cluster Labs - All topics related to open-source clustering welcomed
> 
> *Data:* 29 luglio 2020 11.39.35 CEST
> *Oggetto:* Re: [ClusterLabs] Antw: [EXT] Stonith failing
>
>
> "As it stated in the comments, we don't want to halt or boot via ssh, only
> reboot."
>
> Generally speaking, a stonith reboot action consists of the following
> basic sequence of events:
>
>1. Execute the fence agent with the "off" action.
>2. Poll the power status of the fenced node until it is powered off.
>3. Execute the fence agent with the "on" action.
>4. Poll the power status of the fenced node until it is powered on.
>
> So a custom fence agent that supports reboots, actually needs to support
> off and on actions.
>
>
> As Andrei noted, ssh is **not** a reliable method by which to ensure a
> node gets rebooted or stops using cluster-managed resources. You can't
> depend on the ability to SSH to an unhealthy node that needs to be fenced.
>
> The only way to guarantee that an unhealthy or unresponsive node stops all
> access to shared resources is to power off or reboot the node. (In the case
> of resources that rely on shared storage, I/O fencing instead of power
> fencing can also work, but that's not ideal.)
>
> As others have said, SBD is a great option. Use it if you can. There are
> also power fencing methods (one example is fence_ipmilan, but the options
> available depend on your hardware or virt platform) that are reliable under
> most circumstances.
>
> You said that when you stop corosync on node 2, Pacemaker tries to fence
> node 2. There are a couple of possible reasons for that. One possibility is
> that you stopped or killed corosync without stopping Pacemaker first. (If
> you use pcs, then try `pcs cluster stop`.) Another possibility is that
> resources failed to stop during cluster shutdown on node 2, causing node 2
> to be fenced.
>
> On Wed, Jul 29, 2020 at 12:47 AM Andrei Borzenkov 
> wrote:
>
>>
>>
>> On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon 
>> wrote:
>>
>>> That one was taken from a specific implementation on Solaris 11.
>>> The situation is a dual node server with shared storage controller: both
>>> nodes see the same disks concurrently.
>>> Here we must be sure that the two nodes are not going to import/mount
>>> the same zpool at the same time, or we will encounter data corruption:
>>>
>>
>> ssh based "stonith" cannot guarantee it.
>>
>>
>>> node 1 will be preferred for pool 1, node 2 for pool 2; only in case one
>>> of the nodes goes down or is taken offline should the resources first be
>>> freed by the leaving node and taken over by the other node.
>>>
>>> Would you suggest one of the available stonith in this case?
>>>
>>>
>>
>> IPMI, managed PDU, SBD ...
>> In practice, the only stonith method that works in case of complete node
>> outage including any power supply is SBD.
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>
>
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
>
> ___
> Manage your subscription:https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Reid Wahl
On Wed, Jul 29, 2020 at 10:45 PM Strahil Nikolov 
wrote:

> You got plenty of options:
> -  IPMI based  fencing like  HP iLO,  DELL iDRAC
> -  SCSI-3  persistent reservations (which can be extended to fence  the
> node when the reservation(s)  were  removed)
>
> - Shared  disk (even iSCSI)  and using SBD (a.k.a. Poison pill) -> in case
> your hardware has  no watchdog,  you can use  softdog  kernel module  for
> linux.
>

Although softdog may not be reliable in all circumstances.

Best  Regards,
> Strahil Nikolov
>
> >On July 29, 2020 at 9:01:22 GMT+03:00, Gabriele Bulfon
> >wrote:
> >That one was taken from a specific implementation on Solaris 11.
> >The situation is a dual node server with shared storage controller:
> >both nodes see the same disks concurrently.
> >Here we must be sure that the two nodes are not going to import/mount
> >the same zpool at the same time, or we will encounter data corruption:
> >node 1 will be preferred for pool 1, node 2 for pool 2; only in case
> >one of the nodes goes down or is taken offline should the resources be
> >first freed by the leaving node and taken over by the other node.
> >
> >Would you suggest one of the available stonith in this case?
> >
> >Thanks!
> >Gabriele
> >
> >
> >
> >Sonicle S.r.l.
> >:
> >http://www.sonicle.com
> >Music:
> >http://www.gabrielebulfon.com
> >Quantum Mechanics :
> >http://www.cdbaby.com/cd/gabrielebulfon
>
> >--
> >Da: Strahil Nikolov
> >A: Cluster Labs - All topics related to open-source clustering welcomed
> >Gabriele Bulfon
> >Data: 29 luglio 2020 6.39.08 CEST
> >Oggetto: Re: [ClusterLabs] Antw: [EXT] Stonith failing
> >Do you have a reason not to use any stonith already available ?
> >Best Regards,
> >Strahil Nikolov
> >On July 28, 2020 at 13:26:52 GMT+03:00, Gabriele Bulfon
> >wrote:
> >Thanks, I attach here the script.
> >It basically runs ssh on the other node with no password (must be
> >preconfigured via authorization keys) with commands.
> >This was taken from a script by OpenIndiana (I think).
> >As it stated in the comments, we don't want to halt or boot via ssh,
> >only reboot.
> >Maybe this is the problem, we should at least have it shutdown when
> >asked for.
> >
> >Actually if I stop corosync in node 2, I don't want it to shutdown the
> >system but just let node 1 keep control of all resources.
> >Same if I just shutdown manually node 2,
> >node 1 should keep control of all resources and release them back on
> >reboot.
> >Instead, when I stopped corosync on node 2, log was showing the
> >attempt to stonith node 2: why?
> >
> >Thanks!
> >Gabriele
> >
> >
> >
> >Sonicle S.r.l.
> >:
> >http://www.sonicle.com
> >Music:
> >http://www.gabrielebulfon.com
> >Quantum Mechanics :
> >http://www.cdbaby.com/cd/gabrielebulfon
> >Da:
> >Reid Wahl
> >A:
> >Cluster Labs - All topics related to open-source clustering welcomed
> >Data:
> >28 luglio 2020 12.03.46 CEST
> >Oggetto:
> >Re: [ClusterLabs] Antw: [EXT] Stonith failing
> >Gabriele,
> >
> >"No route to host" is a somewhat generic error message when we can't
> >find anyone to fence the node. It doesn't mean there's necessarily a
> >network routing issue at fault; no need to focus on that error message.
> >
> >I agree with Ulrich about needing to know what the script does. But
> >based on your initial message, it sounds like your custom fence agent
> >returns 1 in response to "on" and "off" actions. Am I understanding
> >correctly? If so, why does it behave that way? Pacemaker is trying to
> >run a poweroff action based on the logs, so it needs your script to
> >support an off action.
> >On Tue, Jul 28, 2020 at 2:47 AM Ulrich Windl
> >ulrich.wi...@rz.uni-regensburg.de
> >wrote:
> >Gabriele Bulfon
> >gbul...@sonicle.com
> >schrieb am 28.07.2020 um 10:56 in
> >Nachricht
> >:
> >Hi, now I have my two nodes (xstha1 and xstha2) with IPs configured by
> >Corosync.
> >To check how stonith would work, I turned off Corosync service on
> >second
> >node.
> >First node try to attempt to stonith 2nd node and take care of its
> >resources, but this fails.
> >Stonith action is configured to run a custom script to run ssh
> >commands,
> >I think you should explain what that script does exactly.
> >[...]
> >____

Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Stonith failing

2020-07-30 Thread Reid Wahl
That appears to support IPMI, so fence_ipmilan is likely an option.
Further, it probably has a watchdog device. If so, then sbd is an option.

On Thu, Jul 30, 2020 at 2:00 AM Gabriele Bulfon  wrote:

> It is this system:
>
> https://www.supermicro.com/products/system/1u/1029/SYS-1029TP-DC0R.cfm
>
> it has a sas3 backplane with hotswap sas disks that are visible to both
> nodes at the same time.
>
> Gabriele
>
>
>
> *Sonicle S.r.l. *: http://www.sonicle.com
> *Music: *http://www.gabrielebulfon.com
> *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
>
>
>
>
> --
>
> Da: Ulrich Windl 
> A: users@clusterlabs.org
> Data: 29 luglio 2020 15.15.17 CEST
> Oggetto: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Stonith failing
>
> >>> Gabriele Bulfon  schrieb am 29.07.2020 um 14:18
> in
> Nachricht <479956351.444.1596025101064@www>:
> > Hi, it's a single controller, shared to both nodes, SM server.
>
> You mean external controller, like NAS or SAN? I thought you are talking
> about
> an internal controller like SCSI...
> I don't know what an "SM server" is.
>
> Regards,
> Ulrich
>
> >
> > Thanks!
> > Gabriele
> >
> >
> > Sonicle S.r.l.
> > :
> > http://www.sonicle.com
> > Music:
> > http://www.gabrielebulfon.com
> > Quantum Mechanics :
> > http://www.cdbaby.com/cd/gabrielebulfon
> >
>
> 
> > --
> > Da: Ulrich Windl
> > A: users@clusterlabs.org
> > Data: 29 luglio 2020 9.26.39 CEST
> > Oggetto: [ClusterLabs] Antw: Re: Antw: [EXT] Stonith failing
> > Gabriele Bulfon
> > schrieb am 29.07.2020 um 08:01 in
> > Nachricht
> > :
> > That one was taken from a specific implementation on Solaris 11.
> > The situation is a dual node server with shared storage controller: both
> > nodes see the same disks concurrently.
> > You mean you have a dual-controller setup (one controller on each node,
> both
> > connected to the same bus)? If so, use SBD!
> > Here we must be sure that the two nodes are not going to import/mount the
> > same zpool at the same time, or we will encounter data corruption: node 1
> > will be preferred for pool 1, node 2 for pool 2, only in case one of the
> > node
> > goes down or is taken offline the resources should be first free by the
> > leaving node and taken by the other node.
> > Would you suggest one of the available stonith in this case?
> > Thanks!
> > Gabriele
> > Sonicle S.r.l.
> > :
> > http://www.sonicle.com
> > Music:
> > http://www.gabrielebulfon.com
> > Quantum Mechanics :
> > http://www.cdbaby.com/cd/gabrielebulfon
> >
>
> 
> > --
> > Da: Strahil Nikolov
> > A: Cluster Labs - All topics related to open-source clustering welcomed
> > Gabriele Bulfon
> > Data: 29 luglio 2020 6.39.08 CEST
> > Oggetto: Re: [ClusterLabs] Antw: [EXT] Stonith failing
> > Do you have a reason not to use any stonith already available ?
> > Best Regards,
> > Strahil Nikolov
> > On July 28, 2020 at 13:26:52 GMT+03:00, Gabriele Bulfon
> > wrote:
> > Thanks, I attach here the script.
> > It basically runs ssh on the other node with no password (must be
> > preconfigured via authorization keys) with commands.
> > This was taken from a script by OpenIndiana (I think).
> > As it stated in the comments, we don't want to halt or boot via ssh,
> > only reboot.
> > Maybe this is the problem, we should at least have it shutdown when
> > asked for.
> > Actually if I stop corosync in node 2, I don't want it to shutdown the
> > system but just let node 1 keep control of all resources.
> > Same if I just shutdown manually node 2,
> > node 1 should keep control of all resources and release them back on
> > reboot.
> > Instead, when I stopped corosync on node 2, log was showing the
> > attempt to stonith node 2: why?
> > Thanks!
> > Gabriele
> > Sonicle S.r.l.
> > :
> > http://www.sonicle.com
> > Music:
> > http://www.gabrielebulfon.com
> > Quantum Mechanics :
> > http://www.cdbaby.com/cd/gabrielebulfon
> > Da:
> > Reid Wahl
> > A:
> > Cluster Labs - All topics related to open-source clustering welcomed
> > Data:
> > 28 luglio 2020 12.03.46 CEST
> > Oggetto:
> > Re: [ClusterLabs] Antw: [EXT] Stonith failing
> > 

Re: [ClusterLabs] Clear Pending Fencing Action

2020-08-02 Thread Reid Wahl
Hi, Илья. `stonith_admin --cleanup` doesn't get rid of pending actions,
only failed ones. You might be hitting
https://bugs.clusterlabs.org/show_bug.cgi?id=5401.

I believe a simultaneous reboot of both nodes will clear the pending
actions. I don't recall whether there's any other way to clear them.

On Sun, Aug 2, 2020 at 8:26 PM Илья Насонов  wrote:

> Hello!
>
>
>
> After troubleshooting a 2-node cluster, crm_mon still displays outdated
> actions in the “Pending Fencing Actions” list.
>
> How can I delete them?
>
> «stonith_admin --cleanup --history=*» does not delete them.
>
>
>
>
>
> Best regards,
> Илья Насонов
> el...@po-mayak.ru
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker Shutdown

2020-07-22 Thread Reid Wahl
Thanks for the clarification. As far as I'm aware, there's no way to do
this at the Pacemaker level during a Pacemaker shutdown. It would require
uncleanly killing all resources, which doesn't make sense at the Pacemaker
level.

Pacemaker only knows how to stop a resource by running the resource agent's
stop operation. Even if Pacemaker wanted to kill a resource uncleanly for
speed, the way to do so for each resource would depend on the type of
resource. For example, an IPaddr2 resource doesn't represent a running
process that can be killed; `ip addr del` would be necessary.

If we went the route of killing the Pacemaker daemon entirely, rather than
relying on it to stop resources, then that wouldn't guarantee the node has
stopped using the actual resources before the failover node tries to take
over. For example, for a Filesystem, the FS could still be mounted after
Pacemaker is killed.

The only ways to know with certainty that node 1 has stopped using cluster
resources so that node 2 can safely take them over are:

   1. gracefully stop them, or
   2. fence/reboot node 1

With that being said, if you don't mind node 1 being fenced to initiate a
faster failover, then you could fence it from node 2.
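
For example (a sketch; "node1" is a placeholder for the node name as the
cluster knows it), run from node 2:

  # pcs stonith fence node1          # if pcs is available
  # stonith_admin --reboot node1     # plain Pacemaker tooling

Either one fences node 1 immediately, and the survivor takes over the
resources without waiting for a graceful stop on node 1.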

Others on the list may think of something I haven't considered here.

On Wed, Jul 22, 2020 at 2:43 PM Harvey Shepherd <
harvey.sheph...@aviatnet.com> wrote:

> Thanks for your response Reid. What you say makes sense, and under normal
> circumstances if a resource failed, I'd want all of its dependents to be
> stopped cleanly before restarting the failed resource. However if pacemaker
> is shutting down on a node (e.g. due to a restart request), then I just
> want to failover as fast as possible, so an unclean kill is fine. At the
> moment the shutdown process is taking 2 mins. I was just wondering if there
> was a way to do this.
>
> Regards,
> Harvey
>
> ------
> *From:* Users  on behalf of Reid Wahl <
> nw...@redhat.com>
> *Sent:* 23 July 2020 08:05
> *To:* Cluster Labs - All topics related to open-source clustering
> welcomed 
> *Subject:* EXTERNAL: Re: [ClusterLabs] Pacemaker Shutdown
>
>
> On Tue, Jul 21, 2020 at 11:42 PM Harvey Shepherd <
> harvey.sheph...@aviatnet.com> wrote:
>
> Hi All,
>
> I'm running Pacemaker 2.0.3 on a two-node cluster, controlling 40+
> resources which are a mixture of clones and other resources that are
> colocated with the master instance of certain clones. I've noticed that if
> I terminate pacemaker on the node that is hosting the master instances of
> the clones, Pacemaker focuses on stopping resources on that node BEFORE
> failing over to the other node, leading to a longer outage than necessary.
> Is there a way to change this behaviour?
>
>
> Hi, Harvey.
>
> As you likely know, a given resource active/passive resource will have to
> stop on one node before it can start on another node, and the same goes for
> a promoted clone instance having to demote on one node before it can
> promote on another. There are exceptions for clone instances and for
> promotable clones with promoted-max > 1 ("allow more than one master
> instance"). A resource that's configured to run on one node at a time
> should not try to run on two nodes during failover.
>
> With that in mind, what exactly are you wanting to happen? Is the problem
> that all resources are stopping on node 1 before *any* of them start on
> node 2? Or that you want Pacemaker shutdown to kill the processes on node 1
> instead of cleanly shutting them down? Or something different?
>
> These are the actions and logs I saw during the test:
>
>
> Ack. This seems like it's just telling us that Pacemaker is going through
> a graceful shutdown. The info more relevant to the resource stop/start
> order would be in /var/log/pacemaker/pacemaker.log (or less detailed in
> /var/log/messages) on the DC.
>
> # /etc/init.d/pacemaker stop
> Signaling Pacemaker Cluster Manager to terminate
>
> Waiting for cluster services to
> unload..sending
> signal 9 to procs
>
>
> 2020 Jul 22 06:16:50.581 Chassis2 daemon.notice CTR8740 pacemaker.
> Signaling Pacemaker Cluster Manager to terminate
> 2020 Jul 22 06:16:50.599 Chassis2 daemon.notice CTR8740 pacemaker. Waiting
> for cluster services to unload
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: new_event_notification (6140-6141-9): Broken
> pipe (32)
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: Notification of client
> stonithd/665bde82-cb28-40f7-9132-8321dc2f1992 failed
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warnin

[ClusterLabs] Custom resource agent

2020-07-17 Thread Reid Wahl
Based on my understanding of the question, your best options would be

   - Create a resource agent that's designed to be run as a promotable clone
   (the ocf:pacemaker:Stateful resource agent
   <https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/Stateful>
   is a simple example), OR
   - Create two separate resource agents and configure a resource for each.
   Then configure a colocation constraint
   
<https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#s-resource-colocation>
   with a negative score, so that they cannot run on the same node
   simultaneously.
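
A sketch of the second option with pcs (using the hypothetical
service-master/service-slave names from the thread below):

# pcs constraint colocation add service-slave with service-master -INFINITY

With a -INFINITY score the two resources can never run on the same node,
which gives the "opposite services" behavior described below.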


> Hi Oyvind,
> thank you for reply!
>
> Iinteresting, but not simple...
> I did some tests with ocf_heartbeat_anything and the simple start and stop
> seems to work correctly.
> Would it be possible to create two opposite "services" with this ocf?
>
> Example:
>
> NODE1 (Master)   service-master (START) & service-slave (STOP)
> NODE2 (Slave) service-master (STOP) & service-slave (START)
>
> "Service-master" puts down "service-slave" and does the opposite in the
> second node.
>
> Regards
> Sim
>
> Il giorno mar 14 lug 2020 alle ore 10:41 Oyvind Albrigtsen <
> oalbrigt at redhat.com> ha scritto:
>
> > You should be able to make your custom agent by following this doc:
> >
> >
https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc
> >
> > Oyvind
> >
> > On 13/07/20 10:08 +0200, Sim wrote:
> > >Hi,
> > >I need to create a cluster with these characteristics:
> > >
> > >NODE1 (Master)
> > >NODE2 (Slave)
> > >
> > >Example sequence to moving the role from NODE1 to NODE2:
> > >
> > >- NODE1: stopped a process with "systemctl stop"
> > >- NODE1: executed a script with parameter "slave"
> > >- NODE1: executed again the process with "systemctl start"
> > >- NODE2: stopped a process with "systemctl stop"
> > >- NODE2: executed a script "master"
> > >- NODE2: executed again the process with "systemctl start"
> > >
> > >I only found ocf_heartbeat_anything but I don't know if it's right for
me.
> > >Any suggestions?
> > >
> > >Regards
> > >Sim
> >
> > >___
> > >Manage your subscription:
> > >https://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > >ClusterLabs home: https://www.clusterlabs.org/
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> >

-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Still Beginner STONITH Problem

2020-07-17 Thread Reid Wahl
> >>>>>>>>>> Timed out waiting for response
> >>>>>>>>>> Operation failed
> >>>>>>>>>>
> >>>>>>>>>> I am a bit confused by that, because all we did was running
> >>>>>>> commands
> >>>>>>>>>> like "sysctl -w net.ipv4.conf.all.force_igmp_version =" with
> >the
> >>>>>>>>>> different Version umbers and #cat /proc/net/igmp shows that
> >V3 is
> >>>>>>>>> used
> >>>>>>>>>> again on every device just like before...?!
> >>>>>>>>>>
> >>>>>>>>>> kind regards
> >>>>>>>>>> Stefan Schmitz
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On July 14, 2020 at 11:06:42 GMT+03:00,
> >>>>>>>>>>> "stefan.schm...@farmpartner-tec.com"
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>> Hello,
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Am 09.07.2020 um 19:10 Strahil Nikolov wrote:
> >>>>>>>>>>>>> Have  you  run 'fence_virtd  -c' ?
> >>>>>>>>>>>> Yes I had run that on both Hosts. The current config looks
> >like
> >>>>>>>>> that
> >>>>>>>>>>>> and
> >>>>>>>>>>>> is identical on both.
> >>>>>>>>>>>>
> >>>>>>>>>>>> cat fence_virt.conf
> >>>>>>>>>>>> fence_virtd {
> >>>>>>>>>>>> listener = "multicast";
> >>>>>>>>>>>> backend = "libvirt";
> >>>>>>>>>>>> module_path = "/usr/lib64/fence-virt";
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> listeners {
> >>>>>>>>>>>> multicast {
> >>>>>>>>>>>> key_file =
> >"/etc/cluster/fence_xvm.key";
> >>>>>>>>>>>> address = "225.0.0.12";
> >>>>>>>>>>>> interface = "bond0";
> >>>>>>>>>>>> family = "ipv4";
> >>>>>>>>>>>> port = "1229";
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> backends {
> >>>>>>>>>>>> libvirt {
> >>>>>>>>>>>> uri = "qemu:///system";
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> The situation is still that no matter on what host I issue
> >the
> >>>>>>>>>>>> "fence_xvm -a 225.0.0.12 -o list" command, both guest
> >systems
> >>>>>>>>> receive
> >>>>>>>>>>>> the traffic. The local guest, but also the guest on the
> >other
> >>>>>>> host.
> >>>>>>>>> I
> >>>>>>>>>>>> reckon that means the traffic is not filtered by any
> >network
> >>>>>>>>> device,
> >>>>>>>>>>>> like switches or firewalls. Since the guest on the other
> >host
> >>>>>>>>> receives
> >>>>>>>>>>>> the packages, the traffic must reach te physical server and
> >>>>>>>>>>>> networkdevice and is then routed to the VM on that host.
> >>>>>>>>>>>> But still, the traffic is not shown on the host itself.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Further the local firewalls on both hosts are set to let
> >each
> >>>>>>>>>>>> and
> >>>>>>>>> every
> >>>>>>>>>>>> traffic pass. Accept to any and everything. Well at least
> >as far
> >>>>>>> as
> >>>>>>>>> I
> >>>>>>>>>>>> can see.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Am 09.07.2020 um 22:34 Klaus Wenninger wrote:
> >>>>>>>>>>>>> makes me believe that
> >>>>>>>>>>>>> the whole setup doesn't lookas I would have
> >>>>>>>>>>>>> expected (bridges on each host where theguest
> >>>>>>>>>>>>> has a connection to and where ethernet interfaces
> >>>>>>>>>>>>> that connect the 2 hosts are part of as well
> >>>>>>>>>>>> On each physical server the networkcards are bonded to
> >achieve
> >>>>>>>>> failure
> >>>>>>>>>>>> safety (bond0). The guest are connected over a bridge(br0)
> >but
> >>>>>>>>>>>> apparently our virtualization softrware creates an own
> >device
> >>>>>>> named
> >>>>>>>>>>>> after the guest (kvm101.0).
> >>>>>>>>>>>> There is no direct connection between the servers, but as I
> >said
> >>>>>>>>>>>> earlier, the multicast traffic does reach the VMs so I
> >assume
> >>>>>>> there
> >>>>>>>>> is
> >>>>>>>>>>>> no problem with that.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Am 09.07.2020 um 20:18 Vladislav Bogdanov wrote:
> >>>>>>>>>>>>> First, you need to ensure that your switch (or all
> >switches in
> >>>>>>> the
> >>>>>>>>>>>>> path) have igmp snooping enabled on host ports (and
> >probably
> >>>>>>>>>>>>> interconnects along the path between your hosts).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Second, you need an igmp querier to be enabled somewhere
> >near
> >>>>>>>>> (better
> >>>>>>>>>>>>> to have it enabled on a switch itself). Please verify that
> >you
> >>>>>>> see
> >>>>>>>>>>>> its
> >>>>>>>>>>>>> queries on hosts.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Next, you probably need to make your hosts to use IGMPv2
> >>>>>>>>>>>>> (not 3)
> >>>>>>>>> as
> >>>>>>>>>>>>> many switches still can not understand v3. This is doable
> >by
> >>>>>>>>> sysctl,
> >>>>>>>>>>>>> find on internet, there are many articles.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I have send an query to our Data center Techs who are
> >analyzing
> >>>>>>>>> this
> >>>>>>>>>>>> and
> >>>>>>>>>>>> were already on it analyzing if multicast Traffic is
> >somewhere
> >>>>>>>>> blocked
> >>>>>>>>>>>> or hindered. So far the answer is, "multicast ist explictly
> >>>>>>> allowed
> >>>>>>>>> in
> >>>>>>>>>>>> the local network and no packets are filtered or dropped".
> >I am
> >>>>>>>>> still
> >>>>>>>>>>>> waiting for a final report though.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In the meantime I have switched IGMPv3 to IGMPv2 on every
> >>>>>>> involved
> >>>>>>>>>>>> server, hosts and guests via the mentioned sysctl. The
> >switching
> >>>>>>>>> itself
> >>>>>>>>>>>> was successful, according to "cat /proc/net/igmp" but sadly
> >did
> >>>>>>> not
> >>>>>>>>>>>> better the behavior. It actually led to that no VM received
> >the
> >>>>>>>>>>>> multicast traffic anymore too.
> >>>>>>>>>>>>
> >>>>>>>>>>>> kind regards
> >>>>>>>>>>>> Stefan Schmitz
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Am 09.07.2020 um 22:34 schrieb Klaus Wenninger:
> >>>>>>>>>>>>> On 7/9/20 5:17 PM, stefan.schm...@farmpartner-tec.com
> >wrote:
> >>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Well, theory still holds I would say.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I guess that the multicast-traffic from the other host
> >>>>>>>>>>>>>>> or the guestsdoesn't get to the daemon on the host.
> >>>>>>>>>>>>>>> Can't you just simply check if there are any firewall
> >>>>>>>>>>>>>>> rules configuredon the host kernel?
> >>>>>>>>>>>>>> I hope I did understand you corretcly and you are
> >referring to
> >>>>>>>>>>>> iptables?
> >>>>>>>>>>>>> I didn't say iptables because it might have been
> >>>>>>>>>>>>> nftables - but yesthat is what I was referring to.
> >>>>>>>>>>>>> Guess to understand the config the output is
> >>>>>>>>>>>>> lacking verbositybut it makes me believe that
> >>>>>>>>>>>>> the whole setup doesn't lookas I would have
> >>>>>>>>>>>>> expected (bridges on each host where theguest
> >>>>>>>>>>>>> has a connection to and where ethernet interfaces
> >>>>>>>>>>>>> that connect the 2 hosts are part of as well -
> >>>>>>>>>>>>> everythingconnected via layer 2 basically).
> >>>>>>>>>>>>>> Here is the output of the current rules. Besides the IP
> >of the
> >>>>>>>>> guest
> >>>>>>>>>>>>>> the output is identical on both hosts:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> # iptables -S
> >>>>>>>>>>>>>> -P INPUT ACCEPT
> >>>>>>>>>>>>>> -P FORWARD ACCEPT
> >>>>>>>>>>>>>> -P OUTPUT ACCEPT
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> # iptables -L
> >>>>>>>>>>>>>> Chain INPUT (policy ACCEPT)
> >>>>>>>>>>>>>> target prot opt source   destination
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Chain FORWARD (policy ACCEPT)
> >>>>>>>>>>>>>> target prot opt source   destination
> >>>>>>>>>>>>>> SOLUSVM_TRAFFIC_IN  all  --  anywhere
> >anywhere
> >>>>>>>>>>>>>> SOLUSVM_TRAFFIC_OUT  all  --  anywhere
> >anywhere
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Chain OUTPUT (policy ACCEPT)
> >>>>>>>>>>>>>> target prot opt source   destination
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Chain SOLUSVM_TRAFFIC_IN (1 references)
> >>>>>>>>>>>>>> target prot opt source   destination
> >>>>>>>>>>>>>> all  --  anywhere
> >192.168.1.14
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Chain SOLUSVM_TRAFFIC_OUT (1 references)
> >>>>>>>>>>>>>> target prot opt source   destination
> >>>>>>>>>>>>>> all  --  192.168.1.14 anywhere
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> kind regards
> >>>>>>>>>>>>>> Stefan Schmitz
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>> ___
> >>>>>> Manage your subscription:
> >>>>>> https://lists.clusterlabs.org/mailman/listinfo/users
> >>>>>>
> >>>>>> ClusterLabs home: https://www.clusterlabs.org/
> >>>>>
> >>>>
> >>>
> >> ___
> >> Manage your subscription:
> >> https://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> ClusterLabs home: https://www.clusterlabs.org/
> >___
> >Manage your subscription:
> >https://lists.clusterlabs.org/mailman/listinfo/users
> >
> >ClusterLabs home: https://www.clusterlabs.org/
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Fence agent for Lenovo ThinkSystem SR630

2020-07-17 Thread Reid Wahl
The ThinkSystem SR630 XClarity Controller appears to support IPMI. So
fence_ipmilan would be a good choice to try.

Reference:
  - Lenovo ThinkSystem SR630 Server (Xeon SP Gen 1)
<https://lenovopress.com/lp0643-thinksystem-sr630-server-xeon-sp-gen1>
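
A first smoke test against the XClarity Controller could look like this (a
sketch; the address, credentials, and the need for --lanplus are
assumptions to verify for your setup):

  # fence_ipmilan --ip=<xcc-address> --username=<user> --password=<pass> \
        --lanplus --action=status

If that reports the power status correctly, the same parameters should work
for the stonith resource.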

-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker startup problem

2020-07-26 Thread Reid Wahl
e CIB service: Transport endpoint is not connected
> > Jul 24 18:21:39 [971] crmd: warning: do_cib_control: Couldn't
> > complete CIB registration 3 times... pause and retry
> > Jul 24 18:21:41 [971] crmd: info: crm_timer_popped: Wait Timer
> > (I_NULL) just popped (2000ms)
> > Jul 24 18:21:42 [971] crmd: info: do_cib_control: Could not connect
> > to the CIB service: Transport endpoint is not connected
> > Jul 24 18:21:42 [971] crmd: warning: do_cib_control: Couldn't
> > complete CIB registration 4 times... pause and retry
> > Jul 24 18:21:42 [968] stonith-ng: error: setup_cib: Could not connect
> > to the CIB service: Transport endpoint is not connected (-134)
> > Jul 24 18:21:42 [968] stonith-ng: error: mainloop_add_ipc_server:
> > Could not start stonith-ng IPC server: Operation not supported (-48)
> > Jul 24 18:21:42 [968] stonith-ng: error: stonith_ipc_server_init:
> > Failed to create stonith-ng servers: exiting and inhibiting respawn.
> > Jul 24 18:21:42 [968] stonith-ng: warning: stonith_ipc_server_init:
> > Verify pacemaker and pacemaker_remote are not both enabled.
> >
> > Any idea what's happening?
> > Gabriele
> >
> >
> >
> >
> > Sonicle S.r.l. : http://www.sonicle.com
> > Music: http://www.gabrielebulfon.com
> > Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> --
> Ken Gaillot 
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker startup problem

2020-07-26 Thread Reid Wahl
Illumos might have getpeerucred, which can also set errno to ENOTSUP.

On Sun, Jul 26, 2020 at 3:25 AM Reid Wahl  wrote:

> Hmm. If it's reading PCMK_ipc_type and matching the server type to
> QB_IPC_SOCKET, then the only other place I see it could be coming from is
> qb_ipc_auth_creds.
>
> qb_ipcs_run -> qb_ipcs_us_publish -> qb_ipcs_us_connection_acceptor ->
> qb_ipcs_uc_recv_and_auth -> process_auth -> qb_ipc_auth_creds ->
>
> static int32_t
> qb_ipc_auth_creds(struct ipc_auth_data *data)
> {
> ...
> #ifdef HAVE_GETPEERUCRED
> /*
>  * Solaris and some BSD systems
> ...
> #elif defined(HAVE_GETPEEREID)
> /*
> * Usually MacOSX systems
> ...
> #elif defined(SO_PASSCRED)
> /*
> * Usually Linux systems
> ...
> #else /* no credentials */
> data->ugp.pid = 0;
> data->ugp.uid = 0;
> data->ugp.gid = 0;
> res = -ENOTSUP;
> #endif /* no credentials */
>
> return res;
>
> I'll leave it to Ken to say whether that's likely and what it implies if
> so.
>
> On Sun, Jul 26, 2020 at 2:53 AM Gabriele Bulfon 
> wrote:
>
>> Sorry, actually the problem is not gone yet.
>> Now corosync and pacemaker are running happily, but those IPC errors are
>> coming out of heartbeat and crmd as soon as I start it.
>> The pacemakerd process has PCMK_ipc_type=socket, what's wrong with
>> heartbeat or crmd?
>>
>> Here's the env of the process:
>>
>> sonicle@xstorage1:/sonicle/etc/cluster/ha.d# penv 4222
>> 4222: /usr/sbin/pacemakerd
>> envp[0]: PCMK_respawned=true
>> envp[1]: PCMK_watchdog=false
>> envp[2]: HA_LOGFACILITY=none
>> envp[3]: HA_logfacility=none
>> envp[4]: PCMK_logfacility=none
>> envp[5]: HA_logfile=/sonicle/var/log/cluster/corosync.log
>> envp[6]: PCMK_logfile=/sonicle/var/log/cluster/corosync.log
>> envp[7]: HA_debug=0
>> envp[8]: PCMK_debug=0
>> envp[9]: HA_quorum_type=corosync
>> envp[10]: PCMK_quorum_type=corosync
>> envp[11]: HA_cluster_type=corosync
>> envp[12]: PCMK_cluster_type=corosync
>> envp[13]: HA_use_logd=off
>> envp[14]: PCMK_use_logd=off
>> envp[15]: HA_mcp=true
>> envp[16]: PCMK_mcp=true
>> envp[17]: HA_LOGD=no
>> envp[18]: LC_ALL=C
>> envp[19]: PCMK_service=pacemakerd
>> envp[20]: PCMK_ipc_type=socket
>> envp[21]: SMF_ZONENAME=global
>> envp[22]: PWD=/
>> envp[23]: SMF_FMRI=svc:/sonicle/xstream/cluster/pacemaker:default
>> envp[24]: _=/usr/sbin/pacemakerd
>> envp[25]: TZ=Europe/Rome
>> envp[26]: LANG=en_US.UTF-8
>> envp[27]: SMF_METHOD=start
>> envp[28]: SHLVL=2
>> envp[29]: PATH=/usr/sbin:/usr/bin
>> envp[30]: SMF_RESTARTER=svc:/system/svc/restarter:default
>> envp[31]: A__z="*SHLVL
>>
>>
>> Here are crmd complaints:
>>
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
>> Node xstorage1 state is now member
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Could not start crmd IPC server: Operation not supported (-48)
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Failed to create IPC server: shutting down and inhibiting respawn
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
>> The local CRM is operational
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Input I_ERROR received in state S_STARTING from do_started
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
>> State transition S_STARTING -> S_RECOVERY
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning:
>> Fast-tracking shutdown in response to errors
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning:
>> Input I_PENDING received in state S_RECOVERY from do_started
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Input I_TERMINATE received in state S_RECOVERY from do_recover
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
>> Disconnected from the LRM
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Child process pengine exited (pid=4316, rc=100)
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Could not recover from internal error
>> Jul 26 11:39:07 xstorage1 heartbeat: [ID 996084 daemon.warning] [4275]:
>> WARN: Managed /usr/libexec/pacemaker/crmd process 4315 exited with return
>> code 201.
>>
>>
>>
>>
>> *Sonicle S.r.l. *: http://www.sonicle.com

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-28 Thread Reid Wahl
Gabriele,

"No route to host" is a somewhat generic error message when we can't find
anyone to fence the node. It doesn't mean there's necessarily a network
routing issue at fault; no need to focus on that error message.

I agree with Ulrich about needing to know what the script does. But based
on your initial message, it sounds like your custom fence agent returns 1
in response to "on" and "off" actions. Am I understanding correctly? If so,
why does it behave that way? Pacemaker is trying to run a poweroff action
based on the logs, so it needs your script to support an off action.

On Tue, Jul 28, 2020 at 2:47 AM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Gabriele Bulfon  schrieb am 28.07.2020 um 10:56
> in
> Nachricht <1330096936.11468.1595926619455@www>:
> > Hi, now I have my two nodes (xstha1 and xstha2) with IPs configured by
> > Corosync.
> > To check how stonith would work, I turned off Corosync service on second
> > node.
> > First node try to attempt to stonith 2nd node and take care of its
> > resources, but this fails.
> > Stonith action is configured to run a custom script to run ssh commands,
>
> I think you should explain what that script does exactly.
>
> [...]
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>

-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem

2020-07-28 Thread Reid Wahl
On Tue, Jul 28, 2020 at 2:44 AM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Reid Wahl  schrieb am 28.07.2020 um 10:21 in
> Nachricht
> :
> > On Tuesday, July 28, 2020, Ulrich Windl <
> ulrich.wi...@rz.uni-regensburg.de>
> > wrote:
> >>>>> Gabriele Bulfon  schrieb am 28.07.2020 um
> 09:35 in
> >> Nachricht <1046247888.11369.1595921749049@www>:
> >>> Thanks, I patched all the scripts in build to have "#!/bin/bash" in
> > head, and
> >>> I receive no errors now.
> >>
> >> If it's needed, those scripts were buggy anyway.
> >
> > How does that mean the script is buggy? It would depend on what /bin/sh
> is
> > linked to on a particular system.
>
> /bin/sh may be a minimal Bourne-compatible shell. Assuming that it is
> bash-compatible is a bad idea.
> And: Is it really a problem to require bash? I mean there are scripts that
> require csh (yuk!)
>

Some resource agents do require bash (e.g., sybaseASE).

Re: csh -- there are? Gross ;)

Resource agents that have been around a while typically specify /bin/sh,
and it's considered a regression to introduce new non-portable syntax.
Syntax that was never 100% portable in the first place is a somewhat
different matter. Ideally it would have been totally portable in the first
place, though in practice, it apparently hasn't been. The main constraint
going forward is not to make it LESS portable.

"It is considered a regression to introduce a patch that will make a
previously sh compatible resource agent suitable only for bash, ksh, or any
other non-generic shell. It is, however, perfectly acceptable for a new
resource agent to explicitly define a specific shell, such as /bin/bash, as
its interpreter."
  -
https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc

I discussed a little bit with one of the devs, who said the sh portability
testing has mostly been with dash on Debian. (BTW, by default,
`checkbashisms` allows things that aren't POSIX but that are specified by
Debian Policy.)
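If anyone wants to run the same kind of check locally, something along these
lines works (the paths are the usual resource-agents install locations and may
differ on other builds; --posix makes checkbashisms flag the Debian-Policy
extensions as well):

    checkbashisms --posix /usr/lib/ocf/resource.d/heartbeat/IPaddr
    dash -n /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs   # parse-only check with dash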


> >
> >>There is a "checkbashisms" program (in SLES at least) that can check
> > whether a shell script actually needs BASH (or compatible)...
> >
> > That's available for RHEL too. The question is whether it's worth
> modifying
> > libraries that are core to resource-agents, essentially for compatibility
> > with vanilla ksh. The conservative answer is no, although there is a case
>
> I never used ksh, but AFAIR the only advantage of ksh was vi-compatible
> command-line editing (which bash can do at least as well).
> Is there any good reason to require ksh for scripts?
>

I doubt it. But the only place we require ksh is for one `su` command in
the sybaseASE agent. To make sure we're on the same page: this wasn't a
case of the RA specifying ksh, but rather of Gabriele's default shell being
ksh and resolving sh to ksh.


> > to be made in favor of the change. The local keyword has been there for
> > years. A lot of shells besides bash support the local keyword, and even
> if
> > ksh is the default shell on a user's system, they can likely use a
> > different one if needed, as Gabriele has done.
> >>
> >>> Though, the IP is not configured :( I'm looking at it...
> >>> Is there any easy way to debug what's doing on the IP script?
> >>>
> >>> Gabriele
> >>>
> >>>
> >>> Sonicle S.r.l.
> >>> :
> >>> http://www.sonicle.com
> >>> Music:
> >>> http://www.gabrielebulfon.com
> >>> Quantum Mechanics :
> >>> http://www.cdbaby.com/cd/gabrielebulfon
> >>>
> >
> 
> >>> --
> >>> Da: Ulrich Windl
> >>> A: users@clusterlabs.org
> >>> Data: 28 luglio 2020 9.12.41 CEST
> >>> Oggetto: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem
> >>> You could try replacing "local" with "typeset", also.
> >>> Reid Wahl
> >>> schrieb am 28.07.2020 um 09:05 in Nachricht
> >>> :
> >>> By the way, it doesn't necessarily have to be bash. Upon looking
> > further, a
> >>> lot of shells support the `local` keyword, even though it's not
> required
> > by
> >>> the POSIX standard. Plain ksh, however, does not :(
> >>> On Monday, July 27, 2020, Reid Wahl
> >>> wrote:
> >>> Hi, Gabriele. The `local` keyword is a bash built-in and not available
> in
> >>> some other shells (e.g., ksh). It's use

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Reid Wahl
"As it stated in the comments, we don't want to halt or boot via ssh, only
reboot."

Generally speaking, a stonith reboot action consists of the following basic
sequence of events:

   1. Execute the fence agent with the "off" action.
   2. Poll the power status of the fenced node until it is powered off.
   3. Execute the fence agent with the "on" action.
   4. Poll the power status of the fenced node until it is powered on.

So a custom fence agent that supports reboots actually needs to support off
and on actions.
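To illustrate, here is a minimal sketch of the action dispatch such a script
needs. It is not a complete agent: the power_* helpers are stub placeholders
for whatever real out-of-band mechanism (IPMI, managed PDU, ...) you use, and
the fencer passes its arguments as key=value lines on stdin:

#!/bin/sh
# Sketch only -- replace the stubs with real out-of-band power control.
power_off()    { echo "TODO: power off node '$1'"; }
power_on()     { echo "TODO: power on node '$1'"; }
power_status() { echo "TODO: query power state of node '$1'"; }

action="" node=""
while read line; do                  # the fencer sends key=value pairs on stdin
    case "$line" in
        action=*) action=${line#action=} ;;
        port=*)   node=${line#port=} ;;
    esac
done

case "$action" in
    off)            power_off "$node" ;;
    on)             power_on "$node" ;;
    reboot)         power_off "$node" && power_on "$node" ;;
    status|monitor) power_status "$node" ;;
    *)              exit 1 ;;        # a real agent must also handle metadata, list, ...
esac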


As Andrei noted, ssh is **not** a reliable method by which to ensure a node
gets rebooted or stops using cluster-managed resources. You can't depend on
the ability to SSH to an unhealthy node that needs to be fenced.

The only way to guarantee that an unhealthy or unresponsive node stops all
access to shared resources is to power off or reboot the node. (In the case
of resources that rely on shared storage, I/O fencing instead of power
fencing can also work, but that's not ideal.)

As others have said, SBD is a great option. Use it if you can. There are
also power fencing methods (one example is fence_ipmilan, but the options
available depend on your hardware or virt platform) that are reliable under
most circumstances.

You said that when you stop corosync on node 2, Pacemaker tries to fence
node 2. There are a couple of possible reasons for that. One possibility is
that you stopped or killed corosync without stopping Pacemaker first. (If
you use pcs, then try `pcs cluster stop`.) Another possibility is that
resources failed to stop during cluster shutdown on node 2, causing node 2
to be fenced.

On Wed, Jul 29, 2020 at 12:47 AM Andrei Borzenkov 
wrote:

>
>
> On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon 
> wrote:
>
>> That one was taken from a specific implementation on Solaris 11.
>> The situation is a dual node server with shared storage controller: both
>> nodes see the same disks concurrently.
>> Here we must be sure that the two nodes are not going to import/mount the
>> same zpool at the same time, or we will encounter data corruption:
>>
>
> ssh based "stonith" cannot guarantee it.
>
>
>
>> node 1 will be perferred for pool 1, node 2 for pool 2, only in case one
>> of the node goes down or is taken offline the resources should be first
>> free by the leaving node and taken by the other node.
>>
>> Would you suggest one of the available stonith in this case?
>>
>>
>
> IPMI, managed PDU, SBD ...
>
> In practice, the only stonith method that works in case of complete node
> outage including any power supply is SBD.
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker crashed and produce a coredump file

2020-07-29 Thread Reid Wahl
7f6eeb5c5e81 in systemd_loadunit_result 
> (reply=reply@entry=0x139f2a0,
> op=op@entry=0x13adce0) at systemd.c:175
> 175 systemd_unit_exec_with_unit(op, path);
> (gdb) up
> #5  0x7f6eeb5c6181 in systemd_loadunit_cb (pending=0x13aa380,
> user_data=0x13adce0) at systemd.c:197
> 197 systemd_loadunit_result(reply, user_data);
> (gdb) up
> #6  0x7f6eeb16f862 in complete_pending_call_and_unlock () from
> /lib64/libdbus-1.so.3
> (gdb) up
> #7  0x7f6eeb172b51 in dbus_connection_dispatch () from
> /lib64/libdbus-1.so.3
> (gdb) up
> #8  0x7f6eeb5c1e40 in pcmk_dbus_connection_dispatch
> (connection=0x13a4cb0, new_status=DBUS_DISPATCH_DATA_REMAINS, data=0x0) at
> dbus.c:388
> 388 dbus_connection_dispatch(connection);
> (gdb) up
> #9  0x7f6eeb171260 in
> _dbus_connection_update_dispatch_status_and_unlock () from
> /lib64/libdbus-1.so.3
> (gdb) up
> #10 0x7f6eeb172a93 in reply_handler_timeout () from
> /lib64/libdbus-1.so.3
> (gdb) up
> #11 0x7f6eeb5c1daf in pcmk_dbus_timeout_dispatch (data=0x13aa660) at
> dbus.c:491
> 491 dbus_timeout_handle(data);
> (gdb) up
> #12 0x7f6ee97a21c3 in g_timeout_dispatch () from
> /lib64/libglib-2.0.so.0
> (gdb) up
> #13 0x7f6ee97a17aa in g_main_context_dispatch () from
> /lib64/libglib-2.0.so.0
> (gdb) up
> #14 0x7f6ee97a1af8 in g_main_context_iterate.isra.24 () from
> /lib64/libglib-2.0.so.0
> (gdb) up
> #15 0x7f6ee97a1dca in g_main_loop_run () from /lib64/libglib-2.0.so.0
> (gdb) up
> #16 0x00402824 in main (argc=, argv=0x7ffce752b258)
> at main.c:344
> 344 g_main_run(mainloop);
> (gdb) r
> Starting program: /usr/libexec/pacemaker/lrmd
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> [Inferior 1 (process 3889819) exited with code 0144]
> ```
>
> From the backtrace, I found that the assert failed in the function
> "systemd_unit_exec_with_unit", because the parameter "path" is "0x0".
> I don't quite understand what may lead to the failure of this assert. Is
> it a bug or a configuration problem?
>
>
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Stonith failing

2020-07-29 Thread Reid Wahl
On Wed, Jul 29, 2020 at 2:48 AM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Reid Wahl  schrieb am 29.07.2020 um 11:39 in
> Nachricht
> :
> > "As it stated in the comments, we don't want to halt or boot via ssh,
> only
> > reboot."
> >
> > Generally speaking, a stonith reboot action consists of the following
> basic
> > sequence of events:
> >
> >1. Execute the fence agent with the "off" action.
> >2. Poll the power status of the fenced node until it is powered off.
> >3. Execute the fence agent with the "on" action.
> >4. Poll the power status of the fenced node until it is powered on.
> >
> > So a custom fence agent that supports reboots, actually needs to support
> > off and on actions.
>
> Are you sure? Sbd can do "off" action, but when the node is off, it cannot
> perform an "on" action. So either you can use "off" and the node will
> remain off, or you use "reboot" and the node will be reset (and come up
> again, hopefully).
>

I'm referring to conventional power fencing agents. Sorry for not
clarifying. Conventional power fencing (e.g., fence_ipmilan and
fence_vmware_soap) is most of what I see deployed on a daily basis.


> >
> >
> > As Andrei noted, ssh is **not** a reliable method by which to ensure a
> node
> > gets rebooted or stops using cluster-managed resources. You can't depend
> on
> > the ability to SSH to an unhealthy node that needs to be fenced.
> >
> > The only way to guarantee that an unhealthy or unresponsive node stops
> all
> > access to shared resources is to power off or reboot the node. (In the
> case
> > of resources that rely on shared storage, I/O fencing instead of power
> > fencing can also work, but that's not ideal.)
> >
> > As others have said, SBD is a great option. Use it if you can. There are
> > also power fencing methods (one example is fence_ipmilan, but the options
> > available depend on your hardware or virt platform) that are reliable
> under
> > most circumstances.
> >
> > You said that when you stop corosync on node 2, Pacemaker tries to fence
> > node 2. There are a couple of possible reasons for that. One possibility
> is
> > that you stopped or killed corosync without stopping Pacemaker first. (If
> > you use pcs, then try `pcs cluster stop`.) Another possibility is that
> > resources failed to stop during cluster shutdown on node 2, causing node
> 2
> > to be fenced.
> >
> > On Wed, Jul 29, 2020 at 12:47 AM Andrei Borzenkov 
> > wrote:
> >
> >>
> >>
> >> On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon 
> >> wrote:
> >>
> >>> That one was taken from a specific implementation on Solaris 11.
> >>> The situation is a dual node server with shared storage controller:
> both
> >>> nodes see the same disks concurrently.
> >>> Here we must be sure that the two nodes are not going to import/mount
> the
> >>> same zpool at the same time, or we will encounter data corruption:
> >>>
> >>
> >> ssh based "stonith" cannot guarantee it.
> >>
> >>
> >>
> >>> node 1 will be perferred for pool 1, node 2 for pool 2, only in case
> one
> >>> of the node goes down or is taken offline the resources should be first
> >>> free by the leaving node and taken by the other node.
> >>>
> >>> Would you suggest one of the available stonith in this case?
> >>>
> >>>
> >>
> >> IPMI, managed PDU, SBD ...
> >>
> >> In practice, the only stonith method that works in case of complete node
> >> outage including any power supply is SBD.
> >> ___
> >> Manage your subscription:
> >> https://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> ClusterLabs home: https://www.clusterlabs.org/
> >>
> >
> >
> > --
> > Regards,
> >
> > Reid Wahl, RHCA
> > Software Maintenance Engineer, Red Hat
> > CEE - Platform Support Delivery - ClusterHA
>
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>

-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ip address configuration problem

2020-07-28 Thread Reid Wahl
On Tuesday, July 28, 2020, Ulrich Windl 
wrote:
>>>> Gabriele Bulfon  schrieb am 28.07.2020 um 09:35 in
> Nachricht <1046247888.11369.1595921749049@www>:
>> Thanks, I patched all the scripts in build to have "#!/bin/bash" in
head, and
>> I receive no errors now.
>
> If it's needed, those scripts were buggy anyway.

How does that mean the script is buggy? It would depend on what /bin/sh is
linked to on a particular system.

>There is a "checkbashisms" program (in SLES at least) that can check
whether a shell script actually needs BASH (or compatible)...

That's available for RHEL too. The question is whether it's worth modifying
libraries that are core to resource-agents, essentially for compatibility
with vanilla ksh. The conservative answer is no, although there is a case
to be made in favor of the change. The local keyword has been there for
years. A lot of shells besides bash support the local keyword, and even if
ksh is the default shell on a user's system, they can likely use a
different one if needed, as Gabriele has done.
>
>> Though, the IP is not configured :( I'm looking at it...
>> Is there any easy way to debug what's doing on the IP script?
>>
>> Gabriele
>>
>>
>> Sonicle S.r.l.
>> :
>> http://www.sonicle.com
>> Music:
>> http://www.gabrielebulfon.com
>> Quantum Mechanics :
>> http://www.cdbaby.com/cd/gabrielebulfon
>>

>> --
>> Da: Ulrich Windl
>> A: users@clusterlabs.org
>> Data: 28 luglio 2020 9.12.41 CEST
>> Oggetto: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem
>> You could try replacing "local" with "typeset", also.
>> Reid Wahl
>> schrieb am 28.07.2020 um 09:05 in Nachricht
>> :
>> By the way, it doesn't necessarily have to be bash. Upon looking
further, a
>> lot of shells support the `local` keyword, even though it's not required
by
>> the POSIX standard. Plain ksh, however, does not :(
>> On Monday, July 27, 2020, Reid Wahl
>> wrote:
>> Hi, Gabriele. The `local` keyword is a bash built-in and not available in
>> some other shells (e.g., ksh). It's used in `have_binary()`, so it's
>> causing `check_binary(/usr/gnu/bin/awk)` to fail. It's also causing all
the
>> "local: not found" errors. I just reproduced it to make sure.
>> check_binary () {
>> if ! have_binary "$1"; then
>> if [ "$OCF_NOT_RUNNING" = 7 ]; then
>> # Chances are we have a fully setup OCF environment
>> ocf_exit_reason "Setup problem: couldn't find command: $1"
>> else
>> echo "Setup problem: couldn't find command: $1"
>> fi
>> exit $OCF_ERR_INSTALLED
>> fi
>> }
>> have_binary () {
>> if [ "$OCF_TESTER_FAIL_HAVE_BINARY" = "1" ]; then
>> false
>> else
>> local bin=`echo $1 | sed -e 's/ -.*//'`
>> test -x "`which $bin 2>/dev/null`"
>> fi
>> }
>> Is bash available on your system?
>> On Mon, Jul 27, 2020 at 8:34 AM Gabriele Bulfon
>> wrote:
>> Hello,
>> after configuring crm for IP automatic configuration, I stumbled upon a
>> problem with the IPaddr utiliy that I don't understand:
>> IPaddr(xstha2_san0_IP)[10439]: 2020/07/27_17:26:17 ERROR: Setup problem:
>> couldn't find command: /usr/gnu/bin/awk
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such
>> file or directory] ]
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local: not found [No such
>> file or directory] ]
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local: not found [No such
>> file or directory] ]
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[356]: local: not found [No such
>> file or directory] ]
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [ ocf-exit-reason:Setup problem:
>> couldn't find command: /usr/gnu/bin/awk ]
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[185]: local: not found [No such
>> file or directory] ]
>> Jul 27 

Re: [ClusterLabs] ip address configuration problem

2020-07-28 Thread Reid Wahl
That's true, but it's nice to have a tool set the required environment
variables for you.
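For comparison, doing it by hand means exporting the OCF variables yourself --
roughly like this (the resource name and ip value are illustrative, borrowed
from the IPaddr logs in this thread; use whatever is actually in your CIB):

    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESOURCE_INSTANCE=xstha2_san0_IP
    export OCF_RESKEY_ip=192.168.1.100
    bash -x /usr/lib/ocf/resource.d/heartbeat/IPaddr start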

On Tuesday, July 28, 2020, Ulrich Windl 
wrote:
>>>> Reid Wahl  schrieb am 28.07.2020 um 09:43 in
Nachricht
> :
>> Not at my computer but I believe `crm_resource -r <resource>
--force-start`.
>>
>> On Tuesday, July 28, 2020, Gabriele Bulfon  wrote:
>>> Thanks, I patched all the scripts in build to have "#!/bin/bash" in
head,
>> and I receive no errors now.
>>> Though, the IP is not configured :( I'm looking at it...
>>>
>>> Is there any easy way to debug what's doing on the IP script?
>
> You could run it manually with "bash -x ..." to see what's going on. This
works for any script, but some OCF RAs even have debugging messages built
in. ocf-tester's "-d" option will turn on debugging in the RA.
>
>>>
>>> Gabriele
>>>
>>>
>>>
>>> Sonicle S.r.l. : http://www.sonicle.com
>>> Music: http://www.gabrielebulfon.com
>>> Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
>>>
>>>
>>>
>>

>> --
>>>
>>> Da: Ulrich Windl 
>>> A: users@clusterlabs.org
>>> Data: 28 luglio 2020 9.12.41 CEST
>>> Oggetto: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem
>>>
>>> You could try replacing "local" with "typeset", also.
>>>
>>>>>> Reid Wahl  schrieb am 28.07.2020 um 09:05 in
>> Nachricht
>>> :
>>>> By the way, it doesn't necessarily have to be bash. Upon looking
>> further, a
>>>> lot of shells support the `local` keyword, even though it's not
required
>> by
>>>> the POSIX standard. Plain ksh, however, does not :(
>>>>
>>>> On Monday, July 27, 2020, Reid Wahl  wrote:
>>>>> Hi, Gabriele. The `local` keyword is a bash built-in and not available
>> in
>>>> some other shells (e.g., ksh). It's used in `have_binary()`, so it's
>>>> causing `check_binary(/usr/gnu/bin/awk)` to fail. It's also causing all
>> the
>>>> "local: not found" errors. I just reproduced it to make sure.
>>>>>
>>>>> check_binary () {
>>>>> if ! have_binary "$1"; then
>>>>> if [ "$OCF_NOT_RUNNING" = 7 ]; then
>>>>> # Chances are we have a fully setup OCF environment
>>>>> ocf_exit_reason "Setup problem: couldn't find command: $1"
>>>>> else
>>>>> echo "Setup problem: couldn't find command: $1"
>>>>> fi
>>>>> exit $OCF_ERR_INSTALLED
>>>>> fi
>>>>> }
>>>>>
>>>>> have_binary () {
>>>>> if [ "$OCF_TESTER_FAIL_HAVE_BINARY" = "1" ]; then
>>>>> false
>>>>> else
>>>>> local bin=`echo $1 | sed -e 's/ -.*//'`
>>>>> test -x "`which $bin 2>/dev/null`"
>>>>> fi
>>>>> }
>>>>> Is bash available on your system?
>>>>> On Mon, Jul 27, 2020 at 8:34 AM Gabriele Bulfon 
>>>> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> after configuring crm for IP automatic configuration, I stumbled
upon a
>>>> problem with the IPaddr utiliy that I don't understand:
>>>>>>
>>>>>> IPaddr(xstha2_san0_IP)[10439]: 2020/07/27_17:26:17 ERROR: Setup
>> problem:
>>>> couldn't find command: /usr/gnu/bin/awk
>>>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>>>> xstha2_san0_IP_start_0:10439:stderr [
>>>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such
>>>> file or directory] ]
>>>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>>>> xstha2_san0_IP_start_0:10439:stderr [
>>>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local: not found [No
such
>>>> file or directory] ]
>>>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>>>> xstha2_san0_IP_start_0:10439:stderr [
>>>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local: not found [No
such
>>>> file or directory] ]
>>>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>>>> xstha2_san0_IP_start_0:10439:stderr [
>>>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[35

Re: [ClusterLabs] ip address configuration problem

2020-07-27 Thread Reid Wahl
Hi, Gabriele. The `local` keyword is a bash built-in and not available in
some other shells (e.g., ksh). It's used in `have_binary()`, so it's
causing `check_binary(/usr/gnu/bin/awk)` to fail. It's also causing all the
"local: not found" errors. I just reproduced it to make sure.

check_binary () {
if ! have_binary "$1"; then
if [ "$OCF_NOT_RUNNING" = 7 ]; then
# Chances are we have a fully setup OCF environment
ocf_exit_reason "Setup problem: couldn't find command: $1"
else
echo "Setup problem: couldn't find command: $1"
fi
exit $OCF_ERR_INSTALLED
fi
}

have_binary () {
if [ "$OCF_TESTER_FAIL_HAVE_BINARY" = "1" ]; then
false
else
local bin=`echo $1 | sed -e 's/ -.*//'`
test -x "`which $bin 2>/dev/null`"
fi
}

Is bash available on your system?

On Mon, Jul 27, 2020 at 8:34 AM Gabriele Bulfon  wrote:

> Hello,
>
> after configuring crm for IP automatic configuration, I stumbled upon a
> problem with the IPaddr utiliy that I don't understand:
>
> IPaddr(xstha2_san0_IP)[10439]: 2020/07/27_17:26:17 ERROR: Setup problem:
> couldn't find command: /usr/gnu/bin/awk
> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
> xstha2_san0_IP_start_0:10439:stderr [
> /usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such
> file or directory] ]
> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
> xstha2_san0_IP_start_0:10439:stderr [
> /usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local: not found [No such
> file or directory] ]
> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
> xstha2_san0_IP_start_0:10439:stderr [
> /usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local: not found [No such
> file or directory] ]
> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
> xstha2_san0_IP_start_0:10439:stderr [
> /usr/lib/ocf/resource.d/heartbeat/IPaddr[356]: local: not found [No such
> file or directory] ]
> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
> xstha2_san0_IP_start_0:10439:stderr [ ocf-exit-reason:Setup problem:
> couldn't find command: /usr/gnu/bin/awk ]
> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
> xstha2_san0_IP_start_0:10439:stderr [
> /usr/lib/ocf/resource.d/heartbeat/IPaddr[185]: local: not found [No such
> file or directory] ]
> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
> xstha2_san0_IP_start_0:10439:stderr [
> /usr/lib/ocf/resource.d/heartbeat/IPaddr[186]: local: not found [No such
> file or directory] ]
> Jul 27 17:26:17 [10258] lrmd: info: log_finished: finished -
> rsc:xstha2_san0_IP action:start call_id:22 pid:10439 exit-code:5
> exec-time:91ms queue-time:0ms
>
> It says it cannot find /usr/gnu/bin/awk but this is absolutely not true!
>
> sonicle@xstorage1:/sonicle/home# ls -l /usr/gnu/bin/awk
> -r-xr-xr-x 1 root bin 881864 Jun 1 12:25 /usr/gnu/bin/awk
>
> sonicle@xstorage1:/sonicle/home# file /usr/gnu/bin/awk
> /usr/gnu/bin/awk: ELF 64-bit LSB executable AMD64 Version 1, dynamically
> linked, not stripped, no debugging information available
>
> what may be happening??
>
> Thanks!
> Gabriele
>
>
>
>
>
>
> *Sonicle S.r.l. *: http://www.sonicle.com
> *Music: *http://www.gabrielebulfon.com
> *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ip address configuration problem

2020-07-28 Thread Reid Wahl
By the way, it doesn't necessarily have to be bash. Upon looking further, a
lot of shells support the `local` keyword, even though it's not required by
the POSIX standard. Plain ksh, however, does not :(

On Monday, July 27, 2020, Reid Wahl  wrote:
> Hi, Gabriele. The `local` keyword is a bash built-in and not available in
some other shells (e.g., ksh). It's used in `have_binary()`, so it's
causing `check_binary(/usr/gnu/bin/awk)` to fail. It's also causing all the
"local: not found" errors. I just reproduced it to make sure.
>
> check_binary () {
> if ! have_binary "$1"; then
> if [ "$OCF_NOT_RUNNING" = 7 ]; then
> # Chances are we have a fully setup OCF environment
> ocf_exit_reason "Setup problem: couldn't find command: $1"
> else
> echo "Setup problem: couldn't find command: $1"
> fi
> exit $OCF_ERR_INSTALLED
> fi
> }
>
> have_binary () {
> if [ "$OCF_TESTER_FAIL_HAVE_BINARY" = "1" ]; then
> false
> else
> local bin=`echo $1 | sed -e 's/ -.*//'`
> test -x "`which $bin 2>/dev/null`"
> fi
> }
> Is bash available on your system?
> On Mon, Jul 27, 2020 at 8:34 AM Gabriele Bulfon 
wrote:
>>
>> Hello,
>>
>> after configuring crm for IP automatic configuration, I stumbled upon a
problem with the IPaddr utiliy that I don't understand:
>>
>> IPaddr(xstha2_san0_IP)[10439]: 2020/07/27_17:26:17 ERROR: Setup problem:
couldn't find command: /usr/gnu/bin/awk
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such
file or directory] ]
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local: not found [No such
file or directory] ]
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local: not found [No such
file or directory] ]
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[356]: local: not found [No such
file or directory] ]
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [ ocf-exit-reason:Setup problem:
couldn't find command: /usr/gnu/bin/awk ]
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[185]: local: not found [No such
file or directory] ]
>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[186]: local: not found [No such
file or directory] ]
>> Jul 27 17:26:17 [10258] lrmd: info: log_finished: finished -
rsc:xstha2_san0_IP action:start call_id:22 pid:10439 exit-code:5
exec-time:91ms queue-time:0ms
>>
>> It says it cannot find /usr/gnu/bin/awk but this is absolutely not true!
>>
>> sonicle@xstorage1:/sonicle/home# ls -l /usr/gnu/bin/awk
>> -r-xr-xr-x 1 root bin 881864 Jun 1 12:25 /usr/gnu/bin/awk
>>
>> sonicle@xstorage1:/sonicle/home# file /usr/gnu/bin/awk
>> /usr/gnu/bin/awk: ELF 64-bit LSB executable AMD64 Version 1, dynamically
linked, not stripped, no debugging information available
>>
>> what may be happening??
>>
>> Thanks!
>> Gabriele
>>
>>
>>
>>
>>
>>
>> Sonicle S.r.l. : http://www.sonicle.com
>> Music: http://www.gabrielebulfon.com
>> Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA

-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ip address configuration problem

2020-07-28 Thread Reid Wahl
Not at my computer but I believe `crm_resource -r <resource> --force-start`.

On Tuesday, July 28, 2020, Gabriele Bulfon  wrote:
> Thanks, I patched all the scripts in build to have "#!/bin/bash" in head,
and I receive no errors now.
> Though, the IP is not configured :( I'm looking at it...
>
> Is there any easy way to debug what's doing on the IP script?
>
> Gabriele
>
>
>
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
>
>
>
--
>
> Da: Ulrich Windl 
> A: users@clusterlabs.org
> Data: 28 luglio 2020 9.12.41 CEST
> Oggetto: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem
>
> You could try replacing "local" with "typeset", also.
>
>>>> Reid Wahl  schrieb am 28.07.2020 um 09:05 in
Nachricht
> :
>> By the way, it doesn't necessarily have to be bash. Upon looking
further, a
>> lot of shells support the `local` keyword, even though it's not required
by
>> the POSIX standard. Plain ksh, however, does not :(
>>
>> On Monday, July 27, 2020, Reid Wahl  wrote:
>>> Hi, Gabriele. The `local` keyword is a bash built-in and not available
in
>> some other shells (e.g., ksh). It's used in `have_binary()`, so it's
>> causing `check_binary(/usr/gnu/bin/awk)` to fail. It's also causing all
the
>> "local: not found" errors. I just reproduced it to make sure.
>>>
>>> check_binary () {
>>> if ! have_binary "$1"; then
>>> if [ "$OCF_NOT_RUNNING" = 7 ]; then
>>> # Chances are we have a fully setup OCF environment
>>> ocf_exit_reason "Setup problem: couldn't find command: $1"
>>> else
>>> echo "Setup problem: couldn't find command: $1"
>>> fi
>>> exit $OCF_ERR_INSTALLED
>>> fi
>>> }
>>>
>>> have_binary () {
>>> if [ "$OCF_TESTER_FAIL_HAVE_BINARY" = "1" ]; then
>>> false
>>> else
>>> local bin=`echo $1 | sed -e 's/ -.*//'`
>>> test -x "`which $bin 2>/dev/null`"
>>> fi
>>> }
>>> Is bash available on your system?
>>> On Mon, Jul 27, 2020 at 8:34 AM Gabriele Bulfon 
>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> after configuring crm for IP automatic configuration, I stumbled upon a
>> problem with the IPaddr utiliy that I don't understand:
>>>>
>>>> IPaddr(xstha2_san0_IP)[10439]: 2020/07/27_17:26:17 ERROR: Setup
problem:
>> couldn't find command: /usr/gnu/bin/awk
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such
>> file or directory] ]
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local: not found [No such
>> file or directory] ]
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local: not found [No such
>> file or directory] ]
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[356]: local: not found [No such
>> file or directory] ]
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [ ocf-exit-reason:Setup problem:
>> couldn't find command: /usr/gnu/bin/awk ]
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[185]: local: not found [No such
>> file or directory] ]
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[186]: local: not found [No such
>> file or directory] ]
>>>> Jul 27 17:26:17 [10258] lrmd: info: log_finished: finished -
>> rsc:xstha2_san0_IP action:start call_id:22 pid:10439 exit-code:5
>> exec-time:91ms queue-time:0ms
>>>>
>>>> It says it cannot find /usr/gnu/bin/awk but this is absolutely not
true!
>>>>
>>>> sonicle@xstorage1:/sonicle/home# ls -l /usr/gnu/bin/awk
>>>> -r-xr-xr-x 1 root bin 881864 Jun 1 12:25 /usr/gnu/bin/awk
>>>>
>>>

Re: [ClusterLabs] ip address configuration problem

2020-07-28 Thread Reid Wahl
Great! And it would be --force-start --verbose --verbose.
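Spelled out with the resource ID from the earlier logs (substitute your own):

    crm_resource --resource xstha2_san0_IP --force-start --verbose --verbose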

On Tuesday, July 28, 2020, Gabriele Bulfon  wrote:
> Sorry, found the reason, I have to patch all the scripts, others I missed.
>
> Gabriele
>
>
>
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
> 
>
> Da: Gabriele Bulfon 
> A: Cluster Labs - All topics related to open-source clustering welcomed <
users@clusterlabs.org>
> Data: 28 luglio 2020 9.35.49 CEST
> Oggetto: Re: [ClusterLabs] Antw: [EXT] Re: ip address configuration
problem
>
>
>
> Thanks, I patched all the scripts in build to have "#!/bin/bash" in head,
and I receive no errors now.
> Though, the IP is not configured :( I'm looking at it...
>
> Is there any easy way to debug what's doing on the IP script?
>
> Gabriele
>
>
>
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
>
>
>
--
>
> Da: Ulrich Windl 
> A: users@clusterlabs.org
> Data: 28 luglio 2020 9.12.41 CEST
> Oggetto: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem
>
> You could try replacing "local" with "typeset", also.
>
>>>> Reid Wahl  schrieb am 28.07.2020 um 09:05 in
Nachricht
> :
>> By the way, it doesn't necessarily have to be bash. Upon looking
further, a
>> lot of shells support the `local` keyword, even though it's not required
by
>> the POSIX standard. Plain ksh, however, does not :(
>>
>> On Monday, July 27, 2020, Reid Wahl  wrote:
>>> Hi, Gabriele. The `local` keyword is a bash built-in and not available
in
>> some other shells (e.g., ksh). It's used in `have_binary()`, so it's
>> causing `check_binary(/usr/gnu/bin/awk)` to fail. It's also causing all
the
>> "local: not found" errors. I just reproduced it to make sure.
>>>
>>> check_binary () {
>>> if ! have_binary "$1"; then
>>> if [ "$OCF_NOT_RUNNING" = 7 ]; then
>>> # Chances are we have a fully setup OCF environment
>>> ocf_exit_reason "Setup problem: couldn't find command: $1"
>>> else
>>> echo "Setup problem: couldn't find command: $1"
>>> fi
>>> exit $OCF_ERR_INSTALLED
>>> fi
>>> }
>>>
>>> have_binary () {
>>> if [ "$OCF_TESTER_FAIL_HAVE_BINARY" = "1" ]; then
>>> false
>>> else
>>> local bin=`echo $1 | sed -e 's/ -.*//'`
>>> test -x "`which $bin 2>/dev/null`"
>>> fi
>>> }
>>> Is bash available on your system?
>>> On Mon, Jul 27, 2020 at 8:34 AM Gabriele Bulfon 
>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> after configuring crm for IP automatic configuration, I stumbled upon a
>> problem with the IPaddr utiliy that I don't understand:
>>>>
>>>> IPaddr(xstha2_san0_IP)[10439]: 2020/07/27_17:26:17 ERROR: Setup
problem:
>> couldn't find command: /usr/gnu/bin/awk
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such
>> file or directory] ]
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local: not found [No such
>> file or directory] ]
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local: not found [No such
>> file or directory] ]
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[356]: local: not found [No such
>> file or directory] ]
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [ ocf-exit-reason:Setup problem:
>> couldn't find command: /usr/gnu/bin/awk ]
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr[185]: local: not found [No such
>> file or directory] ]
>>>> Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
>> xstha2_san0_IP_start_0:10439:stderr [
>> /usr/lib/ocf/resource.d/hea

Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Reid Wahl
cal systemd[1]:
> /lib/systemd/system/docker.socket:6: ListenStream= references a path below
> legacy directory /var/run/, updating /var/run/docker.sock →
> /run/docker.sock; please update the unit file accordingly.
> >>> Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up
> on dummy.service start (rc=0): timeout (elapsed=259719ms,
> remaining=-159719ms)
> >>> Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result
> of start operation for dummy.service on node2.local: Timed Out
> >>> Jul 21 15:53:42 node2.local systemd[1]: Started Cluster Controlled
> dummy.
> >>> Jul 21 15:53:42 node2.local dummy[9330]: hello world 1
> >>> Jul 21 15:53:42 node2.local systemd-udevd[922]: Network interface
> NamePolicy= disabled on kernel command line, ignoring.
> >>> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> fail-count-dummy.service#start_0[node2.local]: (unset) -> INFINITY
> >>> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> last-failure-dummy.service#start_0[node2.local]: (unset) -> 1595336022
> >>> Jul 21 15:53:42 node2.local systemd[1]: Reloading.
> >>> Jul 21 15:53:42 node2.local systemd[1]:
> /lib/systemd/system/dbus.socket:5: ListenStream= references a path below
> legacy directory /var/run/, updating /var/run/dbus/system_bus_socket →
> /run/dbus/system_bus_socket; please update the unit file accordingly.
> >>> Jul 21 15:53:42 node2.local systemd[1]:
> /lib/systemd/system/docker.socket:6: ListenStream= references a path below
> legacy directory /var/run/, updating /var/run/docker.sock →
> /run/docker.sock; please update the unit file accordingly.
> >>> Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up
> on dummy.service stop (rc=0): timeout (elapsed=317181ms,
> remaining=-217181ms)
> >>
> >> 317181ms == 5 minutes. Barring pacemaker bug, you need to show
> pacemaker log since the very first start operation so we can see proper
> timing. Seeing that systemd was reloaded in between, it is quite possible
> that systemd lost track of pending job so any client waiting for
> confirmation hangs forever. Such problems were known, not sure what current
> status is (if it ever was fixed).
> >>
> >>> Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result
> of stop operation for dummy.service on node2.local: Timed Out
> >>> Jul 21 15:53:42 node2.local systemd[1]: Stopping Daemon for dummy...
> >>> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> fail-count-dummy.service#stop_0[node2.local]: (unset) -> INFINITY
> >>> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> last-failure-dummy.service#stop_0[node2.local]: (unset) -> 1595336022
> >>> Jul 21 15:53:42 node2.local systemd[1]: dummy.service: Succeeded.
> >>> Jul 21 15:53:42 node2.local systemd[1]: Stopped Daemon for dummy.
> >>> ... lost connection (node rebooting)
> >>>
> >>>   ___
> >>> Manage your subscription:
> >>> https://lists.clusterlabs.org/mailman/listinfo/users
> >>>
> >>> ClusterLabs home:  https://www.clusterlabs.org/
> >
> >
> > --
> > Хиль Эдуард
> >
> >
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker Shutdown

2020-07-22 Thread Reid Wahl
On Tue, Jul 21, 2020 at 11:42 PM Harvey Shepherd <
harvey.sheph...@aviatnet.com> wrote:

> Hi All,
>
> I'm running Pacemaker 2.0.3 on a two-node cluster, controlling 40+
> resources which are a mixture of clones and other resources that are
> colocated with the master instance of certain clones. I've noticed that if
> I terminate pacemaker on the node that is hosting the master instances of
> the clones, Pacemaker focuses on stopping resources on that node BEFORE
> failing over to the other node, leading to a longer outage than necessary.
> Is there a way to change this behaviour?
>

Hi, Harvey.

As you likely know, a given active/passive resource will have to
stop on one node before it can start on another node, and the same goes for
a promoted clone instance having to demote on one node before it can
promote on another. There are exceptions for clone instances and for
promotable clones with promoted-max > 1 ("allow more than one master
instance"). A resource that's configured to run on one node at a time
should not try to run on two nodes during failover.

With that in mind, what exactly are you wanting to happen? Is the problem
that all resources are stopping on node 1 before *any* of them start on
node 2? Or that you want Pacemaker shutdown to kill the processes on node 1
instead of cleanly shutting them down? Or something different?

These are the actions and logs I saw during the test:
>

Ack. This seems like it's just telling us that Pacemaker is going through a
graceful shutdown. The info more relevant to the resource stop/start order
would be in /var/log/pacemaker/pacemaker.log (or less detailed in
/var/log/messages) on the DC.

# /etc/init.d/pacemaker stop
> Signaling Pacemaker Cluster Manager to terminate
>
> Waiting for cluster services to
> unload..sending
> signal 9 to procs
>
>
> 2020 Jul 22 06:16:50.581 Chassis2 daemon.notice CTR8740 pacemaker.
> Signaling Pacemaker Cluster Manager to terminate
> 2020 Jul 22 06:16:50.599 Chassis2 daemon.notice CTR8740 pacemaker. Waiting
> for cluster services to unload
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: new_event_notification (6140-6141-9): Broken
> pipe (32)
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: Notification of client
> stonithd/665bde82-cb28-40f7-9132-8321dc2f1992 failed
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: new_event_notification (6140-6143-8): Broken
> pipe (32)
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: Notification of client
> attrd/a26ca273-3422-4ebe-8cb7-95849b8ff130 failed
> 2020 Jul 22 06:18:03.320 Chassis1 daemon.warning CTR8740
> pacemaker-schedulerd.6240  warning: Blind faith: not fencing unseen nodes
> 2020 Jul 22 06:18:58.941 Chassis2 user.crit CTR8740 supervisor. pacemaker
> is inactive (3).
>
> Regards,
> Harvey
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-08-15 Thread Reid Wahl
On Fri, Aug 14, 2020 at 6:10 AM Gabriele Bulfon  wrote:

> Thanks to all your suggestions, I now have the systems with stonith
> configured on ipmi.
>
> Two questions:
> - how can I simulate a stonith situation to check that everything is ok?
>

You can run `stonith_admin -B <node>` to tell Pacemaker to reboot the node
using the configured stonith devices. If you want to test a network
failure, you can have iptables block inbound and outbound traffic on the
heartbeat IP address on one node.
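For example (the node name and heartbeat address are placeholders for your
environment):

    # ask the cluster to fence/reboot the peer through the configured devices
    stonith_admin -B <node>

    # simulate a heartbeat failure by dropping traffic to/from the peer
    iptables -A INPUT  -s <peer-heartbeat-ip> -j DROP
    iptables -A OUTPUT -d <peer-heartbeat-ip> -j DROP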


> - considering that I have both nodes with stonith against the other node,
> once the two nodes can communicate, how can I be sure the two nodes will
> not try to stonith each other?
>

The simplest option is to add a delay attribute (e.g., delay=10) to one of
the stonith devices. That way, if both nodes want to fence each other, the
node whose stonith device has a delay configured will wait for the delay to
expire before executing the reboot action.
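For example, assuming your agent accepts a delay parameter (fence_ipmilan
does) and "fence-node1" stands in for your actual stonith resource ID:

    crm_resource --resource fence-node1 --set-parameter delay --parameter-value 10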

Alternatively, you can set up corosync-qdevice, using a separate system
running qnetd server as a quorum arbitrator.


> :)
> Thanks!
> Gabriele
>
>
>
> *Sonicle S.r.l. *: http://www.sonicle.com
> *Music: *http://www.gabrielebulfon.com
> *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
>
> --
>
>
> *Da:* Gabriele Bulfon 
> *A:* Cluster Labs - All topics related to open-source clustering welcomed
> 
> *Data:* 29 luglio 2020 14.22.42 CEST
> *Oggetto:* Re: [ClusterLabs] Antw: [EXT] Stonith failing
>
>
>
> It is a ZFS based illumos system.
> I don't think SBD is an option.
> Is there a reliable ZFS based stonith?
>
> Gabriele
>
>
>
> *Sonicle S.r.l. *: http://www.sonicle.com
> *Music: *http://www.gabrielebulfon.com
> *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
>
> --
>
>
> *Da:* Andrei Borzenkov 
> *A:* Cluster Labs - All topics related to open-source clustering welcomed
> 
> *Data:* 29 luglio 2020 9.46.09 CEST
> *Oggetto:* Re: [ClusterLabs] Antw: [EXT] Stonith failing
>
>
>
>
> On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon 
> wrote:
>
>> That one was taken from a specific implementation on Solaris 11.
>> The situation is a dual node server with shared storage controller: both
>> nodes see the same disks concurrently.
>> Here we must be sure that the two nodes are not going to import/mount the
>> same zpool at the same time, or we will encounter data corruption:
>>
>
> ssh based "stonith" cannot guarantee it.
>
>
>> node 1 will be perferred for pool 1, node 2 for pool 2, only in case one
>> of the node goes down or is taken offline the resources should be first
>> free by the leaving node and taken by the other node.
>>
>> Would you suggest one of the available stonith in this case?
>>
>>
>
> IPMI, managed PDU, SBD ...
> In practice, the only stonith method that works in case of complete node
> outage including any power supply is SBD.
>
> ___
> Manage your subscription:https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
> ___
> Manage your subscription:https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Configuring millisecond timestamps in pacemaker.log.

2021-01-11 Thread Reid Wahl
It doesn't look like it to me, but that would be a cool feature. Corosync
now implements the %T (milliseconds) time spec from libqb if it's available
in the provided libqb version. Pacemaker uses %t (seconds).
  -
https://github.com/ClusterLabs/libqb/blob/v2.0.2/lib/log_format.c#L396-L417
  -
https://github.com/corosync/corosync/blob/v3.1.0/exec/logconfig.c#L202-L227
  -
https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.5/lib/common/logging.c#L147-L151
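
For reference, a corosync.conf logging sketch with the hi-res timestamps
enabled (only effective when the installed libqb provides %T; the log file
path is illustrative):

  logging {
      to_logfile: yes
      logfile: /var/log/cluster/corosync.log
      timestamp: hires
  }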

On Mon, Jan 11, 2021 at 11:12 AM Gerry R Sommerville 
wrote:

> Hello,
>
> I am wondering if it is possible to configure high resolution timestamps
> (including milliseconds) in the pacemaker.log? I was able to get hi-res
> timestamps in the corosync.log by adding 'timestamp: hires' under the
> logging directive in corosync.conf. I was hoping Pacemaker has something
> similar but I don't see anything in '/etc/sysconfig/pacemaker' or the
> Pacemaker documentation regarding hi-res timestamps.
>
> Gerry Sommerville
> Db2 Development, pureScale Domain
> E-mail: ge...@ca.ibm.com
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Configuring millisecond timestamps in pacemaker.log.

2021-01-11 Thread Reid Wahl
On Mon, Jan 11, 2021 at 12:17 PM Ken Gaillot  wrote:

> Pacemaker doesn't currently support it, sorry. It should be pretty easy
> to add though (when built with libqb 2), so hopefully we can get it in
> 2.1.0.
>
> Of course Pacemaker has always supported logging via syslog, and syslog
> can be configured to use high-res timestamps, so that's a workaround.
>
> Does anyone have a strong opinion regarding using high-res timestamps
> in the Pacemaker detail log whenever supported, vs adding a new
> sysconfig option for it? I feel like we have a ridiculous number of
> options already, and the detail log is expected to be verbose.
>

Not a **strong** opinion, but I'm in favor of it as a member of the support
team. If time is in sync across cluster nodes, then millisecond logs could
provide some clearer insight into some of the weirder timing issues that
I've shown you (e.g., involving CIB sync after a network issue).
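
For the syslog workaround mentioned above, something like this in rsyslog
should be enough (a sketch; RSYSLOG_FileFormat is rsyslog's built-in template
with high-resolution RFC 3339 timestamps):

  # /etc/rsyslog.conf
  $ActionFileDefaultTemplate RSYSLOG_FileFormat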


> On Mon, 2021-01-11 at 19:12 +, Gerry R Sommerville wrote:
> > Hello,
> >
> > I am wondering if it is possible to configure high resolution
> > timestamps (including milliseconds) in the pacemaker.log? I was able
> > to get hi-res timestamps in the corosync.log by adding 'timestamp:
> > hires' under the logging directive in corosync.conf. I was hoping
> > Pacemaker has something similar but I don't see anything in
> > '/etc/sysconfig/pacemaker' or the Pacemaker documentation regarding
> > hi-res timestamps.
> >
> > Gerry Sommerville
> > Db2 Development, pureScale Domain
> > E-mail: ge...@ca.ibm.com
> --
> Ken Gaillot 
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pending Fencing Actions shown in pcs status

2021-01-06 Thread Reid Wahl
On Wed, Jan 6, 2021 at 11:07 PM  wrote:
>
> Hi Steffen,
> Hi Reid,
>
> I also checked the CentOS source rpm and it seems to include a fix for the 
> problem.
>
> As Steffen suggested, if you share your CIB settings, I might know something.
>
> If this issue is the same one that fix addresses, the pending action will only
> be displayed on the DC node and will not affect the operation.

According to Steffen's description, the "pending" is displayed only on
node 1, while the DC is node 3. That's another thing that makes me
wonder if this is a distinct issue.

> The pending actions shown will remain for a long time, but will not have a 
> negative impact on the cluster.
>
> Best Regards,
> Hideo Yamauchi.
>
>
> - Original Message -
> > From: Reid Wahl 
> > To: Cluster Labs - All topics related to open-source clustering welcomed 
> > 
> > Cc:
> > Date: 2021/1/7, Thu 15:58
> > Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
> >
> > It's supposedly fixed in that version.
> >   - https://bugzilla.redhat.com/show_bug.cgi?id=1787749
> >   - https://access.redhat.com/solutions/4713471
> >
> > So you may be hitting a different issue (unless there's a bug in the
> > pcmk 1.1 backport of the fix).
> >
> > I may be a little bit out of my area of knowledge here, but can you
> > share the CIBs from nodes 1 and 3? Maybe Hideo, Klaus, or Ken has some
> > insight.
> >
> > On Wed, Jan 6, 2021 at 10:53 PM Steffen Vinther Sørensen
> >  wrote:
> >>
> >>  Hi Hideo,
> >>
> >>  If the fix is not going to make it into the CentOS7 pacemaker version,
> >>  I guess the stable approach to take advantage of it is to build the
> >>  cluster on another OS than CentOS7 ? A little late for that in this
> >>  case though :)
> >>
> >>  Regards
> >>  Steffen
> >>
> >>
> >>
> >>
> >>  On Thu, Jan 7, 2021 at 7:27 AM  wrote:
> >>  >
> >>  > Hi Steffen,
> >>  >
> >>  > The fix pointed out by Reid is affecting it.
> >>  >
> >>  > Since the fencing action requested by the DC node exists only in the
> > DC node, such an event occurs.
> >>  > You will need to take advantage of the modified pacemaker to resolve
> > the issue.
> >>  >
> >>  > Best Regards,
> >>  > Hideo Yamauchi.
> >>  >
> >>  >
> >>  >
> >>  > - Original Message -
> >>  > > From: Reid Wahl 
> >>  > > To: Cluster Labs - All topics related to open-source clustering
> > welcomed 
> >>  > > Cc:
> >>  > > Date: 2021/1/7, Thu 15:07
> >>  > > Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs
> > status
> >>  > >
> >>  > > Hi, Steffen. Are your cluster nodes all running the same
> > Pacemaker
> >>  > > versions? This looks like Bug 5401[1], which is fixed by upstream
> >>  > > commit df71a07[2]. I'm a little bit confused about why it
> > only shows
> >>  > > up on one out of three nodes though.
> >>  > >
> >>  > > [1] https://bugs.clusterlabs.org/show_bug.cgi?id=5401
> >>  > > [2] https://github.com/ClusterLabs/pacemaker/commit/df71a07
> >>  > >
> >>  > > On Tue, Jan 5, 2021 at 8:31 AM Steffen Vinther Sørensen
> >>  > >  wrote:
> >>  > >>
> >>  > >>  Hello
> >>  > >>
> >>  > >>  node 1 is showing this in 'pcs status'
> >>  > >>
> >>  > >>  Pending Fencing Actions:
> >>  > >>  * reboot of kvm03-node02.avigol-gcs.dk pending:
> > client=crmd.37819,
> >>  > >>  origin=kvm03-node03.avigol-gcs.dk
> >>  > >>
> >>  > >>  node 2 and node 3 outputs no such thing (node 3 is DC)
> >>  > >>
> >>  > >>  Google is not much help, how to investigate this further and
> > get rid
> >>  > >>  of such terrifying status message ?
> >>  > >>
> >>  > >>  Regards
> >>  > >>  Steffen
> >>  > >>  ___
> >>  > >>  Manage your subscription:
> >>  > >>  https://lists.clusterlabs.org/mailman/listinfo/users
> >>  > >>
> >>  > >>  ClusterLabs home: https://www.clusterlabs.org/
> >>  > >>
> >

Re: [ClusterLabs] Antw: [EXT] Re: Pending Fencing Actions shown in pcs status

2021-01-07 Thread Reid Wahl
On Thu, Jan 7, 2021 at 12:53 AM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Steffen Vinther Sørensen  schrieb am 07.01.2021 um
> 09:49 in
> Nachricht
> :
> > Hi Reid,
> >
> > I was under the impression that 'pcs config' was the CIB in a more
> > friendly format. Here is the 'pcs cluster cib' as requested
>
> I'd also think so (+/- parsing and presentation errors) ;-)
>

It's a friendly format for the <configuration> section of the CIB. It
doesn't include the <status> section of the CIB.

>
> > /Steffen
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pending Fencing Actions shown in pcs status

2021-01-06 Thread Reid Wahl
Hi, Steffen. Are your cluster nodes all running the same Pacemaker
versions? This looks like Bug 5401[1], which is fixed by upstream
commit df71a07[2]. I'm a little bit confused about why it only shows
up on one out of three nodes though.

[1] https://bugs.clusterlabs.org/show_bug.cgi?id=5401
[2] https://github.com/ClusterLabs/pacemaker/commit/df71a07

On Tue, Jan 5, 2021 at 8:31 AM Steffen Vinther Sørensen
 wrote:
>
> Hello
>
> node 1 is showing this in 'pcs status'
>
> Pending Fencing Actions:
> * reboot of kvm03-node02.avigol-gcs.dk pending: client=crmd.37819,
> origin=kvm03-node03.avigol-gcs.dk
>
> node 2 and node 3 outputs no such thing (node 3 is DC)
>
> Google is not much help, how to investigate this further and get rid
> of such terrifying status message ?
>
> Regards
> Steffen
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Q: warning: new_event_notification (4527-22416-14): Broken pipe (32)

2021-01-06 Thread Reid Wahl
> > > > 0
> >
> > aborted:
> > > Action lost
> > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  warning: rsc_op
> > > > 22:
> > >
> > > prm_stonith_sbd_start_0 on h18 timed out
>
> The above is just the consequence of not getting the start result --
> the controller considers the action timed out and aborts the current
> transition.
>
> > > > ...
> > > > Dec 18 09:31:15 h18 pacemaker-controld[4479]:  notice: Peer h16
> > > > was
> > >
> > > terminated (reboot) by h18 on behalf of pacemaker-controld.4527: OK
>
> Here the action fencing action returns success
>
> > > > Dec 18 09:31:17 h18 pacemaker-execd[4476]:  notice:
> > > > prm_stonith_sbd start
> > >
> > > (call 164) exited with status 0 (execution time 110960ms, queue
> > > time
> >
> > 15001ms)
>
> The device registration already in progress does eventually complete
> successfully. The controller has already considered it lost and moved
> on, so this result will be ignored.
>
> Looking at the queue time, I'm going to guess that what happened was
> that the status action that's part of the start was queued behind the
> reboot action in progress, and the queue time was enough to push it
> over the total expected timeout.

Looks plausible. Though it's strange to me that the queue time is only
15 seconds, that the reboot operation completes immediately after the
start operation fails, and that the executor then says the start
completed successfully after 110s of execution time :/

> This is a known issue when fence actions are serialized. The fencer
> starts its timeout once the action is begun (i.e. after queueing), but
> the controller doesn't know about the queueing and starts its timeout
> once it has submitted the request to the fencer (i.e. before queueing).
>
> Red Hat BZ#1858462 is related, but there's no ClusterLabs BZ at this
> point. It's a tricky problem, so it will be a significant project.
>
> > > It could be related to pending fencing but I am not familiar with
> > > low
> > > level details.
> >
> > It looks odd: First "started", then timed out with error, then
> > successful
> > (without being rescheduled it seems).
> >
> > >
> > > > ...
> > > > Dec 18 09:31:30 h18 pacemaker-controld[4479]:  notice: Peer h16
> > > > was
> > >
> > > terminated (reboot) by h19 on behalf of pacemaker-controld.4479: OK
> > > > Dec 18 09:31:30 h18 pacemaker-controld[4479]:  notice: Transition
> > > > 0
> > >
> > > (Complete=31, Pending=0, Fired=0, Skipped=1, Incomplete=3,
> > > Source=/var/lib/pacemaker/pengine/pe-warn-9.bz2): Stopped
> >
> > So here's the delayed stonith confirmation.
> >
> > > > ...
> > > > Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]:  warning:
> > > > Unexpected result
> > > (error) was recorded for start of prm_stonith_sbd on h18 at Dec 18
> > > 09:31:14
> > > 2020
> > > > Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]:  notice:  *
> > > > Recover
> > >
> > > prm_stonith_sbd  ( h18 )
> >
> > Then after successful start another "recovery". Isn't that very odd?
>
> The first start was considered timed out, even though it eventually
> completed successfully, so the device had to be recovered (stop+start)
> due to the start timeout.
>
> >
> > Regards,
> > Ulrich
> --
> Ken Gaillot 
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pending Fencing Actions shown in pcs status

2021-01-07 Thread Reid Wahl
Hi, Steffen. Those attachments don't contain the CIB. They contain the `pcs
config` output. You can get the cib with `pcs cluster cib >
$(hostname).cib.xml`.

Granted, it's possible that this fence action information wouldn't be in
the CIB at all. It might be stored in fencer memory.
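
If it is only in fencer memory, the fencer's view can be inspected with
stonith_admin (a sketch; availability of these options depends on the
Pacemaker version, and --cleanup in particular may not exist in older 1.1
builds):

  # Show the fence action history for all nodes
  stonith_admin --history '*' --verbose

  # On versions that support it, clean up stale entries
  stonith_admin --history '*' --cleanup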

On Thu, Jan 7, 2021 at 12:26 AM  wrote:

> Hi Steffen,
>
> > Here CIB settings attached (pcs config show) for all 3 of my nodes
> > (all 3 seems 100% identical), node03 is the DC.
>
>
> Thank you for the attachment.
>
> What is the scenario when this situation occurs?
> In what steps did the problem appear when fencing was performed (or
> failed)?
>
>
> Best Regards,
> Hideo Yamauchi.
>
>
> - Original Message -
> > From: Steffen Vinther Sørensen 
> > To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to
> open-source clustering welcomed 
> > Cc:
> > Date: 2021/1/7, Thu 17:05
> > Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
> >
> > Hi Hideo,
> >
> > Here CIB settings attached (pcs config show) for all 3 of my nodes
> > (all 3 seems 100% identical), node03 is the DC.
> >
> > Regards
> > Steffen
> >
> > On Thu, Jan 7, 2021 at 8:06 AM  wrote:
> >>
> >>  Hi Steffen,
> >>  Hi Reid,
> >>
> >>  I also checked the Centos source rpm and it seems to include a fix for
> the
> > problem.
> >>
> >>  As Steffen suggested, if you share your CIB settings, I might know
> > something.
> >>
> >>  If this issue is the same as the fix, the display will only be
> displayed on
> > the DC node and will not affect the operation.
> >>  The pending actions shown will remain for a long time, but will not
> have a
> > negative impact on the cluster.
> >>
> >>  Best Regards,
> >>  Hideo Yamauchi.
> >>
> >>
> >>  - Original Message -
> >>  > From: Reid Wahl 
> >>  > To: Cluster Labs - All topics related to open-source clustering
> > welcomed 
> >>  > Cc:
> >>  > Date: 2021/1/7, Thu 15:58
> >>  > Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs
> status
> >>  >
> >>  > It's supposedly fixed in that version.
> >>  >   - https://bugzilla.redhat.com/show_bug.cgi?id=1787749
> >>  >   - https://access.redhat.com/solutions/4713471
> >>  >
> >>  > So you may be hitting a different issue (unless there's a bug in
> > the
> >>  > pcmk 1.1 backport of the fix).
> >>  >
> >>  > I may be a little bit out of my area of knowledge here, but can you
> >>  > share the CIBs from nodes 1 and 3? Maybe Hideo, Klaus, or Ken has
> some
> >>  > insight.
> >>  >
> >>  > On Wed, Jan 6, 2021 at 10:53 PM Steffen Vinther Sørensen
> >>  >  wrote:
> >>  >>
> >>  >>  Hi Hideo,
> >>  >>
> >>  >>  If the fix is not going to make it into the CentOS7 pacemaker
> > version,
> >>  >>  I guess the stable approach to take advantage of it is to build
> > the
> >>  >>  cluster on another OS than CentOS7 ? A little late for that in
> > this
> >>  >>  case though :)
> >>  >>
> >>  >>  Regards
> >>  >>  Steffen
> >>  >>
> >>  >>
> >>  >>
> >>  >>
> >>  >>  On Thu, Jan 7, 2021 at 7:27 AM 
> > wrote:
> >>  >>  >
> >>  >>  > Hi Steffen,
> >>  >>  >
> >>  >>  > The fix pointed out by Reid is affecting it.
> >>  >>  >
> >>  >>  > Since the fencing action requested by the DC node exists
> > only in the
> >>  > DC node, such an event occurs.
> >>  >>  > You will need to take advantage of the modified pacemaker to
> > resolve
> >>  > the issue.
> >>  >>  >
> >>  >>  > Best Regards,
> >>  >>  > Hideo Yamauchi.
> >>  >>  >
> >>  >>  >
> >>  >>  >
> >>  >>  > - Original Message -
> >>  >>  > > From: Reid Wahl 
> >>  >>  > > To: Cluster Labs - All topics related to open-source
> > clustering
> >>  > welcomed 
> >>  >>  > > Cc:
> >>  >>  > > Date: 2021/1/7, Thu 15:07
> >>  >>  > > Su

Re: [ClusterLabs] Best way to create a floating identity file

2021-01-06 Thread Reid Wahl
;1" ]
>
> That relies on the fact that the value will be "1" (or whatever you set
> as active_value) only if the attribute resource is currently active on
> the local node. Otherwise it will be "0" (if the resource previously
> ran on the local node but no longer is) or empty (if the resource never
> ran on the local node).
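
A minimal sketch of that check for a cron script (the attribute name
"my_active" and the script it runs are made up; it assumes an
ocf:pacemaker:attribute resource configured with active_value=1):

  if attrd_updater --query --name=my_active 2>/dev/null \
        | grep -q 'value="1"'; then
      # This node currently hosts the attribute resource
      /usr/local/bin/run-primary-tasks   # hypothetical script
  fi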
>
> > since the cron script is run on both nodes, I need to know how the
> > output can be used to determine which node will run the necessary
> > commands. If the return values are the same regardless of which node
> > I
> > run attrd_updater on, what do I use to differentiate?
> >
> > Unfortunately right now I don't have a test cluster that I can play
> > with things on, only a 'live' one that we had to rush into service
> > with a bare minimum of testing, so I'm loath to play with things on
> > it.
> >
> > Thanks!
> --
> Ken Gaillot 
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pending Fencing Actions shown in pcs status

2021-01-06 Thread Reid Wahl
It's supposedly fixed in that version.
  - https://bugzilla.redhat.com/show_bug.cgi?id=1787749
  - https://access.redhat.com/solutions/4713471

So you may be hitting a different issue (unless there's a bug in the
pcmk 1.1 backport of the fix).

I may be a little bit out of my area of knowledge here, but can you
share the CIBs from nodes 1 and 3? Maybe Hideo, Klaus, or Ken has some
insight.

On Wed, Jan 6, 2021 at 10:53 PM Steffen Vinther Sørensen
 wrote:
>
> Hi Hideo,
>
> If the fix is not going to make it into the CentOS7 pacemaker version,
> I guess the stable approach to take advantage of it is to build the
> cluster on another OS than CentOS7 ? A little late for that in this
> case though :)
>
> Regards
> Steffen
>
>
>
>
> On Thu, Jan 7, 2021 at 7:27 AM  wrote:
> >
> > Hi Steffen,
> >
> > The fix pointed out by Reid is affecting it.
> >
> > Since the fencing action requested by the DC node exists only in the DC 
> > node, such an event occurs.
> > You will need to take advantage of the modified pacemaker to resolve the 
> > issue.
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> >
> >
> > - Original Message -
> > > From: Reid Wahl 
> > > To: Cluster Labs - All topics related to open-source clustering welcomed 
> > > 
> > > Cc:
> > > Date: 2021/1/7, Thu 15:07
> > > Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
> > >
> > > Hi, Steffen. Are your cluster nodes all running the same Pacemaker
> > > versions? This looks like Bug 5401[1], which is fixed by upstream
> > > commit df71a07[2]. I'm a little bit confused about why it only shows
> > > up on one out of three nodes though.
> > >
> > > [1] https://bugs.clusterlabs.org/show_bug.cgi?id=5401
> > > [2] https://github.com/ClusterLabs/pacemaker/commit/df71a07
> > >
> > > On Tue, Jan 5, 2021 at 8:31 AM Steffen Vinther Sørensen
> > >  wrote:
> > >>
> > >>  Hello
> > >>
> > >>  node 1 is showing this in 'pcs status'
> > >>
> > >>  Pending Fencing Actions:
> > >>  * reboot of kvm03-node02.avigol-gcs.dk pending: client=crmd.37819,
> > >>  origin=kvm03-node03.avigol-gcs.dk
> > >>
> > >>  node 2 and node 3 outputs no such thing (node 3 is DC)
> > >>
> > >>  Google is not much help, how to investigate this further and get rid
> > >>  of such terrifying status message ?
> > >>
> > >>  Regards
> > >>  Steffen
> > >>  ___
> > >>  Manage your subscription:
> > >>  https://lists.clusterlabs.org/mailman/listinfo/users
> > >>
> > >>  ClusterLabs home: https://www.clusterlabs.org/
> > >>
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Reid Wahl, RHCA
> > > Senior Software Maintenance Engineer, Red Hat
> > > CEE - Platform Support Delivery - ClusterHA
> > >
> > > ___
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > ClusterLabs home: https://www.clusterlabs.org/
> > >
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Q: warning: new_event_notification (4527-22416-14): Broken pipe (32)

2021-01-06 Thread Reid Wahl
Steffen, did you mean to reply to a different thread? I ask because
there's another active one and you haven't been involved in this one
yet :)

On Wed, Jan 6, 2021 at 11:01 PM Steffen Vinther Sørensen
 wrote:
>
> No I don't set that, and its not showing up in output of 'pcs stonith
> show --full'
>
> On Thu, Jan 7, 2021 at 7:46 AM Reid Wahl  wrote:
> >
> > Do you have pcmk_monitor_timeout set? Ken mentioned that the start
> > operation includes a registration and a status call. There was a bug
> > in which the status call used the pcmk_monitor_timeout (if set), but
> > the start operation that triggered it used the default timeout. So if
> > the pcmk_monitor_timeout was long enough (say, 120 seconds) and the
> > status call did not complete by then, the start operation would time
> > out without receiving an action result. This was fixed in pull
> > 2108[1].
> >
> > Also, note that the error message said, "Node h18 did not send start
> > result (via controller) within 45000ms (action timeout plus
> > cluster-delay)". The start operation began at 09:29:29 and ended at
> > 09:31:14 (1min45sec). That's because the cluster-delay is 60s by
> > default and there was a logging bug[2] at one point: the time value in
> > the log message did not add the cluster-delay.
> >
> > [1] https://github.com/ClusterLabs/pacemaker/pull/2108
> > [2] https://github.com/ClusterLabs/pacemaker/commit/b542a8f
> >
> > On Fri, Dec 18, 2020 at 8:15 AM Ken Gaillot  wrote:
> > >
> > > On Fri, 2020-12-18 at 13:32 +0100, Ulrich Windl wrote:
> > > > > > > Andrei Borzenkov  schrieb am 18.12.2020 um
> > > > > > > 12:17 in
> > > >
> > > > Nachricht :
> > > > > 18.12.2020 12:00, Ulrich Windl пишет:
> > > > > >
> > > > > > Maybe a related question: Do STONITH resources have special
> > > > > > rules, meaning
> > > > > they don't wait for successful fencing?
> > > > >
> > > > > pacemaker resources in CIB do not perform fencing. They only
> > > > > register
> > > > > fencing devices with fenced which does actual job. In particular
> > > > > ...
> > > > >
> > > > > > I saw this between fencing being initiated and fencing being
> > > > > > confirmed (h16
> > > > > was DC, now h18 became DC):
> > > > > >
> > > > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Processing
> > > > > > graph 0
> > > > >
> > > > > (ref=pe_calc-dc-1608280169-21) derived from
> > > > > /var/lib/pacemaker/pengine/pe-warn-9.bz2
> > > > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Requesting
> > > > > > fencing
> > > > >
> > > > > (reboot) of node h16
> > > > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Initiating
> > > > > > start
> > > > >
> > > > > operation prm_stonith_sbd_start_0 locally on h18
> > > > >
> > > > > ... "start" operation on pacemaker stonith resource only registers
> > > > > this
> > > > > device with fenced. It does *not* initiate stonith operation.
> > > >
> > > > Hi!
> > > >
> > > > Thanks, it's quite confusing: "notice: Initiating start operation"
> > > > sounds like
> > > > something is to be started right now; if it's just scheduled,
> > > > "notice: Queueing
> > > > start operation" or "notice: Planning start operation" would be a
> > > > better phrase
> > > > IMHO.
> > >
> > > I think what Andrei is suggesting is that the start is not a fencing
> > > (reboot/off) action, just a registration of the device with the fencer.
> > > The start (registration) itself does happen at this point. Besides
> > > registration, it also involves a status call to the device.
> > >
> > > > >
> > > > > > ...
> > > > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  error: Node h18
> > > > > > did not send
> > > > > start result (via controller) within 45000ms (action timeout plus
> > > > > cluster-delay)
> > > > >
> > > > > I am not sure what happens here. Somehow fenced took very long time
> > > > > to
> > > > > respon

Re: [ClusterLabs] Note on Raid1 RA and attribute force_clones

2020-11-27 Thread Reid Wahl
As a note, I believe the mdraid resource agent is intended to replace
Raid1. But its documentation is the same in the two areas you
mentioned.

I definitely agree that the purpose of OCF_CHECK_LEVEL for this agent
should be documented in the agent's <longdesc>. (OCF_CHECK_LEVEL is documented in
a generic sense in the Pacemaker Explained doc.)

While it could be argued that configuring clustered MD constitutes
"knowing what you're doing" with regard to the force_clones attribute,
it wouldn't hurt to mention clustered MD as a use case where this
attribute is required.
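
For example, a clustered MD setup might look roughly like this (a sketch in
pcs syntax; exact syntax varies with the pcs version, and the device paths,
resource name, and OCF_CHECK_LEVEL value are illustrative):

  pcs resource create cluster-md ocf:heartbeat:Raid1 \
      raidconf=/etc/mdadm.conf raiddev=/dev/md0 force_clones=true \
      op monitor interval=30s OCF_CHECK_LEVEL=10 \
      clone interleave=true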

On Fri, Nov 27, 2020 at 2:25 AM Ulrich Windl
 wrote:
>
> Hi!
>
> Reading the metadata of the Raid1 RA in SLES15 SP2, I see:
> force_clones (boolean, [false]): force ability to run as a clone
> Activating the same md RAID array on multiple nodes at the same time
> will result in data corruption and thus is forbidden by default.
>
> A safe example could be an array that is only named identically across
> all nodes, but is in fact distinct.
>
> Only set this to "true" if you know what you are doing!
> --
>
> As SLES15 features "Clustered MD", you need to set that attribute to use 
> Clustered MD, of course.
> Thus I think the documentation should be updated.
>
> The other thing is the lack of documentation for  $OCF_CHECK_LEVEL
>
> Regards,
> Ulrich
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker alerts node_selector

2020-11-25 Thread Reid Wahl
I created https://github.com/ClusterLabs/pacemaker/pull/2241 to
correct the schema mistake.

On Wed, Nov 25, 2020 at 10:51 PM  wrote:
>
> Hi, thank you for your reply.
>
> I tried it this way:
>
> 
>path="/usr/share/pacemaker/alerts/test_alert.sh">
>   
> hana_node_1
>   
> 
>id="test_alert_1-instance_attributes-HANASID"/>
>id="test_alert_1-instance_attributes-AVZone"/>
> 
>  value="/usr/share/pacemaker/alerts/test_alert.sh"/>
>   
>   
> 
>
>
> During the save the select is been reset to
>   
>   
>  

The schema shows that <select_nodes> has to be empty.


> Do I need to specify, in addition to select_nodes, the section <select_attributes>?

The <select_attributes> element configures the agent to receive alerts
when a node attribute changes.

For a bit more detail on how these <select> options work, see:
  - 
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_alert_filters

So it doesn't seem like this would be the way to configure alerts for
a particular node, which is what you've said you want to do.

I'm not very familiar with alerts off the top of my head, so I would
have to research this further unless someone else jumps in to answer
first. However, based on a cursory reading of the doc, it looks like
the <select> options do not provide a way to filter by a
particular node. The <select_attributes> element does allow you to
filter by node **attribute**. But the <select_nodes> element simply
filters "node events" in general, rather than filtering by node.
(Anyone correct me if I'm wrong.)
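
For what it's worth, the filter syntax from that doc looks roughly like the
following (a sketch based on the "Alert Filters" section; the IDs, recipient
value, and the "standby" attribute are illustrative):

  <alert id="test_alert_1" path="/usr/share/pacemaker/alerts/test_alert.sh">
    <select>
      <select_nodes />
      <select_attributes>
        <attribute id="alert-standby" name="standby" />
      </select_attributes>
    </select>
    <recipient id="test_alert_1-recipient" value="some-value" />
  </alert>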

>
> Thank you, Alfred
>
>
> -Ursprüngliche Nachricht-
> Von: Users  Im Auftrag von Reid Wahl
> Gesendet: Donnerstag, 26. November 2020 05:30
> An: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Betreff: Re: [ClusterLabs] pacemaker alerts node_selector
>
> What version of Pacemaker are you using, and how does it behave?
>
> Depending on the error/misbehavior you're experiencing, this might have been 
> me. Looks like in commit bd451763[1], I copied the alerts-2.9.rng[2] schema 
> instead of the alerts-2.10.rng[3] schema.
>
> [1] https://github.com/ClusterLabs/pacemaker/commit/bd451763
> [2] https://github.com/ClusterLabs/pacemaker/blob/master/xml/alerts-2.9.rng
> [3] https://github.com/ClusterLabs/pacemaker/blob/master/xml/alerts-2.10.rng
>
> On Wed, Nov 25, 2020 at 9:31 AM  wrote:
> >
> > Hi, I would like to trigger an external script, if something happens on a 
> > specific node.
> >
> >
> >
> > In the documentation of alerts, I can see <select_nodes>, but whatever I
> > put into the XML, it’s not working…..
> >
> >
> >
> > configuration>
> >
> > 
> >
> > 
> >
> > 
> >
> >   
> >
> >   
> >
> > 
> >
> >  > value="someu...@example.com"/>
> >
> > 
> >
> > 
> >
> > 
> >
> Can anybody send me an example of the right syntax?
> >
> >
> >
> > Thank you very much……
> >
> >
> >
> > Best regards, Alfred
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
>
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat CEE - Platform Support Delivery 
> - ClusterHA
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker alerts node_selector

2020-11-25 Thread Reid Wahl
What version of Pacemaker are you using, and how does it behave?

Depending on the error/misbehavior you're experiencing, this might
have been me. Looks like in commit bd451763[1], I copied the
alerts-2.9.rng[2] schema instead of the alerts-2.10.rng[3] schema.

[1] https://github.com/ClusterLabs/pacemaker/commit/bd451763
[2] https://github.com/ClusterLabs/pacemaker/blob/master/xml/alerts-2.9.rng
[3] https://github.com/ClusterLabs/pacemaker/blob/master/xml/alerts-2.10.rng

On Wed, Nov 25, 2020 at 9:31 AM  wrote:
>
> Hi, I would like to trigger an external script, if something happens on a 
> specific node.
>
>
>
> In the documentation of alerts, I can see <select_nodes>, but whatever I put
> into the XML, it’s not working…..
>
>
>
> configuration>
>
> 
>
> 
>
> 
>
>   
>
>   
>
> 
>
> 
>
> 
>
> 
>
> 
>
> Can anybody send me an example of the right syntax?
>
>
>
> Thank you very much……
>
>
>
> Best regards, Alfred
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Preferred node for a service (not constrained)

2020-12-03 Thread Reid Wahl
On Thu, Dec 3, 2020 at 5:09 AM Strahil Nikolov  wrote:
>
> The problem with INFINITY is that the moment the node is back, there
> will be a second failover. This is bad for bulky DBs that take more than
> 30 min to power down/up (15 min down, 15 min up).

That's true for the general case. In Petr's case, he has explicitly
said that he wants that behavior.

> Best Regards,
> Strahil Nikolov
>
>
>
>
>
>
> В четвъртък, 3 декември 2020 г., 10:32:18 Гринуич+2, Andrei Borzenkov 
>  написа:
>
>
>
>
>
> On Thu, Dec 3, 2020 at 11:11 AM Ulrich Windl
>  wrote:
> >
> > >>> Strahil Nikolov  schrieb am 02.12.2020 um 22:42 
> > >>> in
> > Nachricht <311137659.2419591.1606945369...@mail.yahoo.com>:
> > > Constraints' values vary from:
> > > INFINITY, which equals a score of 1000000
> > > to:
> > > -INFINITY, which equals a score of -1000000
> > >
> > > You can usually set a positive score on the prefered node which is bigger
> > > than on the other node.
> > >
> > > For example, setting a location constraint like this will prefer node1:
> > > node1 - score 10000
> > > node2 - score 5000
> > >
> >
> > The bad thing with those numbers is that you are never sure which number to
> > use:
> > Is 50 enough? 100 maybe? 1000? 10000? 100000?
> >
>
> +INFINITY score guarantees that resource will always be active on
> preferred node as long as this node is available but allow resource to
> be started on another node if preferred node is down.
> ...
> > > I believe I used the value infinity, so it will prefer the 2nd host over
> > > the 1st if at all possible.  My 'pcs constraint':
> > >
>
> And this was the very first answer to this question.
>
>
> > > [root@centos-vsa2 ~]# pcs constraint
> > > Location Constraints:
> > >  Resource: group-zfs
> > >Enabled on: centos-vsa2 (score:INFINITY)
> > > Ordering Constraints:
> > > Colocation Constraints:
> > > Ticket Constraints:
> > >
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Q: high-priority messages from DLM?

2020-12-04 Thread Reid Wahl
On Fri, Dec 4, 2020 at 10:32 AM Reid Wahl  wrote:
>
> I'm inclined to agree, although maybe there's a good reason. These get
> logged with KERN_ERR priority.

I hit Enter and that email sent instead of line-breaking... anyway.

https://github.com/torvalds/linux/blob/master/fs/dlm/dlm_internal.h#L61-L62
https://github.com/torvalds/linux/blob/master/fs/dlm/lowcomms.c#L1250

>
> On Fri, Dec 4, 2020 at 5:32 AM Ulrich Windl
>  wrote:
> >
> > Hi!
> >
> > Logging into a server via iDRAC, I see several messages from "dlm:" at the 
> > console screen. My obvious explanation is that they are on the screen, 
> > because journald (SLES15 SP2) treats them as high-priority messages that 
> > should go to the screen. However IMHO they are not:
> >
> > [83035.82] dlm: closing connection to node 118
> > [84756.045008] dlm: closing connection to node 118
> > [160906.211673] dlm: Using SCTP for communications
> > [160906.239357] dlm: connecting to 118
> > [160906.239807] dlm: connecting to 116
> > [160906.241432] dlm: connected to 116
> > [160906.241448] dlm: connected to 118
> > [174464.522831] dlm: closing connection to node 116
> > [174670.058912] dlm: connecting to 116
> > [174670.061373] dlm: connected to 116
> > [175561.816821] dlm: closing connection to node 118
> > [175617.654995] dlm: connecting to 118
> > [175617.665153] dlm: connected to 118
> > [175695.310971] dlm: closing connection to node 118
> > [175695.311039] dlm: closing connection to node 116
> > [175695.311084] dlm: closing connection to node 119
> > [175759.045564] dlm: Using SCTP for communications
> > [175759.052075] dlm: connecting to 118
> > [175759.052623] dlm: connecting to 116
> > [175759.052917] dlm: connected to 116
> > [175759.053847] dlm: connected to 118
> > [432217.637844] dlm: closing connection to node 119
> > [432217.637912] dlm: closing connection to node 118
> > [432217.637953] dlm: closing connection to node 116
> > [438872.495086] dlm: Using SCTP for communications
> > [438872.499832] dlm: connecting to 118
> > [438872.500340] dlm: connecting to 116
> > [438872.500600] dlm: connected to 116
> > [438872.500642] dlm: connected to 118
> > [779424.346316] dlm: closing connection to node 116
> > [780017.597844] dlm: connecting to 116
> > [780017.616321] dlm: connected to 116
> > [783118.476060] dlm: closing connection to node 116
> > [783318.744036] dlm: connecting to 116
> > [783318.756923] dlm: connected to 116
> > [784893.793366] dlm: closing connection to node 118
> > [785082.619709] dlm: connecting to 118
> > [785082.633263] dlm: connected to 118
> >
> > Regards,
> > Ulrich
> >
> >
> >
> > _______
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> >
>
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Q: high-priority messages from DLM?

2020-12-04 Thread Reid Wahl
I'm inclined to agree, although maybe there's a good reason. These get
logged with KERN_ERR priority.

On Fri, Dec 4, 2020 at 5:32 AM Ulrich Windl
 wrote:
>
> Hi!
>
> Logging into a server via iDRAC, I see several messages from "dlm:" at the 
> console screen. My obvious explanation is that they are on the screen, 
> because journald (SLES15 SP2) treats them as high-priority messages that 
> should go to the screen. However IMHO they are not:
>
> [83035.82] dlm: closing connection to node 118
> [84756.045008] dlm: closing connection to node 118
> [160906.211673] dlm: Using SCTP for communications
> [160906.239357] dlm: connecting to 118
> [160906.239807] dlm: connecting to 116
> [160906.241432] dlm: connected to 116
> [160906.241448] dlm: connected to 118
> [174464.522831] dlm: closing connection to node 116
> [174670.058912] dlm: connecting to 116
> [174670.061373] dlm: connected to 116
> [175561.816821] dlm: closing connection to node 118
> [175617.654995] dlm: connecting to 118
> [175617.665153] dlm: connected to 118
> [175695.310971] dlm: closing connection to node 118
> [175695.311039] dlm: closing connection to node 116
> [175695.311084] dlm: closing connection to node 119
> [175759.045564] dlm: Using SCTP for communications
> [175759.052075] dlm: connecting to 118
> [175759.052623] dlm: connecting to 116
> [175759.052917] dlm: connected to 116
> [175759.053847] dlm: connected to 118
> [432217.637844] dlm: closing connection to node 119
> [432217.637912] dlm: closing connection to node 118
> [432217.637953] dlm: closing connection to node 116
> [438872.495086] dlm: Using SCTP for communications
> [438872.499832] dlm: connecting to 118
> [438872.500340] dlm: connecting to 116
> [438872.500600] dlm: connected to 116
> [438872.500642] dlm: connected to 118
> [779424.346316] dlm: closing connection to node 116
> [780017.597844] dlm: connecting to 116
> [780017.616321] dlm: connected to 116
> [783118.476060] dlm: closing connection to node 116
> [783318.744036] dlm: connecting to 116
> [783318.756923] dlm: connected to 116
> [784893.793366] dlm: closing connection to node 118
> [785082.619709] dlm: connecting to 118
> [785082.633263] dlm: connected to 118
>
> Regards,
> Ulrich
>
>
>
> _______
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] All clones are stopped when one of them fails

2020-12-07 Thread Reid Wahl
Can you provide the state4.xml file that you're using? I'm unable to
reproduce this issue by causing the clone instance to fail on one node.

Might need some logs as well.

On Mon, Dec 7, 2020 at 10:40 PM Pavel Levshin  wrote:
>
> Hello.
>
>
> Despite many years of Pacemaker use, it never stops fooling me...
>
>
> This time, I have faced a trivial problem. In my new setup, the cluster 
> consists of several identical nodes. A clone resource (vg.sanlock) is started 
> on every node, ensuring it has access to SAN storage. Almost all other 
> resources are colocated and ordered after vg.sanlock.
>
>
> This day, I've started a node, and vg.sanlock has failed to start. Then the 
> cluster has decided to stop all the clone instances "due to node 
> availability", taking down all other resources via dependencies. This seems 
> illogical to me. In the case of a failing clone, I would prefer to see it 
> stopping on one node only. How do I do it properly?
>
>
> I've tried this config with Pacemaker 2.0.3 and 1.1.16, the behaviour stays 
> the same.
>
>
> Reduced test config here:
>
>
> pcs cluster auth test-pcmk0 test-pcmk1 <>/dev/tty
>
> pcs cluster setup --name test-pcmk test-pcmk0 test-pcmk1 --transport udpu \
>
>   --auto_tie_breaker 1
>
> pcs cluster start --all --wait=60
>
> pcs cluster cib tmp-cib.xml
>
> cp tmp-cib.xml tmp-cib.xml.deltasrc
>
> pcs -f tmp-cib.xml property set stonith-enabled=false
>
> pcs -f tmp-cib.xml resource defaults resource-stickiness=100
>
> pcs -f tmp-cib.xml resource create vg.sanlock ocf:pacemaker:Dummy \
>
>   op monitor interval=10 timeout=20 start interval=0s stop interval=0s \
>
>   timeout=20
>
> pcs -f tmp-cib.xml resource clone vg.sanlock interleave=true
>
> pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.xml.deltasrc
>
>
>
> And here goes cluster reaction to the failure:
>
>
> # crm_simulate -x state4.xml -S
>
>
>
> Current cluster status:
>
> Online: [ test-pcmk0 test-pcmk1 ]
>
>
>
> Clone Set: vg.sanlock-clone [vg.sanlock]
>
>  vg.sanlock  (ocf::pacemaker:Dummy): FAILED test-pcmk0
>
>  Started: [ test-pcmk1 ]
>
>
>
> Transition Summary:
>
> * Stop   vg.sanlock:0 ( test-pcmk1 )   due to node availability
>
> * Stop   vg.sanlock:1 ( test-pcmk0 )   due to node availability
>
>
>
> Executing cluster transition:
>
> * Pseudo action:   vg.sanlock-clone_stop_0
>
> * Resource action: vg.sanlock   stop on test-pcmk1
>
> * Resource action: vg.sanlock   stop on test-pcmk0
>
> * Pseudo action:   vg.sanlock-clone_stopped_0
>
> * Pseudo action:   all_stopped
>
>
>
> Revised cluster status:
>
> Online: [ test-pcmk0 test-pcmk1 ]
>
>
>
> Clone Set: vg.sanlock-clone [vg.sanlock]
>
>  Stopped: [ test-pcmk0 test-pcmk1 ]
>
>
> As a sidenote, if I make those clones globally-unique, they seem to behave 
> properly. But nowhere have I found a reference to this solution. In general, 
> globally-unique clones are referred to only where resource agents make 
> distinction between clone instances. This is not the case.
>
>
> --
>
> Thanks,
>
> Pavel
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Q: high-priority messages from DLM?

2020-12-08 Thread Reid Wahl
Yeah I agree with you on all points, unless the author had a reason
for those decisions 15 years ago. Most of us do our best to avoid
touching DLM :)

If anyone wants to make changes to the DLM logging macros, I'm in
favor of it. I'm just not gonna lead the charge on it.

On Tue, Dec 8, 2020 at 12:01 AM Ulrich Windl
 wrote:
>
> >>> Reid Wahl  schrieb am 04.12.2020 um 19:33 in Nachricht
> :
> > On Fri, Dec 4, 2020 at 10:32 AM Reid Wahl  wrote:
> >>
> >> I'm inclined to agree, although maybe there's a good reason. These get
> >> logged with KERN_ERR priority.
> >
> > I hit Enter and that email sent instead of line‑breaking... anyway.
> >
> > https://github.com/torvalds/linux/blob/master/fs/dlm/dlm_internal.h#L61‑L62
>
> > https://github.com/torvalds/linux/blob/master/fs/dlm/lowcomms.c#L1250
>
> So everything log_print() outputs is an error? IMHO log_print is missing the
> priority/severity parameter...
> Comparing log_print(), log_error() and log_debug(), I think that logging code
> could benefit from some refactoring.
>
> Back on the subject:
> I think "Using SCTP for communications" is informational, not error, just as
> "closing connection to node 118" is probably notice, while "connecting to" /
> "connected to" is probably info or notice, too.
>
> Regards,
> Ulrich
>
> >
> >>
> >> On Fri, Dec 4, 2020 at 5:32 AM Ulrich Windl
> >>  wrote:
> >> >
> >> > Hi!
> >> >
> >> > Logging into a server via iDRAC, I see several messages from "dlm:" at
> the
> > console screen. My obvious explanation is that they are on the screen,
> > because journald (SLES15 SP2) treats them as high-priority messages that
> > should go to the screen. However IMHO they are not:
> >> >
> >> > [83035.82] dlm: closing connection to node 118
> >> > [84756.045008] dlm: closing connection to node 118
> >> > [160906.211673] dlm: Using SCTP for communications
> >> > [160906.239357] dlm: connecting to 118
> >> > [160906.239807] dlm: connecting to 116
> >> > [160906.241432] dlm: connected to 116
> >> > [160906.241448] dlm: connected to 118
> >> > [174464.522831] dlm: closing connection to node 116
> >> > [174670.058912] dlm: connecting to 116
> >> > [174670.061373] dlm: connected to 116
> >> > [175561.816821] dlm: closing connection to node 118
> >> > [175617.654995] dlm: connecting to 118
> >> > [175617.665153] dlm: connected to 118
> >> > [175695.310971] dlm: closing connection to node 118
> >> > [175695.311039] dlm: closing connection to node 116
> >> > [175695.311084] dlm: closing connection to node 119
> >> > [175759.045564] dlm: Using SCTP for communications
> >> > [175759.052075] dlm: connecting to 118
> >> > [175759.052623] dlm: connecting to 116
> >> > [175759.052917] dlm: connected to 116
> >> > [175759.053847] dlm: connected to 118
> >> > [432217.637844] dlm: closing connection to node 119
> >> > [432217.637912] dlm: closing connection to node 118
> >> > [432217.637953] dlm: closing connection to node 116
> >> > [438872.495086] dlm: Using SCTP for communications
> >> > [438872.499832] dlm: connecting to 118
> >> > [438872.500340] dlm: connecting to 116
> >> > [438872.500600] dlm: connected to 116
> >> > [438872.500642] dlm: connected to 118
> >> > [779424.346316] dlm: closing connection to node 116
> >> > [780017.597844] dlm: connecting to 116
> >> > [780017.616321] dlm: connected to 116
> >> > [783118.476060] dlm: closing connection to node 116
> >> > [783318.744036] dlm: connecting to 116
> >> > [783318.756923] dlm: connected to 116
> >> > [784893.793366] dlm: closing connection to node 118
> >> > [785082.619709] dlm: connecting to 118
> >> > [785082.633263] dlm: connected to 118
> >> >
> >> > Regards,
> >> > Ulrich
> >> >
> >> >
> >> >
> >> > ___
> >> > Manage your subscription:
> >> > https://lists.clusterlabs.org/mailman/listinfo/users
> >> >
> >> > ClusterLabs home: https://www.clusterlabs.org/
> >> >
> >>
> >>
> >> ‑‑
> >> Regards,
> >>
> >> Reid Wahl, RHCA
> >> Senior Software Maintenance Engineer, Red Hat
> >> CEE ‑ Platform Support Delivery ‑ ClusterHA
> >
> >
> >
> > ‑‑
> > Regards,
> >
> > Reid Wahl, RHCA
> > Senior Software Maintenance Engineer, Red Hat
> > CEE ‑ Platform Support Delivery ‑ ClusterHA
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] All clones are stopped when one of them fails

2020-12-10 Thread Reid Wahl
On Thu, Dec 10, 2020 at 1:13 AM Reid Wahl  wrote:
>
> On Thu, Dec 10, 2020 at 1:08 AM Reid Wahl  wrote:
> >
> > Thanks. I see it's only reproducible with stonith-enabled=false.
> > That's the step I was skipping previously, as I always have stonith
> > enabled in my clusters.
> >
> > I'm not sure whether that's expected behavior for some reason when
> > stonith is disabled. Maybe someone else (e.g., Ken) can weigh in.
>
> Never mind. This was a mistake on my part: I didn't re-add the stonith
> **device** configuration when I re-enabled stonith.
>
> So the behavior is the same regardless of whether stonith is enabled
> or not. I attribute it to the OCF_ERR_CONFIGURED error.
>
> Why exactly is this behavior unexpected, from your point of view?
>
> Ref:
>   - 
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Administration/#_how_are_ocf_return_codes_interpreted

Going back to your original email, I think I understand. What type of
resource is vg.sanlock in your main cluster? I presume that it isn't
an ocf:pacemaker:Dummy resource like it is in the state4.xml file.

It seems that your real concern is with the behavior of one or more
resource agents. When a resource agent returns OCF_ERR_CONFIGURED,
Pacemaker stops all instances of that resource and prevents it from
starting again. However, the place to address it is in the resource
agent. Pacemaker is doing exactly what the resource agent is telling
it to do.
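
To illustrate the difference (a sketch only, not the real agent's code; the
storage check helper is made up, and it assumes the standard ocf-shellfuncs
have been sourced):

  vg_sanlock_start() {
      if ! storage_is_visible; then   # hypothetical node-local check
          ocf_exit_reason "SAN storage not visible on this node"
          # OCF_ERR_GENERIC is a soft failure: the cluster can retry or
          # recover the instance elsewhere. OCF_ERR_CONFIGURED tells
          # Pacemaker the resource itself is misconfigured and must not
          # run anywhere.
          return $OCF_ERR_GENERIC
      fi
      # ... normal start logic ...
      return $OCF_SUCCESS
  }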

> > I also noticed that the state4.xml file has a return code of 6 for the
> > resource's start operation. That's an OCF_ERR_CONFIGURED, which is a
> > fatal error. At least for primitive resources, this type of error
> > prevents the resource from starting anywhere. So I'm somewhat
> > surprised that the clone instances don't stop on all nodes even when
> > fencing **is** enabled.
> >
> >
> > Without stonith:
> >
> > Allocation scores:
> > pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node1: 
> > -INFINITY
> > pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node2: 
> > -INFINITY
> > pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
> > pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
> > pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
> > pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
> > pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
> > pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
> > pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
> > pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
> >
> > Transition Summary:
> >  * Stop   vg.bv_sanlock:0 ( node2 )   due to node availability
> >  * Stop   vg.bv_sanlock:1 ( node1 )   due to node availability
> >
> > Executing cluster transition:
> >  * Pseudo action:   vg.bv_sanlock-clone_stop_0
> >  * Resource action: vg.bv_sanlock   stop on node2
> >  * Resource action: vg.bv_sanlock   stop on node1
> >  * Pseudo action:   vg.bv_sanlock-clone_stopped_0
> >
> >
> >
> > With stonith:
> >
> > Allocation scores:
> > pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node1: 
> > -INFINITY
> > pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node2: 
> > -INFINITY
> > pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
> > pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
> > pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
> > pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
> > pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
> > pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
> > pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
> > pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
> >
> > Transition Summary:
> >
> > Executing cluster transition:
> >
> > On Wed, Dec 9, 2020 at 10:33 PM Pavel Levshin  wrote:
> > >
> > >
> > > See the file attached. This one has been produced and tested with
> > > pacemaker 1.1.16 (RHEL 7).
> > >
> > >
> > > --
> > >
> > > Pavel
> > >
> > >
> > > 08.12.2020 10:14, Reid Wahl :
> > > > Can you provide the state4.xml file that you're using? I'm unable to
> > > > reproduc

Re: [ClusterLabs] All clones are stopped when one of them fails

2020-12-10 Thread Reid Wahl
Thanks. I see it's only reproducible with stonith-enabled=false.
That's the step I was skipping previously, as I always have stonith
enabled in my clusters.

I'm not sure whether that's expected behavior for some reason when
stonith is disabled. Maybe someone else (e.g., Ken) can weigh in.

I also noticed that the state4.xml file has a return code of 6 for the
resource's start operation. That's an OCF_ERR_CONFIGURED, which is a
fatal error. At least for primitive resources, this type of error
prevents the resource from starting anywhere. So I'm somewhat
surprised that the clone instances don't stop on all nodes even when
fencing **is** enabled.


Without stonith:

Allocation scores:
pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node1: -INFINITY
pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node2: -INFINITY
pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY

Transition Summary:
 * Stop   vg.bv_sanlock:0 ( node2 )   due to node availability
 * Stop   vg.bv_sanlock:1 ( node1 )   due to node availability

Executing cluster transition:
 * Pseudo action:   vg.bv_sanlock-clone_stop_0
 * Resource action: vg.bv_sanlock   stop on node2
 * Resource action: vg.bv_sanlock   stop on node1
 * Pseudo action:   vg.bv_sanlock-clone_stopped_0



With stonith:

Allocation scores:
pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node1: -INFINITY
pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node2: -INFINITY
pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY

Transition Summary:

Executing cluster transition:

On Wed, Dec 9, 2020 at 10:33 PM Pavel Levshin  wrote:
>
>
> See the file attached. This one has been produced and tested with
> pacemaker 1.1.16 (RHEL 7).
>
>
> --
>
> Pavel
>
>
> 08.12.2020 10:14, Reid Wahl :
> > Can you provide the state4.xml file that you're using? I'm unable to
> > > reproduce this issue by causing the clone instance to fail on one node.
> >
> > Might need some logs as well.
> >
> > On Mon, Dec 7, 2020 at 10:40 PM Pavel Levshin  wrote:
> >> Hello.
> >>
> >>
> >> Despite many years of Pacemaker use, it never stops fooling me...
> >>
> >>
> >> This time, I have faced a trivial problem. In my new setup, the cluster 
> >> consists of several identical nodes. A clone resource (vg.sanlock) is 
> >> started on every node, ensuring it has access to SAN storage. Almost all 
> >> other resources are colocated and ordered after vg.sanlock.
> >>
> >>
> >> This day, I've started a node, and vg.sanlock has failed to start. Then 
> >> the cluster has decided to stop all the clone instances "due to node 
> >> availability", taking down all other resources by dependencies. This 
> >> seems illogical to me. In the case of a failing clone, I would prefer to 
> >> see it stopping on one node only. How do I do it properly?
> >>
> >>
> >> I've tried this config with Pacemaker 2.0.3 and 1.1.16, the behaviour 
> >> stays the same.
> >>
> >>
> >> Reduced test config here:
> >>
> >>
> >> pcs cluster auth test-pcmk0 test-pcmk1 <>/dev/tty
> >>
> >> pcs cluster setup --name test-pcmk test-pcmk0 test-pcmk1 --transport udpu \
> >>
> >>--auto_tie_breaker 1
> >>
> >> pcs cluster start --all --wait=60
> >>
> >> pcs cluster cib tmp-cib.xml
> >>
> >> cp tmp-cib.xml tmp-cib.xml.deltasrc
> >>
> >> pcs -f tmp-cib.xml property set stonith-enabled=false
> >>
> >> pcs -f tmp-cib.xml resource defaults resource-stick

Re: [ClusterLabs] All clones are stopped when one of them fails

2020-12-10 Thread Reid Wahl
On Thu, Dec 10, 2020 at 1:08 AM Reid Wahl  wrote:
>
> Thanks. I see it's only reproducible with stonith-enabled=false.
> That's the step I was skipping previously, as I always have stonith
> enabled in my clusters.
>
> I'm not sure whether that's expected behavior for some reason when
> stonith is disabled. Maybe someone else (e.g., Ken) can weigh in.

Never mind. This was a mistake on my part: I didn't re-add the stonith
**device** configuration when I re-enabled stonith.

So the behavior is the same regardless of whether stonith is enabled
or not. I attribute it to the OCF_ERR_CONFIGURED error.

Why exactly is this behavior unexpected, from your point of view?

Ref:
  - 
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Administration/#_how_are_ocf_return_codes_interpreted


> I also noticed that the state4.xml file has a return code of 6 for the
> resource's start operation. That's an OCF_ERR_CONFIGURED, which is a
> fatal error. At least for primitive resources, this type of error
> prevents the resource from starting anywhere. So I'm somewhat
> surprised that the clone instances don't stop on all nodes even when
> fencing **is** enabled.
>
>
> Without stonith:
>
> Allocation scores:
> pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node1: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node2: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
>
> Transition Summary:
>  * Stop   vg.bv_sanlock:0 ( node2 )   due to node availability
>  * Stop   vg.bv_sanlock:1 ( node1 )   due to node availability
>
> Executing cluster transition:
>  * Pseudo action:   vg.bv_sanlock-clone_stop_0
>  * Resource action: vg.bv_sanlock   stop on node2
>  * Resource action: vg.bv_sanlock   stop on node1
>  * Pseudo action:   vg.bv_sanlock-clone_stopped_0
>
>
>
> With stonith:
>
> Allocation scores:
> pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node1: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node2: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
>
> Transition Summary:
>
> Executing cluster transition:
>
> On Wed, Dec 9, 2020 at 10:33 PM Pavel Levshin  wrote:
> >
> >
> > See the file attached. This one has been produced and tested with
> > pacemaker 1.1.16 (RHEL 7).
> >
> >
> > --
> >
> > Pavel
> >
> >
> > 08.12.2020 10:14, Reid Wahl :
> > > Can you provide the state4.xml file that you're using? I'm unable to
> > > reproduce this issue by causing the clone instance to fail on one node.
> > >
> > > Might need some logs as well.
> > >
> > > On Mon, Dec 7, 2020 at 10:40 PM Pavel Levshin  wrote:
> > >> Hello.
> > >>
> > >>
> > >> Despite many years of Pacemaker use, it never stops fooling me...
> > >>
> > >>
> > >> This time, I have faced a trivial problem. In my new setup, the cluster 
> > >> consists of several identical nodes. A clone resource (vg.sanlock) is 
> > >> started on every node, ensuring it has access to SAN storage. Almost all 
> > >> other resources are colocated and ordered after vg.sanlock.
> > >>
> > >>
> > >> This day, I've started a node, and vg.sanlock has failed to start. Then 
> > >> the cluster has decided to stop all the clone instances "due to node 
> > >> availability", taking down all other resources by dependencies. This 
> > >> s

Re: [ClusterLabs] All clones are stopped when one of them fails

2020-12-10 Thread Reid Wahl
On Thursday, December 10, 2020, Pavel Levshin  wrote:
>
> You are absolutely right about the RA return code. It is the LVM-activate
plugin (as supplied with resource-agents-4.1.1 in CentOS 8) with a slight
modification to allow sanlock instead of DLM, which is missing from CentOS
8.
>
>
> So, this plugin erroneously returns OCF_ERR_CONFIGURED in many cases when
there is a problem with the local configuration on the node. In my case, it
should return OCF_ERR_INSTALLED instead. Many thanks for the analysis!

I agree. I became aware of this issue about a week and a half ago, even
though it's always existed. Since then we've (Red Hat) received about two
additional reports of it, including yours.

We're looking into some potential improvements. I'll try to remember to
send BZ links when I'm at my computer again.
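
To make the distinction concrete, here is a minimal sketch of how an agent
would signal each case (illustrative only, not the actual LVM-activate code,
and assuming the usual ocf-shellfuncs are sourced so the OCF_ERR_* variables
are defined):

    # node-local problem (e.g. a required tool is missing): only this node is avoided
    command -v lvm >/dev/null 2>&1 || exit $OCF_ERR_INSTALLED    # rc 5
    # fatal misconfiguration of the resource itself: it is banned on every node
    [ -n "$OCF_RESKEY_vgname" ] || exit $OCF_ERR_CONFIGURED      # rc 6
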
>
>
> --
>
> Pavel
>
>
> 10.12.2020 12:21, Reid Wahl:
>>
>> On Thu, Dec 10, 2020 at 1:13 AM Reid Wahl  wrote:
>>>
>>> On Thu, Dec 10, 2020 at 1:08 AM Reid Wahl  wrote:
>>>>
>>>> Thanks. I see it's only reproducible with stonith-enabled=false.
>>>> That's the step I was skipping previously, as I always have stonith
>>>> enabled in my clusters.
>>>>
>>>> I'm not sure whether that's expected behavior for some reason when
>>>> stonith is disabled. Maybe someone else (e.g., Ken) can weigh in.
>>>
>>> Never mind. This was a mistake on my part: I didn't re-add the stonith
>>> **device** configuration when I re-enabled stonith.
>>>
>>> So the behavior is the same regardless of whether stonith is enabled
>>> or not. I attribute it to the OCF_ERR_CONFIGURED error.
>>>
>>> Why exactly is this behavior unexpected, from your point of view?
>>>
>>> Ref:
>>>-
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Administration/#_how_are_ocf_return_codes_interpreted
>>
>> Going back to your original email, I think I understand. What type of
>> resource is vg.sanlock in your main cluster? I presume that it isn't
>> an ocf:pacemaker:Dummy resource like it is in the state4.xml file.
>>
>> It seems that your real concern is with the behavior of one or more
>> resource agents. When a resource agent returns OCF_ERR_CONFIGURED,
>> Pacemaker stops all instances of that resource and prevents it from
>> starting again. However, the place to address it is in the resource
>> agent. Pacemaker is doing exactly what the resource agent is telling
>> it to do.
>>
>>>> I also noticed that the state4.xml file has a return code of 6 for the
>>>> resource's start operation. That's an OCF_ERR_CONFIGURED, which is a
>>>> fatal error. At least for primitive resources, this type of error
>>>> prevents the resource from starting anywhere. So I'm somewhat
>>>> surprised that the clone instances don't stop on all nodes even when
>>>> fencing **is** enabled.
>>>>
>>>>
>>>> Without stonith:
>>>>
>>>> Allocation scores:
>>>> pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node1:
-INFINITY
>>>> pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node2:
-INFINITY
>>>> pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node1:
-INFINITY
>>>> pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node2:
-INFINITY
>>>> pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node1:
-INFINITY
>>>> pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node2:
-INFINITY
>>>> pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node1:
-INFINITY
>>>> pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node2:
-INFINITY
>>>> pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node1:
-INFINITY
>>>> pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node2:
-INFINITY
>>>>
>>>> Transition Summary:
>>>>   * Stop   vg.bv_sanlock:0 ( node2 )   due to node availability
>>>>   * Stop   vg.bv_sanlock:1 ( node1 )   due to node availability
>>>>
>>>> Executing cluster transition:
>>>>   * Pseudo action:   vg.bv_sanlock-clone_stop_0
>>>>   * Resource action: vg.bv_sanlock   stop on node2
>>>>   * Resource action: vg.bv_sanlock   stop on node1
>>>>   * Pseudo action:   vg.bv_sanlock-clone_stopped_0
>>>>
>>>>
>>>>
>>>> With stonith:
>>>>
>>>> Allocation scores:
>>>> pcmk__clone_allocate: vg.bv_sanlock-clone alloc

Re: [ClusterLabs] Question on restart of resource during fail over

2020-12-02 Thread Reid Wahl
How did you resolve the issue? I see a problem in the CIB, and it may
be related to the issue you encountered. Even if not, it may cause
other issues later.

You have the following resource group (the XML itself did not survive the
archive; its members, in order, are ClusterIP, halvmd, clxfs, db2inst):

You have the following colocation constraint set, colocation_set_dthdcs
(XML not preserved by the archive; its resource_set lists db2inst, halvmd,
clxfs, ClusterIP):

The group says "place ClusterIP, then place halvmd, then place clxfs,
then place db2inst".
The constraint set says "place db2inst, then place halvmd, then place
clxfs, then place ClusterIP"[1].

A resource group is already an implicit set of ordering and colocation
constraints[2]. If you're happy with the order configured in the
resource group, then you should remove the colocation_set_dthdcs
constraint.

[1] Example 5.15. Equivalent colocation chain expressed using
resource_set 
(https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#idm46061107170640)
[2] ⁠10.1. Groups - A Syntactic Shortcut
(https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#group-resources)
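
If you go that route, a hedged one-liner for the cleanup (using the
constraint ID shown above):

    pcs constraint remove colocation_set_dthdcs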

On Wed, Dec 2, 2020 at 4:01 AM Harishkumar Pathangay
 wrote:
>
> Hi,
>
> I realized it can be used in standard mode only after you pointed that out.
>
> Anyway, writing a custom agent always gives me a good understanding of the
> resource start/stop/monitor actions, etc.
>
> My custom agent still has a lot of “hard coded” values, but it is meant for 
> studying and understanding purposes rather than to put in a production 
> machine.
>
>
>
> Please find attachments.
>
>
>
> Thanks,
>
> Harish P
>
>
>
> Sent from Mail for Windows 10
>
>
>
> From: Reid Wahl
> Sent: 02 December 2020 15:55
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Question on restart of resource during fail over
>
>
>
> On Wed, Dec 2, 2020 at 2:16 AM Harishkumar Pathangay
>  wrote:
> >
> > Just got the issue resolved.
>
> Nice work!
>
> > Any case I will send the cib.xml and my custom db2 resource agent.
> >
> > The existing resource agent is for an HADR database, where there are two
> > databases: one running as primary and the other as standby.
>
> HADR is only one option. There's also a standard mode:
>   - 
> https://github.com/oalbrigt/resource-agents/blob/master/heartbeat/db2#L64-L69
>
> I don't know much about DB2, so I'm not sure whether that would meet
> your needs. Based on the metadata, standard mode appears to manage a
> single instance (with the databases you select) on one node at a time.
>
> > I have created a script which will start/stop db2 instances with a single 
> > database on shared logical volume [HA-LVM] exclusively activated on one 
> > node.
> >
> >
> >
> > Will mail you shortly.
> >
> >
> >
> > Thanks,
> >
> > Harish P
> >
> >
> >
> > Sent from Mail for Windows 10
> >
> >
> >
> > From: Reid Wahl
> > Sent: 02 December 2020 12:46
> > To: Cluster Labs - All topics related to open-source clustering welcomed
> > Subject: Re: [ClusterLabs] Question on restart of resource during fail over
> >
> >
> >
> > Can you share your pacemaker configuration (i.e.,
> > /var/lib/pacemaker/cib/cib.xml)? If you're concerned about quorum,
> > then also share your /etc/corosync/corosync.conf just in case.
> >
> > Also there's a db2 resource agent already written, if you're interested:
> > - https://github.com/oalbrigt/resource-agents/blob/master/heartbeat/db2
> >
> > On Tue, Dec 1, 2020 at 9:50 AM Harishkumar Pathangay
> >  wrote:
> > >
> > > Hi,
> > >
> > > I have DB2 resource agent scripted by myself.
> > >
> > > It is working fine with a small glitch.
> > >
> > >
> > >
> > > I have node1 and node2 in the cluster. No stonith enabled as I don't need 
> > > one. The environment is for learning purpose only.
> > >
> > >
> > >
> > > If node one is down [power off], it is starting the resource on other 
> > > node which is good. My custom resource agent doing its job. Let us say 
> > > DB2 is running with pid 4567.
> > >
> > >
> > >
> > > Now, the original node which went down is back again.  I issue “pcs 
> > > cluster start” on the node. Node is online. The resource also stays in 
> > > the other node, which is again good. That way unnecessary movement of 
> > > resources is avoided, exactly what I wan

Re: [ClusterLabs] Antw: [EXT] sbd v1.4.2

2020-12-03 Thread Reid Wahl
On Thu, Dec 3, 2020 at 12:03 AM Ulrich Windl
 wrote:
>
> Hi!
>
> See comments inline...
>
> >>> Klaus Wenninger  schrieb am 02.12.2020 um 22:05 in
> Nachricht <1b29fa92-b1b7-2315-fbcf-0787ec0e1...@redhat.com>:
> > Hi sbd ‑ developers & users!
> >
> > Thanks to everybody for contributing to tests and
> > further development.
> >
> > Improvements in build/CI‑friendlyness and
> > added robustness against misconfiguration
> > justify labeling the repo v1.4.2.
> >
> > I tried to quickly summarize the changes in the
> > repo since it was labeled v1.4.1:
> >
> > ‑ improve build/CI‑friendlyness
> >
> >   * travis: switch to F32 as build‑host
> > switch to F32 & leap‑15.2
> > changes for mock‑2.0
> > turn off loop‑devices & device‑mapper on x86_64 targets because
> > of changes in GCE
> >   * regressions.sh: get timeouts from disk‑header to go with proper
> defaults
> > for architecture
> >   * use configure for watchdog‑default‑timeout & others
> >   * ship sbd.pc with basic sbd build information for downstream packages
> > to use
> >   * add number of commits since version‑tag to build‑counter
> >
> > ‑ add robustness against misconfiguration / improve documentation
> >
> >   * add environment section to man‑page previously just available in
> > template‑config
> >   * inform the user to restart the sbd service after disk‑initialization
>
> I thought with adding UUIDs sbd automatically detects a header change.
>
> >   * refuse to start if any of the configured device names is invalid
>
> Is this a good idea? Assume you configured two devices, and one device fails.
> Do you really want to prevent sbd startup then?

AFAICT, it's just making sure the device name is of a valid format.

https://github.com/ClusterLabs/sbd/blob/master/src/sbd-inquisitor.c#L830-L833
-> https://github.com/ClusterLabs/sbd/blob/master/src/sbd-inquisitor.c#L65-L78
-- --> 
https://github.com/ClusterLabs/sbd/blob/master/src/sbd-common.c#L1189-L1220

> >   * add handshake to sync startup/shutdown with pacemakerd
> > Previously sbd just waited for the cib‑connnection to show up/go away
> > which isn't robust at all.
> > The new feature needs new pacemakerd‑api as counterpart.
> > Thus build checks for presence of pacemakerd‑api.
> > To simplify downstream adoption behavior is configurable at runtime
> > via configure‑file with a build‑time‑configurable default.
> >   * refuse to start if qdevice‑sync_timeout doesn't match watchdog‑timeout
> > Needed in particular as qdevice‑sync_timeout delays quorum‑state‑update
> > and has a default of 30s that doesn't match the 5s watchdog‑timeout
> > default.
> >
> > ‑ Fix: sbd‑pacemaker: handle new no_quorum_demote + robustness against new
> >   policies added
> > ‑ Fix: agent: correctly compare string values when calculating timeout
> > ‑ Fix: scheduling: overhaul the whole thing
> >   * prevent possible lockup when format in proc changes
> >   * properly get and handle scheduler policy & prio
> >   * on SCHED_RR failing push to the max with SCHED_OTHER
>
> Do you also mess with ioprio/ionice?
>
> Regards,
> Ulrich
>
> >
> > Regards,
> > Klaus
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Final Pacemaker 2.0.5 release now available

2020-12-03 Thread Reid Wahl
On Thu, Dec 3, 2020 at 12:21 AM Reid Wahl  wrote:
>
> Have you tried `crm_mon -s`?
>
> # crm_mon --help-all | grep ' \-s'
>   -s, --simple-status   Display the cluster status once as
> a simple one line output (suitable for nagios)
>
> Caveat: This isn't without flaws.
>   - Bug 1576103 - `crm_mon -s` prints "CLUSTER OK" when there are
> unclean (online) nodes
> (https://bugzilla.redhat.com/show_bug.cgi?id=1576103)
>   - `crm_mon -s` prints `"CLUSTER OK"` when there are nodes in
> `UNCLEAN (online)` status
> (https://access.redhat.com/solutions/3441221)

Meant to also link:
  - Bug 1577085 - `crm_mon -s`: Improve printed outputs and return
codes (https://bugzilla.redhat.com/show_bug.cgi?id=1577085)
  - `crm_mon -s` return codes do not accurately reflect status of
cluster (https://access.redhat.com/solutions/3461161)


> I dunno if Check CRM still works, given that it was last updated 7 years ago:
>   - 
> https://exchange.nagios.org/directory/Plugins/Clustering-and-High-2DAvailability/Check-CRM/details
>
> On Wed, Dec 2, 2020 at 11:21 PM Ulrich Windl
>  wrote:
> >
> > >>> Christopher Lumens  schrieb am 02.12.2020 um 19:14 
> > >>> in
> > Nachricht <851583983.28225008.1606932881629.javamail.zim...@redhat.com>:
> > > Hi all,
> > >
> > > The final release of Pacemaker version 2.0.5 is now available at:
> > [...]
> > >
> > > * crm_mon additionally supports a --resource= option for resource-based
> > >   filtering, similar to the --node= option introduced in a previous 
> > > release.
> >
> > Another nice extension based on this would be a nagios-compatible output 
> > and exit code. I imagine:
> > OK if the resource is running (or is in its desired state)
> > WARNING if the resource is starting or stopping
> CRITICAL if the resource is stopped (or not in its desired state)
> > UNKNOWN if the status cannot be queried or the resource is not known.
> >
> Of course: Likewise for the nodes
> >
> > clones and master/slave probably would need some special care.
> >
> > Opinions on that?
> >
> > Regards,
> > Ulrich
> >
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> >
>
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Final Pacemaker 2.0.5 release now available

2020-12-03 Thread Reid Wahl
Have you tried `crm_mon -s`?

# crm_mon --help-all | grep ' \-s'
  -s, --simple-status   Display the cluster status once as
a simple one line output (suitable for nagios)

Caveat: This isn't without flaws.
  - Bug 1576103 - `crm_mon -s` prints "CLUSTER OK" when there are
unclean (online) nodes
(https://bugzilla.redhat.com/show_bug.cgi?id=1576103)
  - `crm_mon -s` prints `"CLUSTER OK"` when there are nodes in
`UNCLEAN (online)` status
(https://access.redhat.com/solutions/3441221)

I dunno if Check CRM still works, given that it was last updated 7 years ago:
  - 
https://exchange.nagios.org/directory/Plugins/Clustering-and-High-2DAvailability/Check-CRM/details
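
If something nagios-flavored is needed today, a rough cluster-level wrapper
along the lines Ulrich describes could parse the simple-status output. This
is only a sketch; the "CLUSTER OK" prefix match relies on the behavior noted
in the bugs above, with the known caveats:

    status=$(crm_mon -s 2>&1) || { echo "UNKNOWN: $status"; exit 3; }
    case "$status" in
        "CLUSTER OK"*) echo "$status"; exit 0 ;;
        *)             echo "CRITICAL: $status"; exit 2 ;;
    esac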

On Wed, Dec 2, 2020 at 11:21 PM Ulrich Windl
 wrote:
>
> >>> Christopher Lumens  schrieb am 02.12.2020 um 19:14 in
> Nachricht <851583983.28225008.1606932881629.javamail.zim...@redhat.com>:
> > Hi all,
> >
> > The final release of Pacemaker version 2.0.5 is now available at:
> [...]
> >
> > * crm_mon additionally supports a --resource= option for resource-based
> >   filtering, similar to the --node= option introduced in a previous release.
>
> Another nice extension based on this would be a nagios-compatible output and 
> exit code. I imagine:
> OK if the resource is running (or is in its desired state)
> WARNING if the resource is starting or stopping
> CRITICAL if the resource is stopped (or not in its desired state)
> UNKNOWN if the status cannot be queried or the resource is not known.
>
> Of course: Likewise for the nodes
>
> clones and master/slave probably would need some special care.
>
> Opinions on that?
>
> Regards,
> Ulrich
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pcs status shows all nodes online but pcs cluster status shows all nodes off line

2020-12-01 Thread Reid Wahl
You've added the high-availability service to the public zone. Can you
verify that the interface you're using for pcsd is bound to the public
zone?

Check whether the https_proxy environment variable is set. You may
need to unset it or to add it to /etc/sysconfig/pcsd.
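
A few quick checks that might help here (the interface name is just an
example):

    firewall-cmd --get-active-zones
    firewall-cmd --get-zone-of-interface=eth0
    firewall-cmd --zone=public --list-all
    env | grep -i _proxy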

Just an FYI, the `--name` option isn't used with `pcs cluster auth`.
It looks like it gets ignored though.

On Mon, Nov 30, 2020 at 5:47 AM John Karippery  wrote:
>
> I have a problem while setting up Pacemaker on Debian 9 servers. I have 3
> servers and I installed:
>
> apt install pacemaker corosync pcsd firewalld fence-agents
>
>
> pcs status
>
> pcs status
> Cluster name: vipcluster
> Stack: corosync
> Current DC: server1 (version 1.1.16-94ff4df) - partition with quorum
> Last updated: Mon Nov 30 14:43:36 2020
> Last change: Mon Nov 30 13:03:09 2020 by root via cibadmin on server1
>
> 3 nodes configured
> 2 resources configured
>
> Online: [ server1 server2 server3 ]
>
> Full list of resources:
>
>  MasterVip  (ocf::heartbeat:IPaddr2):   Started server1
>  Apache (ocf::heartbeat:apache):Started server1
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
>
>
>
> pcs cluster status
> Cluster Status:
>  Stack: corosync
>  Current DC: server1 (version 1.1.16-94ff4df) - partition with quorum
>  Last updated: Mon Nov 30 14:44:55 2020
>  Last change: Mon Nov 30 13:03:09 2020 by root via cibadmin on server1
>  3 nodes configured
>  2 resources configured
>
> PCSD Status:
>   server3: Offline
>   server1: Offline
>   server2: Offline
>
>
>
> error is showing while pcs auth
>
>  pcs cluster auth server1 server2 server3 --name vipcluster -u hacluster -p 
> 12345678 --debug
> Running: /usr/bin/ruby -I/usr/share/pcsd/ /usr/share/pcsd/pcsd-cli.rb auth
> --Debug Input Start--
> {"username": "hacluster", "local": false, "nodes": ["server1", "server2", 
> "server3"], "password": "12345678", "force": false}
> --Debug Input End--
> Return Value: 0
> --Debug Output Start--
> {
>   "status": "ok",
>   "data": {
> "auth_responses": {
>   "server1": {
> "status": "noresponse"
>   },
>   "server2": {
> "status": "noresponse"
>   },
>   "server3": {
> "status": "noresponse"
>   }
> },
> "sync_successful": true,
> "sync_nodes_err": [
>
> ],
> "sync_responses": {
> }
>   },
>   "log": [
> "I, [2020-11-30T14:46:24.758862 #9677]  INFO -- : PCSD Debugging 
> enabled\n",
> "D, [2020-11-30T14:46:24.758900 #9677] DEBUG -- : Did not detect RHEL 
> 6\n",
> "I, [2020-11-30T14:46:24.758919 #9677]  INFO -- : Running: 
> /usr/sbin/corosync-cmapctl totem.cluster_name\n",
> "I, [2020-11-30T14:46:24.758931 #9677]  INFO -- : CIB USER: hacluster, 
> groups: \n",
> "D, [2020-11-30T14:46:24.770175 #9677] DEBUG -- : [\"totem.cluster_name 
> (str) = vipcluster\\n\"]\n",
> "D, [2020-11-30T14:46:24.770373 #9677] DEBUG -- : []\n",
> "D, [2020-11-30T14:46:24.770444 #9677] DEBUG -- : Duration: 
> 0.011215661s\n",
> "I, [2020-11-30T14:46:24.770585 #9677]  INFO -- : Return Value: 0\n",
> "I, [2020-11-30T14:46:24.772514 #9677]  INFO -- : SRWT Node: server1 
> Request: check_auth\n",
> "E, [2020-11-30T14:46:24.772628 #9677] ERROR -- : Unable to connect to 
> node server1, no token available\n",
> "I, [2020-11-30T14:46:24.772943 #9677]  INFO -- : SRWT Node: server2 
> Request: check_auth\n",
> "E, [2020-11-30T14:46:24.773032 #9677] ERROR -- : Unable to connect to 
> node server2, no token available\n",
> "I, [2020-11-30T14:46:24.773202 #9677]  INFO -- : SRWT Node: server3 
> Request: check_auth\n",
> "E, [2020-11-30T14:46:24.773278 #9677] ERROR -- : Unable to connect to 
> node server3, no token available\n"
>   ]
> }
> --Debug Output End--
>
> Error: Unable to communicate with server1
> Error: Unable to communicate with server2
> Error: Unable to communicate with server3
>
>
> firewall settings
>
>
> # firewall-cmd --permanent --add-service=high-availability
> Warning: ALREADY_ENABLED: high-availability
> success
> ~# firewall-cmd --add-service=high-availability
> Warning: ALREADY_ENABLED: 'high-availability' already in 'public'
> success
>
>
>
>
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Question on restart of resource during fail over

2020-12-01 Thread Reid Wahl
Can you share your pacemaker configuration (i.e.,
/var/lib/pacemaker/cib/cib.xml)? If you're concerned about quorum,
then also share your /etc/corosync/corosync.conf just in case.

Also there's a db2 resource agent already written, if you're interested:
- https://github.com/oalbrigt/resource-agents/blob/master/heartbeat/db2

On Tue, Dec 1, 2020 at 9:50 AM Harishkumar Pathangay
 wrote:
>
> Hi,
>
> I have DB2 resource agent scripted by myself.
>
> It is working fine with a small glitch.
>
>
>
> I have node1 and node2 in the cluster. No stonith enabled as I don't need 
> one. The environment is for learning purpose only.
>
>
>
> If node one is down [power off], it is starting the resource on other node 
> which is good. My custom resource agent doing its job. Let us say DB2 is 
> running with pid 4567.
>
>
>
> Now, the original node which went down is back again.  I issue “pcs cluster 
> start” on the node. Node is online. The resource also stays on the other 
> node, which is again good. That way unnecessary movement of resources is 
> avoided, exactly what I want. Good, but there is an issue.
>
> On the other node it is restarting the DB2 resource. So my pid of db2 changes 
> to 3452.
>
> This is an unnecessary restart of the resource, which I want to avoid.
>
> How do I get this working?
>
>
>
> I am very new to cluster pacemaker.
>
> Please help me so that I can create a working DB2 cluster for my learning 
> purpose.
>
> Also I will be blogging in my youtube channel DB2LUWACADEMY.
>
> Please any help is of great significance to me.
>
>
>
> I think it could be a quorum issue, but I don't know for sure, because there 
> are only two nodes and the DB2 resource needs to be active on only one node.
>
>
>
> How do I get this configured.
>
>
>
> Thanks.
>
> Harish P
>
>
>
>
>
> Sent from Mail for Windows 10
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Question on restart of resource during fail over

2020-12-02 Thread Reid Wahl
On Wed, Dec 2, 2020 at 2:16 AM Harishkumar Pathangay
 wrote:
>
> Just got the issue resolved.

Nice work!

> In any case, I will send the cib.xml and my custom db2 resource agent.
>
> The existing resource agent is for an HADR database, where there are two
> databases: one running as primary and the other as standby.

HADR is only one option. There's also a standard mode:
  - 
https://github.com/oalbrigt/resource-agents/blob/master/heartbeat/db2#L64-L69

I don't know much about DB2, so I'm not sure whether that would meet
your needs. Based on the metadata, standard mode appears to manage a
single instance (with the databases you select) on one node at a time.
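
For reference, a minimal sketch of a standard-mode resource with the stock
agent (the instance and database names below are placeholders, not taken
from your setup):

    pcs resource create db2_inst ocf:heartbeat:db2 instance=db2inst1 \
        dblist=SAMPLE op monitor interval=30s timeout=60s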

> I have created a script which will start/stop db2 instances with a single 
> database on shared logical volume [HA-LVM] exclusively activated on one node.
>
>
>
> Will mail you shortly.
>
>
>
> Thanks,
>
> Harish P
>
>
>
> Sent from Mail for Windows 10
>
>
>
> From: Reid Wahl
> Sent: 02 December 2020 12:46
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Question on restart of resource during fail over
>
>
>
> Can you share your pacemaker configuration (i.e.,
> /var/lib/pacemaker/cib/cib.xml)? If you're concerned about quorum,
> then also share your /etc/corosync/corosync.conf just in case.
>
> Also there's a db2 resource agent already written, if you're interested:
> - https://github.com/oalbrigt/resource-agents/blob/master/heartbeat/db2
>
> On Tue, Dec 1, 2020 at 9:50 AM Harishkumar Pathangay
>  wrote:
> >
> > Hi,
> >
> > I have DB2 resource agent scripted by myself.
> >
> > It is working fine with a small glitch.
> >
> >
> >
> > I have node1 and node2 in the cluster. No stonith enabled as I don't need 
> > one. The environment is for learning purpose only.
> >
> >
> >
> > If node one is down [power off], it is starting the resource on other node 
> > which is good. My custom resource agent doing its job. Let us say DB2 is 
> > running with pid 4567.
> >
> >
> >
> > Now, the original node which went down is back again.  I issue “pcs cluster 
> > start” on the node. Node is online. The resource also stays in the other 
> > node, which is again good. That way unnecessary movement of resources is 
> > avoided, exactly what I want. Good but there is a issue.
> >
> > On the other node it is restarting the DB2 resource. So my pid of db2 
> > changes to 3452.
> >
> > This is unnecessary restart of resource which I want to avoid.
> >
> > How to I get this working.
> >
> >
> >
> > I am very new to cluster pacemaker.
> >
> > Please help me so that I can create a working DB2 cluster for my learning 
> > purpose.
> >
> > Also I will be blogging in my youtube channel DB2LUWACADEMY.
> >
> > Please any help is of great significance to me.
> >
> >
> >
> > I think it could be quorum issue. But don't know for sure, because there is 
> > only two nodes and DB2 resource needs to be active only in one node.
> >
> >
> >
> > How do I get this configured.
> >
> >
> >
> > Thanks.
> >
> > Harish P
> >
> >
> >
> >
> >
> > Sent from Mail for Windows 10
> >
> >
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
>
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Changing order in resource group after it's created

2020-12-17 Thread Reid Wahl
I agree that it is somewhat counter-intuitive. It is in the `--help`
output, however -- although you have to already have an idea of where
to look in order to find it ;).

[root@fastvm-rhel-7-6-21 ~]# pcs resource group --help

Usage: pcs resource group...
group add   [resource id] ... [resource id]
  [--before  | --after ] [--wait[=n]]
Add the specified resource to the group, creating the group if it does
not exist. If the resource is present in another group it is moved to
the new group. You can use --before or --after to specify the position
of the added resources relatively to some resource already existing in
the group. By adding resources to a group they are already in and
specifying --after or --before you can move the resources in the group.
If --wait is specified, pcs will wait up to 'n' seconds for the
operation to finish (including moving resources if appropriate) and
then return 0 on success or 1 on error. If 'n' is not specified it
defaults to 60 minutes.


Note: "By adding resources to a group they are already in and
specifying --after or --before you can move the resources in the
group."

On Thu, Dec 17, 2020 at 3:39 AM Tony Stocker  wrote:
>
> On Thu, Dec 17, 2020 at 6:29 AM Ulrich Windl
>  wrote:
> >
> > >>> Tony Stocker  schrieb am 17.12.2020 um 12:21 in
> > Nachricht
> > :
> > > I have a resource group that has a number of entries. If I want to
> > > reorder them, how do I do that?
> > >
> > > I tried doing this:
> > >
> > > pcs resource update FileMount ‑‑after InternalIP
> > >
> > > but got this error:
> > >
> > > Error: Specified option '‑‑after' is not supported in this command
>
> >
> > I have no experince with pcs, but with crm I'd do:
> > enable maintenance mode (if you cannot restart the group)
> > "crm configure edit 
> > disable maintenance mode (the cluster should see that everything that is to 
> > be
> > started is started, and things should be OK)
> >
> > Regards,
> > Ulrich
>
> Thanks. I think I may have figured out the way using pcs. It seems
> somewhat counterintuitive to me, but you have to use 'pcs resource
> group' as if you're creating a new group, while using the existing
> group id, so the command looks like this:
>
> pcs resource group add webserver FileMount --after InternalIP
>
>
> It appears to work, so I suppose I should have waited before posting
> but that's usually how it is. I bang my head against the desk for an
> hour without finding something, then I post a question and magically I
> find the answer right afterwards. C'est la guerre!
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Update password of hacluster user in live Production

2020-12-10 Thread Reid Wahl
Should be fine. AFAIK the only thing that uses password auth for the
hacluster user is the `pcs cluster auth` command (and maybe something
within the crmsh toolset). I don't think anything within Pacemaker
uses password auth for hacluster.
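
If you do change it, a hedged sequence (set the same password on every node;
the auth command is `pcs host auth` on pcs 0.10+ and `pcs cluster auth` on
older releases):

    passwd hacluster
    pcs host auth node1 node2 -u hacluster -p 'newpassword'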

On Thu, Dec 10, 2020 at 7:53 PM rahim_...@yahoo.com  wrote:
>
> Hi,
>
> Can we update password of hacluster user in live Production systems with 
> RedHat Pacemaker cluster?  What is the impact on a running 
> system/applications hosted on the cluster?
>
> Thanks.
> Abdul
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Recovering from node failure

2020-12-11 Thread Reid Wahl
Hi, Gabriele. It sounds like you don't have quorum on node 1.
Resources won't start unless the node is part of a quorate cluster
partition.

You probably have "two_node: 1" configured by default in
corosync.conf. This setting automatically enables wait_for_all.

From the votequorum(5) man page:

   NOTES:  enabling  two_node:  1  automatically  enables
wait_for_all. It is still possible to override wait_for_all by
explicitly setting it to 0.  If more than 2 nodes join the cluster,
the two_node
   option is automatically disabled.

   wait_for_all: 1

   Enables Wait For All (WFA) feature (default: 0).

   The general behaviour of votequorum is to switch a cluster from
inquorate to quorate as soon as possible. For example, in an 8 node
cluster, where every node has 1 vote, expected_votes is set  to  8
   and quorum is (50% + 1) 5. As soon as 5 (or more) nodes are
visible to each other, the partition of 5 (or more) becomes quorate
and can start operating.

   When WFA is enabled, the cluster will be quorate for the first
time only after all nodes have been visible at least once at the same
time.

   This feature has the advantage of avoiding some startup race
conditions, with the cost that all nodes need to be up at the same
time at least once before the cluster can operate.

You can either unblock quorum (`pcs quorum unblock` with pcs -- not
sure how to do it with crmsh) or set `wait_for_all: 0` in
corosync.conf and restart the cluster services.
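
For reference, the relevant corosync.conf fragment would look roughly like
this (a sketch; the option placement may differ in your file):

    quorum {
        provider: corosync_votequorum
        two_node: 1
        wait_for_all: 0
    }

and the non-crmsh way to unblock quorum while node 2 stays down is simply
`pcs quorum unblock`.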

On Fri, Dec 11, 2020 at 2:23 AM Gabriele Bulfon  wrote:
>
> Hi, I finally could manage stonith with IPMI in my 2 nodes XStreamOS/illumos 
> storage cluster.
> I have NFS IPs and a shared storage zpool moving from one node to the other, 
> and stonith controlling IPMI power-off when something is not clear.
>
> What happens now is that if I shut down the 2nd node, I see the OFFLINE status 
> from node 1 and everything is up and running, and this is ok:
>
>
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
>
> Full list of resources:
>
>  xstha1_san0_IP  (ocf::heartbeat:IPaddr):Started xstha1
>  xstha2_san0_IP  (ocf::heartbeat:IPaddr):Started xstha1
>  xstha1-stonith  (stonith:external/ipmi):Started xstha1
>  xstha2-stonith  (stonith:external/ipmi):Started xstha1
>  zpool_data  (ocf::heartbeat:ZFS):   Started xstha1
>
> But if I also reboot the 1st node, it starts with the UNCLEAN state and nothing 
> is running, so I clear the state of node 2, but resources are not started:
>
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
>
> Full list of resources:
>
>  xstha1_san0_IP  (ocf::heartbeat:IPaddr):Stopped
>  xstha2_san0_IP  (ocf::heartbeat:IPaddr):Stopped
>  xstha1-stonith  (stonith:external/ipmi):Stopped
>  xstha2-stonith  (stonith:external/ipmi):Stopped
>  zpool_data  (ocf::heartbeat:ZFS):   Stopped
>
> I tried restarting zpool_data or other resources:
>
> # crm resource start zpool_data
>
> but nothing happens!
> How can I recover from this state? Node2 needs to stay down, but I want node1 
> to work.
>
> Thanks!
> Gabriele
>
>
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] issue

2020-11-15 Thread Reid Wahl
ocf:heartbeat:db2inst isn't a resource agent that ClusterLabs maintains, so
we have no insight into how it works and why it's in Started state. I don't
know based on the output what resource agent db2_db2ins11_db2ins11_QUERYDB
is using.

I recommend that you take a look at the resource agent's monitor function
and see what it's actually checking. Then you can determine why the monitor
operation is succeeding, and you can modify the function so that the
monitor operation will fail when the DB is down.
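
One way to see what the agent actually checks is to run its monitor action
by hand as root; the path and parameter name below are assumptions based on
the resource IDs in your output:

    OCF_ROOT=/usr/lib/ocf OCF_RESKEY_instance=db2ins11 \
        /usr/lib/ocf/resource.d/heartbeat/db2inst monitor; echo "rc=$?"

With the database down, a correct monitor should return 7 (OCF_NOT_RUNNING)
rather than 0.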

On Sun, Nov 15, 2020 at 8:52 AM Guy Przytula 
wrote:

> I have installed latest version of pacemaker on redhat 8
>
> I wanted to test it out for a cluster for IBM Db2
>
> There is only one issue :
>
> I have 2 nodes : nodep and nodes
>
> the database/instance resource are primary on nodep
>
> if I stop the process (db2) on nodes : it is automatically started : ok
>
> now I renamed the startup command, so the process can not startup anymore
>
> I see the process is down : but the status of cluster is normal
>
> in the screen you can see : msg : no start database 
>
> go to root : execute crm status : all started 
>
> --
> Best Regards, Beste Groeten,  Meilleures Salutations
> *Guy Przytula*
>
> Tel. GSM : +32 (0)475-33.81.86
>
> Email : Guy Przytula 
>
>
>   [image: signature]
>
>
>
> Infocura - Tel : +32 (0) 478 32 83 54
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] stop a node

2020-11-15 Thread Reid Wahl
Put the node in standby mode if you want to keep it as an active
member of the cluster but don't want to allow it to run any resources.

Stop Pacemaker (and possibly corosync, depending on your needs) if you
want to prevent the node from being an active member of the cluster.
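
For example, with pcs (crmsh and crm_standby offer equivalents; older pcs
releases use `pcs cluster standby` instead of `pcs node standby`):

    pcs node standby node1      # stay in the cluster, but run no resources
    pcs node unstandby node1    # allow resources on the node again
    pcs cluster stop node1      # stop pacemaker and corosync on the node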

On Sun, Nov 15, 2020 at 12:00 PM Andrei Borzenkov  wrote:
>
> 15.11.2020 20:00, Guy Przytula wrote:
> > a question would be :
> >
> > we have maintenance to perform on a node of the cluster
> >
> > to avoid that the cluster starts the resource that we stopped - we want
> > to disable a node temporarily - is this possible without deleting the node
> >
>
>
> Put node in standby using crm_standby or any high level tool you are
> using (pcs, crmsh, hawk, ...).
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] issue

2020-11-17 Thread Reid Wahl
On Mon, Nov 16, 2020 at 11:23 PM Guy Przytula 
wrote:

> sorry for coming back and thanks for the answers
>
> but how do you make a relation between your resource(s) and the script ?
>
> a link to a doc would be nice..  so I do not need to ask questions to the
> group..
>

Ken provided the following link in the "integration" thread, with
information about OCF resources. Are you requesting something different?
-
https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc

> Best Regards, Beste Groeten,  Meilleures Salutations
> *Guy Przytula*
>
> Tel. GSM : +32 (0)475-33.81.86
>
> Email : Guy Przytula 
>
>
>   [image: signature]
>
>
>
> Infocura - Tel : +32 (0) 478 32 83 54
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Fwd: issue

2020-11-17 Thread Reid Wahl
Forwarding Guy's response to me, onward to the list.

I work on the cluster support team at Red Hat and I'm not aware that we
ship this db2cm script. Since you've said the script is defined by IBM and
also Red Hat, where did you obtain the script?

The resource agent script for ocf:heartbeat:db2inst should be located at
/usr/lib/ocf/resource.d/heartbeat/db2inst. It should have one or more
functions that it runs when the monitor operation is executed.

If the resource is showing "Started" state when the DB is in fact down,
then one of two explanations is likely:
- The function that is run for a monitor operation doesn't properly check
the DB status. If this is the case, then it sounds like a bug in the
script. The logic in that function needs to be modified. I would suggest
reaching out to whoever maintains the
/usr/lib/ocf/resource.d/heartbeat/db2inst script.
- No recurring monitor operation has been defined for the resource, so
Pacemaker is not even checking the resource's status. If this is the case,
then a monitor operation needs to be added.
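
In the latter case, something along these lines would add one (the resource
ID and timings are placeholders):

    pcs resource op add <resource-id> monitor interval=30s timeout=60s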

-- Forwarded message -
From: Guy Przytula 
Date: Tue, Nov 17, 2020 at 2:33 AM
Subject: Re: [ClusterLabs] issue
To: Reid Wahl 


we are using a db2 (ibm and also redhat) defined script db2cm to create the
resources and handling these. we do not handle the scripts for
start/stop/monitor

in current version of db2 pacemaker is shipped as a tech-preview, but in
next release, pacemaker will be officially supported for db2

if the problem persist, we can open a ticket for this at ibm

-- 
Best Regards, Beste Groeten,  Meilleures Salutations
*Guy Przytula*

Tel. GSM : +32 (0)475-33.81.86

Email : Guy Przytula 


  [image: signature]



Infocura - Tel : +32 (0) 478 32 83 54


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] issue

2020-11-17 Thread Reid Wahl
Guy emailed me again directly, stating that the mailing list does not
accept URLs. Guy, I'm honestly not sure what you're referring to,
since Ken and I have both posted URLs within this thread.

~~~
we downloaded pacemaker from :

https://mrs-ux.mrs-prod-7d4bdc08e7ddc90fa89b373d95c240eb-.us-south.containers.appdomain.cloud/marketing/iwm/platform/mrs/assets/DownloadList?source=mrs-db2pcmk=en_US

in the subdirectory : Db2   there is db2cm located

in Db2agents we have the files to be copied to ...heartbeat
~~~

I followed that link, and it goes to an IBM site. I don't have login
credentials, but I don't see any indication that the db2cm script or
the Db2agents script are developed or maintained by Red Hat or by
ClusterLabs.

In addition to the previous guidance, I second Ken's pointer to the
"Pacemaker Explained" link. That's a fairly comprehensive Pacemaker
configuration document. Personally, most of my experience is in using
the pcs command-line tool to administer Pacemaker, and I recommend
looking into that as well if you're not already using it.
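
As a small illustration of what Ken describes below (agent, parameters, and
operation values here are generic placeholders, not DB2-specific):

    pcs resource create my_ip ocf:heartbeat:IPaddr2 ip=192.168.0.10 \
        cidr_netmask=24 op monitor interval=30s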

On Tue, Nov 17, 2020 at 7:39 AM Ken Gaillot  wrote:
>
> On Tue, 2020-11-17 at 08:23 +0100, Guy Przytula wrote:
> > sorry for coming back and thanks for the answers
> > but how do you make a relation between your resource(s) and the
> > script ?
>
> You would configure a resource in the Pacemaker configuration,
> specifying the agent (script), resource parameters, and operations (a
> recurring monitor, and possibly custom timeouts for specific actions).
>
> > a link to a doc would be nice..  so I do not need to ask questions to
> > the group..
>
> The full doc for Pacemaker configuration (using XML) is:
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html
>
> Most people use a higher-level tool such as crm shell or pcs to have a
> more friendly interface than XML. Those tools will come with their own
> docs (the man page is a good start).
>
> > Best Regards, Beste Groeten,  Meilleures Salutations
> > Guy Przytula
> >
> > Tel. GSM : +32 (0)475-33.81.86
> >
> > Email : Guy Przytula
> >
> >
> >
> >
> >
> >
> > Infocura - Tel : +32 (0) 478 32 83 54
> --
> Ken Gaillot 
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] issue

2020-11-17 Thread Reid Wahl
Thanks so much, Gerry. I'm glad that we know the right direction to point
toward now.

On Tuesday, November 17, 2020, Gerry R Sommerville  wrote:
> Hey Reid, and Guy,
>
> I want to clarify that Guy is using the Db2 built and packaged version of
Pacemaker and Corosync released for technical preview to be used with Db2
V11.5.4.0. The db2cm utility and db2 specific resource agents (db2ethmon,
db2inst, db2hadr) were developed and packaged by Db2, not by Redhat or
ClusterLabs. Support is limited since it's a technical preview, but he should
reach out to Db2 for questions about these scripts so we can make
improvements if necessary.
>
> I have reached out to Guy directly and can hopefully answer his questions.
>
> Gerry Sommerville
> Db2 Development, pureScale Domain
> E-mail: ge...@ca.ibm.com
>
>
>
> ----- Original message -
> From: Reid Wahl 
> Sent by: "Users" 
> To: Cluster Labs - All topics related to open-source clustering welcomed <
users@clusterlabs.org>
> Cc:
> Subject: [EXTERNAL] [ClusterLabs] Fwd: issue
> Date: Tue, Nov 17, 2020 5:48 AM
>
> Forwarding Guy's response to me, onward to the list.
>
> I work on the cluster support team at Red Hat and I'm not aware that we
ship this db2cm script. Since you've said the script is defined by IBM and
also Red Hat, where did you obtain the script?
>
> The resource agent script for ocf:heartbeat:db2inst should be located at
/usr/lib/ocf/resource.d/heartbeat/db2inst. It should have one or more
functions that it runs when the monitor operation is executed.
>
> If the resource is showing "Started" state when the DB is in fact down,
then one of two explanations is likely:
> - The function that is run for a monitor operation doesn't properly check
the DB status. If this is the case, then it sounds like a bug in the
script. The logic in that function needs to be modified. I would suggest
reaching out to whoever maintains the
/usr/lib/ocf/resource.d/heartbeat/db2inst script.
> - No recurring monitor operation has been defined for the resource, so
Pacemaker is not even checking the resource's status. If this is the case,
then a monitor operation needs to be added.
>
> -- Forwarded message -
> From: Guy Przytula 
> Date: Tue, Nov 17, 2020 at 2:33 AM
> Subject: Re: [ClusterLabs] issue
> To: Reid Wahl 
>
>
> we are using a db2 (ibm and also redhat) defined script db2cm to create
the resources and handling these. we do not handle the scripts for
start/stop/monitor
>
> in current version of db2 pacemaker is shipped as a tech-preview, but in
next release, pacemaker will be officially supported for db2
>
> if the problem persist, we can open a ticket for this at ibm
>
>
>
> --
> Best Regards, Beste Groeten,  Meilleures Salutations
> Guy Przytula
>
> Tel. GSM : +32 (0)475-33.81.86
>
> Email : Guy Przytula
>
>




>

> Infocura - Tel : +32 (0) 478 32 83 54
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
> ___
> Manage your subscription:
>
https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.clusterlabs.org_mailman_listinfo_users=DwICAg=jf_iaSHvJObTbx-siA1ZOg=wM6kshEI2xiGJ3-6yiswtA=5eW2lZd_s1sKR4xUyvRBxqIderys7WLTy1TM_6PFDMM=LJfrf72lxdACmdclxNw-v3Q6kaYAhPB7mDWGlPGrGtI=

>
> ClusterLabs home:
https://urldefense.proofpoint.com/v2/url?u=https-3A__www.clusterlabs.org_=DwICAg=jf_iaSHvJObTbx-siA1ZOg=wM6kshEI2xiGJ3-6yiswtA=5eW2lZd_s1sKR4xUyvRBxqIderys7WLTy1TM_6PFDMM=grYaEHjsYHDmVgDis3zuFzmfqhXsq42Sem8HE2AI1iQ=

>
>
>

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fencing explanation

2020-12-28 Thread Reid Wahl
Hi, Ignazio. You can set the delay in either of two ways:
- Using the `delay` attribute, whose value is a bare integer
(representing the number of seconds). This is implemented within the
fencing library (/usr/share/fence/fencing.py).
- Using the `pcmk_delay_base` attribute, whose value is more flexible
(e.g., "60", "60s", "1m") as shown below. This is implemented within
Pacemaker's fencer component.

 * \param[in] input  Pacemaker time interval specification (a bare number of
 *   seconds, a number with a unit optionally with whitespace
 *   before and/or after the number, or an ISO 8601 duration)

In practice, I don't believe it matters which one you use. I see the
`delay` attribute used more commonly than the `pcmk_delay_base`
attribute.

For some additional info:
- Table 13.1. Additional Properties of Fencing Resources
(https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#idm140583403103104)
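
As a concrete sketch (agent, address, and credentials are placeholders, and
parameter names vary a bit between fence-agents versions):

    pcs stonith create fence_nodeC fence_ipmilan ipaddr=192.168.1.30 \
        login=admin passwd=secret lanplus=1 pcmk_host_list=nodeC \
        pcmk_delay_base=10s op monitor interval=60s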

On Mon, Dec 28, 2020 at 1:22 PM Ignazio Cassano
 wrote:
>
> Hello all, I am setting up a Pacemaker cluster with CentOS 7 and IPMI iDRAC 
> fencing devices.
> What I did not understand is how to set the number of seconds before a node is 
> rebooted by stonith.
> If the cluster is made up of 3 nodes (A, B, C) and node C is unreachable 
> (for example, its network cards are corrupted), after how many seconds is it 
> rebooted by stonith?
> Which is the parameter to set the number of seconds?
> Sorry for my bad english
> Thanks
> Ignazio
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Calling crm executables via effective uid

2021-01-07 Thread Reid Wahl
Seems that the SCM_CREDENTIALS ancillary message passes the real UID rather
than the effective UID in the ucred struct. It looks like that's where we
get a value for ugp.uid.

I wonder if there's any way to work around this and whether it's intended
behavior. Based on variable naming (c->euid), libqb seems to expect an
effective UID. For example:

https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c#L497
https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c#L652

I'm way out of my depth here :) Just seemed fun to dig into.


On Thu, Jan 7, 2021 at 7:08 PM Reid Wahl  wrote:

>
> On Thu, Jan 7, 2021 at 6:16 PM Reid Wahl  wrote:
>
>> For whatever reason, the IPC from the crm_mon client to the CIB
>> manager is getting opened with the real UID ("testuser" in my case)
>> instead of the effective UID. The CIB manager checks this unprivileged
>> user against the ACL list and pre-filters the entire CIB, causing a
>> "Permission denied" error.
>>
>> What I haven't figured out yet (if I even keep going down this rabbit
>> hole) is why the IPC is attached to the real UID even though the
>> executable is owned by cmadmin with the setuid bit enabled.
>>
>
> Seems to be getting set within libqb, and I'm not sure whether it's
> intentional that the real UID is what gets passed.
>
> It doesn't seem to be possible to debug cmadmin's SUID process when
> running it as testuser. I'm not particularly familiar with libqb and am not
> sure how to debug further without building libqb from source and adding
> tracing, which is a whole other can of worms.
>
>
>> On Mon, Dec 14, 2020 at 4:41 AM Klaus Wenninger 
>> wrote:
>> >
>> > On 12/11/20 10:20 PM, Alex Zarifoglu wrote:
>> > > Hello,
>> > >
>> > > I have question regarding the running crm commands with the effective
>> uid.
>> > >
>> > > I am trying to create a tool to manage pacemaker resources for
>> > > multiple users. For security reasons, these users will only be able to
>> > > create/delete/manage resources that can impact that specific user
>> > > only. I cannot achieve this via ACLs because it is not possible to
>> > > enforce every user to only create primitives with certain parameters,
>> > > rules etc.
>> > >
>> > > Therefore, I created a user called cmadmin which has full write access
>> > > to the cib. And created an executable which is owned by this user and
>> > > has the setuid and setgid bits set.
>> > >
>> > > -r-sr-s--x   1 cmadmin cmadmin 24248 Dec 11 07:04 cmexc
>> > >
>> > > Within this executable I have the code:
>> > >
>> > >  pid_tpid;
>> > >  char*constparmList[] = {"/sbin/crm_mon", "-1", "-VVV", NULL};
>> > >
>> > >  if((pid = fork()) == -1)
>> > > perror("fork error");
>> > >  else if(pid == 0) {
>> > > execv("/sbin/crm_mon", parmList);
>> > > printf("execv error");
>> > >  }
>> > >
>> > >
>> > > If I run this with a user other than cmadmin, crm_mon fails. I tested
>> > > with another executable to make sure effective user id is passed in
>> > > correctly and it worked fine.
>> > >
>> > > Checking the trace, we fail here with eacces permission denied:
>> > > |(crm_ipc_send)   trace: Sending cib_ro IPC request 5 of 191 bytes
>> > > using 12ms timeout|
>> > > |(internal_ipc_get_reply) trace: client cib_ro waiting on reply to msg
>> > > id 5|
>> > > |(crm_ipc_send)   trace: Received 179-byte reply 5 to cib_ro IPC 5:
>> > > > > > cib_clientid="f58912bf-cab6-4d1b-9025-701fc147c|
>> > > |(cib_native_perform_op_delegate) trace: Reply   > > > cib_op="cib_query" cib_callid="2"
>> > > cib_clientid="f58912bf-cab6-4d1b-9025-701fc147c6cd" cib_callopt="4352"
>> > > *cib_rc="-13"*/>|
>> > >
>> > > I tested with other pacemaker commands and got similar results. I’ve
>> > > also tried adding users to haclient group (not to acls just to the
>> > > group) with no success.
>> > >
>> > > Is it not possible to change effective uids and call crm executables?
>> > > If so why and is there way I can achieve what I need differently?
>> > Are you running with selinux enforcing?
>> > Not saying you shouldn't - just to narrow down ...

Re: [ClusterLabs] Calling crm executables via effective uid

2021-01-07 Thread Reid Wahl
On Thu, Jan 7, 2021 at 6:16 PM Reid Wahl  wrote:

> For whatever reason, the IPC from the crm_mon client to the CIB
> manager is getting opened with the real UID ("testuser" in my case)
> instead of the effective UID. The CIB manager checks this unprivileged
> user against the ACL list and pre-filters the entire CIB, causing a
> "Permission denied" error.
>
> What I haven't figured out yet (if I even keep going down this rabbit
> hole) is why the IPC is attached to the real UID even though the
> executable is owned by cmadmin with the setuid bit enabled.
>

Seems to be getting set within libqb, and I'm not sure whether it's
intentional that the real UID is what gets passed.

It doesn't seem to be possible to debug cmadmin's SUID process when running
it as testuser. I'm not particularly familiar with libqb and am not sure
how to debug further without building libqb from source and adding tracing,
which is a whole other can of worms.


> On Mon, Dec 14, 2020 at 4:41 AM Klaus Wenninger 
> wrote:
> >
> > On 12/11/20 10:20 PM, Alex Zarifoglu wrote:
> > > Hello,
> > >
> > > I have question regarding the running crm commands with the effective
> uid.
> > >
> > > I am trying to create a tool to manage pacemaker resources for
> > > multiple users. For security reasons, these users will only be able to
> > > create/delete/manage resources that can impact that specific user
> > > only. I cannot achieve this via ACLs because it is not possible to
> > > enforce every user to only create primitives with certain parameters,
> > > rules etc.
> > >
> > > Therefore, I created a user called cmadmin which has full write access
> > > to the cib. And created an executable which is owned by this user and
> > > has the setuid and setgid bits set.
> > >
> > > -r-sr-s--x   1 cmadmin cmadmin 24248 Dec 11 07:04 cmexc
> > >
> > > Within this executable I have the code:
> > >
> > >  pid_tpid;
> > >  char*constparmList[] = {"/sbin/crm_mon", "-1", "-VVV", NULL};
> > >
> > >  if((pid = fork()) == -1)
> > > perror("fork error");
> > >  else if(pid == 0) {
> > > execv("/sbin/crm_mon", parmList);
> > > printf("execv error");
> > >  }
> > >
> > >
> > > If I run this with a user other than cmadmin, crm_mon fails. I tested
> > > with another executable to make sure effective user id is passed in
> > > correctly and it worked fine.
> > >
> > > Checking the trace, we fail here with eacces permission denied:
> > > |(crm_ipc_send)   trace: Sending cib_ro IPC request 5 of 191 bytes
> > > using 12ms timeout|
> > > |(internal_ipc_get_reply) trace: client cib_ro waiting on reply to msg
> > > id 5|
> > > |(crm_ipc_send)   trace: Received 179-byte reply 5 to cib_ro IPC 5:
> > >  > > cib_clientid="f58912bf-cab6-4d1b-9025-701fc147c|
> > > |(cib_native_perform_op_delegate) trace: Reply> > cib_op="cib_query" cib_callid="2"
> > > cib_clientid="f58912bf-cab6-4d1b-9025-701fc147c6cd" cib_callopt="4352"
> > > *cib_rc="-13"*/>|
> > >
> > > I tested with other pacemaker commands and got similar results. I’ve
> > > also tried adding users to haclient group (not to acls just to the
> > > group) with no success.
> > >
> > > Is it not possible to change effective uids and call crm executables?
> > > If so why and is there way I can achieve what I need differently?
> > Are you running with selinux enforcing?
> > Not saying you shouldn't - just to narrow down ...
> >
> > Klaus
> > >
> > > Thank you,
> > > Alex
> > >
> > >
> > > *Alex Zarifoglu*
> > > Software Developer *|* *Db2* pureScale
> > >
> > >
> > > ___
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > ClusterLabs home: https://www.clusterlabs.org/
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
>
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Calling crm executables via effective uid

2021-01-07 Thread Reid Wahl
For whatever reason, the IPC from the crm_mon client to the CIB
manager is getting opened with the real UID ("testuser" in my case)
instead of the effective UID. The CIB manager checks this unprivileged
user against the ACL list and pre-filters the entire CIB, causing a
"Permission denied" error.

What I haven't figured out yet (if I even keep going down this rabbit
hole) is why the IPC is attached to the real UID even though the
executable is owned by cmadmin with the setuid bit enabled.
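
(For anyone following along at home: a quick, untested way to see what the
kernel reports for the child process -- though it says nothing about the
credentials the IPC socket itself carries -- would be something like the
following.)

~~~
# Compare the real vs. effective UID of the running crm_mon child
ps -o pid,ruser,euser,comm -C crm_mon
~~~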

On Mon, Dec 14, 2020 at 4:41 AM Klaus Wenninger  wrote:
>
> On 12/11/20 10:20 PM, Alex Zarifoglu wrote:
> > Hello,
> >
> > I have question regarding the running crm commands with the effective uid.
> >
> > I am trying to create a tool to manage pacemaker resources for
> > multiple users. For security reasons, these users will only be able to
> > create/delete/manage resources that can impact that specific user
> > only. I cannot achieve this via ACLs because it is not possible to
> > enforce every user to only create primitives with certain parameters,
> > rules etc.
> >
> > Therefore, I created a user called cmadmin which has full write access
> > to the cib. And created an executable which is owned by this user and
> > has the setuid and setgid bits set.
> >
> > -r-sr-s--x   1 cmadmin cmadmin 24248 Dec 11 07:04 cmexc
> >
> > Within this executable I have the code:
> >
> >  pid_tpid;
> >  char*constparmList[] = {"/sbin/crm_mon", "-1", "-VVV", NULL};
> >
> >  if((pid = fork()) == -1)
> > perror("fork error");
> >  else if(pid == 0) {
> > execv("/sbin/crm_mon", parmList);
> > printf("execv error");
> >  }
> >
> >
> > If I run this with a user other than cmadmin, crm_mon fails. I tested
> > with another executable to make sure effective user id is passed in
> > correctly and it worked fine.
> >
> > Checking the trace, we fail here with eacces permission denied:
> > |(crm_ipc_send)   trace: Sending cib_ro IPC request 5 of 191 bytes
> > using 12ms timeout|
> > |(internal_ipc_get_reply) trace: client cib_ro waiting on reply to msg
> > id 5|
> > |(crm_ipc_send)   trace: Received 179-byte reply 5 to cib_ro IPC 5:
> >  > cib_clientid="f58912bf-cab6-4d1b-9025-701fc147c|
> > |(cib_native_perform_op_delegate) trace: Reply> cib_op="cib_query" cib_callid="2"
> > cib_clientid="f58912bf-cab6-4d1b-9025-701fc147c6cd" cib_callopt="4352"
> > *cib_rc="-13"*/>|
> >
> > I tested with other pacemaker commands and got similar results. I’ve
> > also tried adding users to haclient group (not to acls just to the
> > group) with no success.
> >
> > Is it not possible to change effective uids and call crm executables?
> > If so why and is there way I can achieve what I need differently?
> Are you running with selinux enforcing?
> Not saying you shouldn't - just to narrow down ...
>
> Klaus
> >
> > Thank you,
> > Alex
> >
> >
> > *Alex Zarifoglu*
> > Software Developer *|* *Db2* pureScale
> >
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



--
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] DRBD ms resource keeps getting demoted

2021-01-18 Thread Reid Wahl
Can you share the cluster configuration (e.g., `pcs config` or the CIB)?
And are there any additional LogAction messages after that one (e.g.,
Promote for node01)?

On Mon, Jan 18, 2021 at 7:47 PM Stuart Massey  wrote:

> So, we have a 2-node cluster with a quorum device. One of the nodes
> (node1) is having some trouble, so we have added constraints to prevent any
> resources migrating to it, but have not put it in standby, so that drbd in
> secondary on that node stays in sync. The problems it is having lead to OS
> lockups that eventually resolve themselves - but that causes it to be
> temporarily dropped from the cluster by the current master (node2).
> Sometimes when node1 rejoins, then node2 will demote the drbd ms resource.
> That causes all resources that depend on it to be stopped, leading to a
> service outage. They are then restarted on node2, since they can't run on
> node1 (due to constraints).
> We are having a hard time understanding why this happens. It seems like
> there may be some sort of DC contention happening. Does anyone have any
> idea how we might prevent this from happening?
> Selected messages (de-identified) from pacemaker.log that illustrate
> suspicion re DC confusion are below. The update_dc and
> abort_transition_graph re deletion of lrm seem to always precede the
> demotion, and a demotion seems to always follow (when not already demoted).
>
> Jan 18 16:52:17 [21938] node02.example.com   crmd: info:
> do_dc_takeover:Taking over DC status for this partition
> Jan 18 16:52:17 [21938] node02.example.com   crmd: info:
> update_dc: Set DC to node02.example.com (3.0.14)
> Jan 18 16:52:17 [21938] node02.example.com   crmd: info:
> abort_transition_graph:Transition aborted by deletion of
> lrm[@id='1']: Resource state removal | cib=0.89.327
> source=abort_unless_down:357
> path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
> Jan 18 16:52:19 [21937] node02.example.com   pengine: info:
> master_color:  ms_drbd_ourApp: Promoted 0 instances of a possible 1 to
> master
> Jan 18 16:52:19 [21937] node02.example.com   pengine:   notice:
> LogAction:  * Demote drbd_ourApp:1 (Master -> Slave
> node02.example.com )
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: What's a "transition", BTW?

2021-01-18 Thread Reid Wahl
76.bz2
Jan 18 23:12:13 fastvm-rhel-8-0-23 pacemaker-schedulerd[7699]
(pcmk__log_transition_summary@pcmk_sched_allocate.c:2897) notice:
Calculated transition 1007, saving inputs in
/var/lib/pacemaker/pengine/pe-input-376.bz2


> I could imagine reusing
> the last number if the last transition had no actions other than
> monitor/probe.
> Of course that would not work if inputs are interleaved (the next begins
> before the last one has finished).
>
> Regards,
> Ulrich
>
>
> > ‑‑
> > Ken Gaillot 
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Observed Difference Between ldirectord and keepalived

2021-01-02 Thread Reid Wahl
Hi, Eric.

To the best of my knowledge, neither ldirectord nor keepalived is part of
the ClusterLabs project. It looks like the keepalived user group is here:
https://www.keepalived.org/listes.html

Is there anything we can help you with regarding the ClusterLabs software?



On Sat, Jan 2, 2021 at 1:12 AM Eric Robinson 
wrote:

> We recently switched from ldirectord to keepalived. We noticed that, after
> the switch, LVS behaves a bit differently with respect to “down” services.
>
>
>
> On ldirectord, a virtual service with 2 realservers still displays the down
> realserver, with weight 0 (“Masq ... 0”), when one of them is down.
>
>
>
> TCP  192.168.5.100:3002 wlc persistent 50
>
>   -> 192.168.8.53:3002             Masq    1      0          0
>
>   -> 192.168.8.55:3002             Masq    0      0          4
>
>
>
> On keepalived, it does not show the down server at all…
>
>
>
> TCP  192.168.5.100:3002 wlc persistent 50
>
>   -> 192.168.8.53:3002             Masq    1      0          0
>
>
>
> Why is that? It makes it impossible to see when services are down.
>
>
>
>
>
>
> Disclaimer : This email and any files transmitted with it are confidential
> and intended solely for intended recipients. If you are not the named
> addressee you should not disseminate, distribute, copy or alter this email.
> Any views or opinions presented in this email are solely those of the
> author and might not represent those of Physician Select Management.
> Warning: Although Physician Select Management has taken reasonable
> precautions to ensure no viruses are present in this email, the company
> cannot accept responsibility for any loss or damage arising from the use of
> this email or attachments.
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Fencing explanation

2021-01-04 Thread Reid Wahl
On Mon, Jan 4, 2021 at 4:56 AM Ulrich Windl
 wrote:
>
> >>> Ignazio Cassano  schrieb am 28.12.2020 um 22:21 
> >>> in
> Nachricht
> :
> > Hello all, I am setting a pacemaker cluster with centos 7 and ipmi idrac
> > fencing devices.
> > What I did not understand is how set the number of seconds before a node is
> > rebooted by stonith.
>
> Actually the only reason for a delay IMHO would be:
> 1) You want to give dirty blocks a chance to be written, especially if 
> multiple resource operations had been initiated before fencing
> 2) You want to have a look at the situation before fencing (maybe trying to 
> clean it up to avoid fencing)
>
> Unsure cleaning up in 2) can really avoid fencing, but I think in HP-UX 
> Service Guard it was possible.

The most common reason I've seen for setting a delay is to avoid a
fence race in a two-node cluster. A user can set a delay for one
node's stonith device so that both nodes don't power each other off at
the same time in the event of a network disruption.
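
For example, in a hypothetical two-node cluster with one fence device per node
(untested sketch; the device name is made up):

~~~
# Delay fencing actions that target node A by 15s. If both nodes try to fence
# each other at once, node B is fenced first and node A survives the race.
pcs stonith update fence_nodeA pcmk_delay_base=15s
~~~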

> > If the cluster is made up 3 nodes (A, B, C) if the node C is unreacheable
> > (for example have network cards corrupetd) after how many second is
> > rebooted by stonith ?
> > Which is the parameter to set the number of seconds?
> > Sorry for my bad english
> > Thanks
> > Ignazio
>
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker 2.0.5 version pcs status resources command is not working

2021-02-03 Thread Reid Wahl
Looks like pcs-0.9 isn't fully compatible with pacemaker >= 2.0.3.
  -
https://github.com/ClusterLabs/pcs/commit/0cf06b79f6dcabb780ee1fa7fee0565d73789329

The resource_status() function in older pcs versions doesn't match the
lines in the crm_mon output of newer pacemaker versions.

On Wed, Feb 3, 2021 at 9:10 AM S Sathish S  wrote:

> Hi Team,
>
>
>
> In latest pacemaker version 2.0.5 we are not getting "pcs status resource"
> command output but in older version we used to get the output.
>
>
>
> Kindly let us know any already command to get pcs full list resource.
>
>
>
> *Latest Pacemaker version* :
>
> pacemaker-2.0.5 -->
> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.5
>
> corosync-2.4.4 -->  https://github.com/corosync/corosync/tree/v2.4.4
>
> pcs-0.9.169
>
>
>
> [root@node2 ~]# pcs status resources
>
> [root@node2 ~]#
>
>
>
> *Older Pacemaker version* :
>
>
>
> pacemaker-2.0.2 -->
> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2
>
> corosync-2.4.4 -->  https://github.com/corosync/corosync/tree/v2.4.4
>
> pcs-0.9.169
>
>
>
> [root@node1 ~]# pcs status resources
>
> TOMCAT_node1 (ocf::provider:TOMCAT_RA):  Started node1
>
> HEALTHMONITOR_node1  (ocf::provider:HealthMonitor_RA):   Started node1
>
> SNMP_node1   (ocf::pacemaker:ClusterMon):Started node1
>
> [root@node1 ~]#
>
>
>
> Thanks and Regards,
>
> S Sathish S
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker 2.0.5 version pcs status resources command is not working

2021-02-03 Thread Reid Wahl
With that in mind, I'd suggest upgrading to a newer pcs version if
possible. If not, then you may have to do something more hack-y, like `pcs
status | grep '(.*:.*):'`.
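
Or bypass pcs entirely and ask crm_mon, which is what pcs calls under the hood
(untested sketch):

~~~
# -1 prints the cluster status once and exits
crm_mon -1 | grep '(.*:.*):'
~~~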

On Wed, Feb 3, 2021 at 1:26 PM Reid Wahl  wrote:

> Looks like pcs-0.9 isn't fully compatible with pacemaker >= 2.0.3.
>   -
> https://github.com/ClusterLabs/pcs/commit/0cf06b79f6dcabb780ee1fa7fee0565d73789329
>
> The resource_status() function in older pcs versions doesn't match the
> lines in the crm_mon output of newer pacemaker versions.
>
> On Wed, Feb 3, 2021 at 9:10 AM S Sathish S 
> wrote:
>
>> Hi Team,
>>
>>
>>
>> In latest pacemaker version 2.0.5 we are not getting "pcs status
>> resource" command output but in older version we used to get the output.
>>
>>
>>
>> Kindly let us know any already command to get pcs full list resource.
>>
>>
>>
>> *Latest Pacemaker version* :
>>
>> pacemaker-2.0.5 -->
>> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.5
>>
>> corosync-2.4.4 -->  https://github.com/corosync/corosync/tree/v2.4.4
>>
>> pcs-0.9.169
>>
>>
>>
>> [root@node2 ~]# pcs status resources
>>
>> [root@node2 ~]#
>>
>>
>>
>> *Older Pacemaker version* :
>>
>>
>>
>> pacemaker-2.0.2 -->
>> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2
>>
>> corosync-2.4.4 -->  https://github.com/corosync/corosync/tree/v2.4.4
>>
>> pcs-0.9.169
>>
>>
>>
>> [root@node1 ~]# pcs status resources
>>
>> TOMCAT_node1 (ocf::provider:TOMCAT_RA):  Started node1
>>
>> HEALTHMONITOR_node1  (ocf::provider:HealthMonitor_RA):   Started node1
>>
>> SNMP_node1   (ocf::pacemaker:ClusterMon):Started node1
>>
>> [root@node1 ~]#
>>
>>
>>
>> Thanks and Regards,
>>
>> S Sathish S
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
>
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] pcs status command output consist of * in each line , is this expected behavior

2021-02-03 Thread Reid Wahl
On Wed, Feb 3, 2021 at 1:50 AM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> S Sathish S  schrieb am 02.02.2021 um 07:20
> in
> Nachricht
> <
> dbbpr07mb73888356f8692d07d195ffb6d5...@dbbpr07mb7388.eurprd07.prod.outlook.com
> >
>
> > Hi Team,
> >
> > we have taken latest pacemaker version after that we found pcs status
> > command output consist of * in each line , is this expected behavior.
> >
> > https://github.com/ClusterLabs/pacemaker/tree/Pacemaker‑2.0.5
> >
> > pcs status command output :
> >
> > Cluster name: TEST
> > Cluster Summary:
> >   * Stack: corosync
> >   * Current DC: node1 (version 2.0.5‑ba59be7122) ‑ partition with quorum
> >   * Last updated: Tue Feb  2 06:20:14 2021
> >   * Last change:  Mon Feb  1 19:41:27 2021 by hacluster via crmd on node1
> >   * 1 node configured
> >   * 12 resource instances configured
> >
> > Node List:
> >   * Online: [ node1 ]
> >
> > Full List of Resources:
> >   * TOMCAT_node1  (ocf::provider:TOMCAT_RA):   Started node1
> >   * HEALTHMONITOR_node1   (ocf::provider:HealthMonitor_RA):
> > Started node1
> >   * SNMP_node1(ocf::pacemaker:ClusterMon): Started node1
> >
> > Daemon Status:
> >   corosync: active/enabled
> >   pacemaker: active/enabled
> >   pcsd: active/enabled
>
> To me it seems as if "Daemon Status" lacks the stars ;-)
>

Yep, pcs status calls crm_mon to do a lot of the work. The bulleted lists
that have stars are the output of crm_mon. The "Daemon Status" section is
appended by pcs after crm_mon has run; it's not part of crm_mon.

I wonder if there's any plan to add stars to pcs indentation, to match the
crm_mon output.

> Actually crm_mon outputs those stars:
>
> # crm_mon -1Arfj
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: h16 (version
> 2.0.4+20200616.2deceaa3a-3.3.1-2.0.4+20200616.2deceaa3a) - partition with
> quorum
>   * Last updated: Wed Feb  3 09:52:39 2021
>   * Last change:  Wed Feb  3 09:52:36 2021 by hacluster via crmd on h16
> ...
>
> >
> >
> > Thanks and Regards,
> > S Sathish S
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pcs status command output consist of * in each line , is this expected behavior

2021-02-02 Thread Reid Wahl
Looks like it.

https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.5/tools/crm_mon_curses.c#L295-L301

On Tue, Feb 2, 2021 at 9:36 AM S Sathish S  wrote:

> Hi Team,
>
>
>
> we have taken latest pacemaker version after that we found pcs status
> command output consist of * in each line , is this expected behavior.
>
>
>
> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.5
>
>
>
> pcs status command output :
>
>
>
> Cluster name: TEST
>
> Cluster Summary:
>
>   * Stack: corosync
>
>   * Current DC: node1 (version 2.0.5-ba59be7122) - partition with quorum
>
>   * Last updated: Tue Feb  2 06:20:14 2021
>
>   * Last change:  Mon Feb  1 19:41:27 2021 by hacluster via crmd on node1
>
>   * 1 node configured
>
>   * 12 resource instances configured
>
>
>
> Node List:
>
>   * Online: [ node1 ]
>
>
>
> Full List of Resources:
>
>   * TOMCAT_node1  (ocf::provider:TOMCAT_RA):   Started node1
>
>   * HEALTHMONITOR_node1   (ocf::provider:HealthMonitor_RA):
> Started node1
>
>   * SNMP_node1(ocf::pacemaker:ClusterMon): Started node1
>
>
>
> Daemon Status:
>
>   corosync: active/enabled
>
>   pacemaker: active/enabled
>
>   pcsd: active/enabled
>
>
>
>
>
> Thanks and Regards,
>
> S Sathish S
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Q: starting systemd resources

2021-02-05 Thread Reid Wahl
Hi, Ulrich. I presume you're talking about the log_finished line, which
looks like this in pacemaker.log:

Feb 05 01:48:47.192 fastvm-rhel-8-0-23 pacemaker-execd [15446]
(log_finished@execd_commands.c:214)  info: dummy start (call 23, PID 18743)
exited with status 0 (execution time 11ms, queue time 0ms)

Is that correct?

I got curious and took a look. It looks like this comes down to the
action_complete() function:
  -
https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.5/daemons/execd/execd_commands.c#L896-L1050

This function calls cmd_finalize() at the end, which calls a chain of
functions ending in log_finished(). log_finished() logs the "start ...
exited with status" line shown above.

However, systemd resources have to be handled a bit differently. We have to
schedule a monitor operation to run after the start operation, in order to
check whether the start was truly a success.
~~~
if (pcmk__str_eq(rclass, PCMK_RESOURCE_CLASS_SYSTEMD, pcmk__str_casei)) {
    if ((cmd->exec_rc == PCMK_OCF_OK)
        && pcmk__strcase_any_of(cmd->action, "start", "stop", NULL)) {
        /* systemd returns from start and stop actions after the action
         * begins, not after it completes. We have to jump through a few
         * hoops so that we don't report 'complete' to the rest of pacemaker
         * until it's actually done.
         */
        goagain = true;
        cmd->real_action = cmd->action;
        cmd->action = strdup("monitor");
...
    if (goagain) {
...
        schedule_lrmd_cmd(rsc, cmd);

        /* Don't finalize cmd, we're not done with it yet */
        return;
~~~

So for the start operation, it never reaches the cmd_finalize() call at the
end, until the follow-up monitor runs. The follow-up monitor operation does
end up calling cmd_finalize() at the end of action_complete. But the
"log_finished" message is logged at debug level for monitor operations. So
you won't see it unless debugging is enabled.

Does this make sense?

Example:
~~~
Feb 05 02:06:51.123 fastvm-rhel-8-0-23 pacemaker-execd [19354]
(log_execute)info: executing - rsc:nfs-daemon action:start
call_id:20
Feb 05 02:06:51.123 fastvm-rhel-8-0-23 pacemaker-execd [19354]
(systemd_unit_exec)  debug: Performing asynchronous start op on systemd
unit nfs-server.service named 'nfs-daemon'
Feb 05 02:06:51.124 fastvm-rhel-8-0-23 pacemaker-execd [19354]
(systemd_unit_exec_with_unit)debug: Calling StartUnit for
nfs-daemon: /org/freedesktop/systemd1/unit/nfs_2dserver_2eservice
Feb 05 02:06:51.517 fastvm-rhel-8-0-23 pacemaker-execd [19354]
(action_complete)debug: nfs-daemon start may still be in progress:
re-scheduling (elapsed=394ms, remaining=99606ms, start_delay=2000ms)
Feb 05 02:06:53.518 fastvm-rhel-8-0-23 pacemaker-execd [19354]
(log_execute)debug: executing - rsc:nfs-daemon action:monitor
call_id:20
Feb 05 02:06:53.518 fastvm-rhel-8-0-23 pacemaker-execd [19354]
(systemd_unit_exec)  debug: Performing asynchronous status op on systemd
unit nfs-server.service named 'nfs-daemon'
Feb 05 02:06:53.521 fastvm-rhel-8-0-23 pacemaker-execd [19354]
(action_complete)debug: nfs-daemon systemd start is now complete
(elapsed=2397ms, remaining=97603ms): ok (0)
Feb 05 02:06:53.521 fastvm-rhel-8-0-23 pacemaker-execd [19354]
(log_finished)   debug: nfs-daemon monitor (call 20) exited with status
0 (execution time 2397ms, queue time 0ms)
~~~
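
If you do want to see that debug-level log_finished line for the follow-up
monitor, enabling debug logging should be enough -- a rough sketch, assuming
the usual sysconfig location on your distro:

~~~
# /etc/sysconfig/pacemaker (or /etc/default/pacemaker on Debian-based systems)
PCMK_debug=yes

# Pacemaker reads this at daemon startup, so restart cluster services on the
# node afterward (note: this will stop or move resources on that node).
systemctl restart pacemaker
~~~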



On Thu, Feb 4, 2021 at 11:25 PM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> Hi!
>
> While analyzing cluster problems I noticed this:
> Normal resources executed via OCF RAs create two log entries by
> pacemaker-execd: One when starting the resource and another when the
> resource completed starting.
> However for systemd units I only get a start message. Is that intentional?
> Does that mean systemd starts are asynchronous in general (i.e., the
> process returns before the start is complete)?
> (Still I get a completed message from pacemaker-controld)
>
> Example:
> Feb 04 15:41:25 h19 pacemaker-execd[7793]:  notice: executing -
> rsc:prm_virtlockd action:start call_id:95
> Feb 04 15:41:27 h19 pacemaker-execd[7793]:  notice: executing -
> rsc:prm_libvirtd action:start call_id:97
>
> So one could guess that virtlockd and libvirtd were starting concurrently,
> but they did not, because of this sequence:
> Feb 04 15:41:27 h19 pacemaker-controld[7796]:  notice: Result of start
> operation for prm_virtlockd on h19: ok
> Feb 04 15:41:27 h19 pacemaker-execd[7793]:  notice: executing -
> rsc:prm_libvirtd action:start call_id:97
>
> Regards,
> Ulrich
>
>
> _______
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>

Re: [ClusterLabs] Antw: [EXT] Re: Help: Cluster resource relocating to rebooted node automatically

2021-02-11 Thread Reid Wahl
 (most likely "pcs resource move"). When you "move" a resource,
> >> you're actually telling the cluster to prefer a specific node, and it
> >> remembers that preference until you tell it otherwise. You can remove
> >> the preference with "pcs resource clear" (or equivalently crm_resource
> >> --clear).
> >>
> >> I see your resources have resource-stickiness=1. That is how much
> >> preference an active resource has for the node that it is currently on.
> >> You can also see the above constraint has a score of INFINITY. If the
> >> scores were set such that the stickiness was higher than the
> >> constraint, then the stickiness would win and the resource would stay
> >> put.
> >>
> >> > Ordering Constraints:
> >> > Colocation Constraints:
> >> > Ticket Constraints:
> >> >
> >> > Alerts:
> >> >  No alerts defined
> >> >
> >> > Resources Defaults:
> >> >  resource-stickiness=1000
> >> > Operations Defaults:
> >> >  No defaults set
> >> >
> >> > Cluster Properties:
> >> >  cluster-infrastructure: corosync
> >> >  cluster-name: EMS
> >> >  dc-version: 2.0.2-3.el8-744a30d655
> >> >  have-watchdog: false
> >> >  last-lrm-refresh: 1612951127
> >> >  symmetric-cluster: true
> >> >
> >> > Quorum:
> >> >   Options:
> >> >
> >> > --
> >> >
> >> > Regards,
> >> > Ben
> >> >
> >> >
> >> > ___
> >> > Manage your subscription:
> >> > https://lists.clusterlabs.org/mailman/listinfo/users
> >> >
> >> > ClusterLabs home: https://www.clusterlabs.org/
> >> --
> >> Ken Gaillot 
> >>
> >>
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Help: Cluster resource relocating to rebooted node automatically

2021-02-11 Thread Reid Wahl
On Thu, Feb 11, 2021 at 1:35 AM Reid Wahl  wrote:

>
>
> On Thu, Feb 11, 2021 at 12:35 AM Ulrich Windl <
> ulrich.wi...@rz.uni-regensburg.de> wrote:
>
>> >>> "Ben .T.George"  schrieb am 10.02.2021 um
>> 16:14 in
>> Nachricht
>> :
>> > HI
>> >
>> > thanks for the Help and i have done "pcs resource clear" and tried the
>> same
>> > method again, now the resource is not going back.
>>
>
> To be perfectly clear, did you run `pcs resource clear ems_rg`? That's the
> full command line to remove the cli-prefer-ems_rg constraint.
>

I'm sorry. I had misread your message -- I thought you were saying the
issue was still occurring. I'm glad it's fixed now :)

>
>> > One more thing I noticed is that my service was from systemd and I have
>> > created a custom systemd.service file.
>> >
>> > If i freeze the resource group, start and stop the service my using
>> > systemctl, is happening immediately
>> >
>> > When I reboot the active node, the cluster is trying to stop the
>> service,
>> > it is taking around 1 minutes to stop the service. and at the same time
>> if
>> > i check the vm console, the shutdown of the vm process is stuck for some
>> > time for stopping high availability services.
>>
>> To give any advice on that we need details, typically logs.
>>
>
> +1. Generally, a snippet from /var/log/pacemaker/pacemaker.log (on
> pacemaker version 2) or /var/log/cluster/corosync.log (on pacemaker version
> 1) is ideal. In some cases, system logs (e.g., /var/log/messages or
> journalctl output) can also be helpful.
>
>>
>> >
>> > Sorry for asking this, i am very new to this cluster
>> >
>> > Regards,
>> > Ben
>> >
>> > Is this the expected behaviour?
>> >
>> > On Wed, Feb 10, 2021 at 8:53 PM Ken Gaillot 
>> wrote:
>> >
>> >> On Wed, 2021-02-10 at 17:21 +0300, Ben .T.George wrote:
>> >> > HI
>> >> >
>> >> > I have created PCS based 2 node cluster on centos 7 almost
>> >> > everything is working fine,
>> >> >
>> >> > My client machine is on vmware and when I reboot the active node, the
>> >> > service group is relocating to the passive node and the resources are
>> >> > starting fine(one IP and application).
>> >> >
>> >> > But whenever the other node reboots and joins back to the cluster,
>> >> > the resources are moved back to that node.
>> >> >
>> >> > please find below config :
>> >> > 
>> >> > Cluster Name: EMS
>> >> > Corosync Nodes:
>> >> >  zkwemsapp01.example.com zkwemsapp02.example.com
>> >> > Pacemaker Nodes:
>> >> >  zkwemsapp01.example.com zkwemsapp02.example.com
>> >> >
>> >> > Resources:
>> >> >  Group: ems_rg
>> >> >   Resource: ems_vip (class=ocf provider=heartbeat type=IPaddr2)
>> >> >Attributes: cidr_netmask=24 ip=10.96.11.39
>> >> >Meta Attrs: resource-stickiness=1
>> >> >Operations: monitor interval=30s (ems_vip-monitor-interval-30s)
>> >> >start interval=0s timeout=20s (ems_vip-start-interval-
>> >> > 0s)
>> >> >stop interval=0s timeout=20s (ems_vip-stop-interval-
>> >> > 0s)
>> >> >   Resource: ems_app (class=systemd type=ems-app)
>> >> >Meta Attrs: resource-stickiness=1
>> >> >Operations: monitor interval=60 timeout=100 (ems_app-monitor-
>> >> > interval-60)
>> >> >start interval=0s timeout=100 (ems_app-start-interval-
>> >> > 0s)
>> >> >stop interval=0s timeout=100 (ems_app-stop-interval-
>> >> > 0s)
>> >> >
>> >> > Stonith Devices:
>> >> >  Resource: ems_vmware_fence (class=stonith type=fence_vmware_soap)
>> >> >   Attributes: ip=10.151.37.110 password=!CM4!!6j7yiApFT
>> >> > pcmk_host_map=zkwemsapp01.example.com:
>> ZKWEMSAPP01;zkwemsapp02.example
>> >> > .com:ZKWEMSAPP02 ssl_insecure=1 username=mtc_tabs\redhat.fadmin
>> >> >   Operations: monitor interval=60s (ems_vmware_fence-monitor-
>> >> > interval-60s)
>> >> > Fencing Levels:
>> >> >   Target: zkwemsapp01.example.c

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Order set troubles

2021-03-26 Thread Reid Wahl
On Fri, Mar 26, 2021 at 6:27 AM Andrei Borzenkov 
wrote:

> On Fri, Mar 26, 2021 at 10:17 AM Ulrich Windl
>  wrote:
> >
> > >>> Andrei Borzenkov  schrieb am 26.03.2021 um
> 06:19 in
> > Nachricht <534274b3-a6de-5fac-0ae4-d02c305f1...@gmail.com>:
> > > On 25.03.2021 21:45, Reid Wahl wrote:
> > >> FWIW we have this KB article (I seem to remember Strahil is a Red Hat
> > >> customer):
> > >>   - How do I configure SAP HANA Scale-Up System Replication in a
> Pacemaker
> > >> cluster when the HANA filesystems are on NFS shares?(
> > >> https://access.redhat.com/solutions/5156571)
> > >>
> > >
> > > "How do I make the cluster resources recover when one node loses access
> > > to the NFS server?"
> > >
> > > If node loses access to NFS server then monitor operations for
> resources
> > > that depend on NFS availability will fail or timeout and pacemaker will
> > > recover (likely by rebooting this node). That's how similar
> > > configurations have been handled for the past 20 years in other HA
> > > managers. I am genuinely interested, have you encountered the case
> where
> > > it was not enough?
> >
> > That's a big problem with the SAP design (basically it's just too
> complex).
> > In the past I had written a kind of resource agent that worked without
> that
> > overly complex overhead, but since those days SAP has added much more
> > complexity.
> > If the NFS server is external, pacemaker could fence your nodes when the
> NFS
> > server is down as first the monitor operation will fail (hanging on
> NFS), the
> > the recover (stop/start) will fail (also hanging on NFS).
>
> And how exactly placing NFS resource under pacemaker control is going
> to change it?
>

I noted earlier based on the old case notes:

"Apparently there were situations in which the SAPHana resource wasn't
failing over when connectivity was lost with the NFS share that contained
the hdb* binaries and the HANA data. I don't remember the exact details
(whether demotion was failing, or whether it wasn't even trying to demote
on the primary and promote on the secondary, or what). Either way, I was
surprised that this procedure was necessary, but it seemed to be."

Strahil may be dealing with a similar situation, not sure. I get where
you're coming from -- I too would expect the application that depends on
NFS to simply fail when NFS connectivity is lost, which in turn leads to
failover and recovery. For whatever reason, due to some weirdness of the
SAPHana resource agent, that didn't happen.


> > Even when fencing the
> > node it would not help (resources cannot start) if the NFS server is
> still
> > down.
>
> And how exactly placing NFS resource under pacemaker control is going
> to change it?
>
> > So you may end up with all your nodes being fenced and the fail counts
> > disabling any automatic resource restart.
> >
>
> And how exactly placing NFS resource under pacemaker control is going
> to change it?
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Order set troubles

2021-03-26 Thread Reid Wahl
On Fri, Mar 26, 2021 at 4:06 AM Strahil Nikolov 
wrote:

> Thanks everyone! I really appreciate your help.
>
> Actually , I found a RH solution (#5423971) that gave me enough ideas  /it
> is missing some steps/ to setup the cluster prooperly.
>

Careful. That solution is for Scale-Out. The solution I gave you[1] is a
similar procedure intended for HANA in a Scale-Up configuration. Use
whichever one is appropriate to your deployment. I didn't think about
Scale-Out at first, because most customers I interact with use Scale-Up.

[1] https://access.redhat.com/solutions/5156571


> So far , I have never used node attributes, order sets and location
> constraints based on 'ocf:pacemaker: attribute's active/inactive values .
>
> I can say that I have learned alot today.
>
>
> Best Regards,
> Strahil Nikolov
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Community adoption of PAF vs pgsql

2021-03-26 Thread Reid Wahl
If you have an enterprise support agreement, be sure to also explore
whether your vendor supports one and not the other. For example, Red Hat
currently supports pgsql but not PAF (though there is an open BZ to add
support for PAF).


On Fri, Mar 26, 2021 at 9:14 AM Jehan-Guillaume de Rorthais 
wrote:

> Hi,
>
> I'm one of the PAF author, so I'm biased.
>
> On Fri, 26 Mar 2021 14:51:28 +
> Isaac Pittman  wrote:
>
> > My team has the opportunity to update our PostgreSQL resource agent to
> either
> > PAF (https://github.com/ClusterLabs/PAF) or pgsql
> > (
> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
> ),
> > and I've been charged with comparing them.
>
> In my opinion, you should spend time to actually build some "close-to-prod"
> clusters and train them. Then you'll be able to choose base on some team
> experience.
>
> Both agent have very different spirit and very different administrative
> tasks.
>
> Break your cluster, make some switchover, some failover, how to failback a
> node
> and so on.
>
> > After searching various mailing lists and reviewing the code and
> > documentation, it seems like either could suit our needs and both are
> > actively maintained.
> >
> > One factor that I couldn't get a sense of is community support and
> adoption:
> >
> >   *   Does PAF or pgsql enjoy wider community support or adoption,
> especially
> > for new projects? (I would expect many older projects to be on pgsql due
> to
> > its longer history.)
>
> Sadly, I have absolutely no clues...
>
> >   *   Does either seem to be on the road to deprecation?
>
> PAF is not on its way to deprecation, I have a pending TODO list for it.
>
> I would bet pgsql is not on its way to deprecation either, but I can't
> speak
> for the real authors.
>
> Regards,
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?

2021-03-26 Thread Reid Wahl
On Fri, Mar 26, 2021 at 2:44 PM Antony Stone 
wrote:

> On Friday 26 March 2021 at 18:31:51, Ken Gaillot wrote:
>
> > On Fri, 2021-03-26 at 19:59 +0300, Andrei Borzenkov wrote:
> > > On 26.03.2021 17:28, Antony Stone wrote:
> > > >
> > > > So far all is well and good, my cluster synchronises, starts the
> > > > resources, and everything's working as expected.  It'll move the
> > > > resources from one cluster member to another (either if I ask it to,
> or
> > > > if there's a problem), and it seems to work just as the older version
> > > > did.
> >
> > I'm glad this far was easy :)
>
> Well, I've been using corosync & pacemaker for some years now; I've got
> used
> to some of their quirks and foibles :)
>
> Now I just need to learn about the new ones for the newer versions...
>
> > It's worth noting that pacemaker itself doesn't try to validate the
> > agent meta-data, it just checks for the pieces that are interesting to
> > it and ignores the rest.
>
> I guess that's good, so long as what it does pay attention to is what it
> wants
> to see?
>
> > It's also worth noting that the OCF 1.0 standard is horribly outdated
> > compared to actual use, and the OCF 1.1 standard is being adopted today
> > (!) after many years of trying to come up with something more up-to-
> > date.
>
> So, is ocf-tester no longer the right tool I should be using to check this
> sort of thing?  What should I be doing instead to make sure my configuration
> is valid / acceptable to pacemaker?
>
> > Bottom line, it's worth installing xmllint to see if that helps, but I
> > wouldn't worry about meta-data schema issues.
>
> Well, as stated in my other reply to Andrei, I now get:
>
> /usr/lib/ocf/resource.d/heartbeat/asterisk passed all tests
>
> /usr/lib/ocf/resource.d/heartbeat/anything passed all tests
>
> so I guess it means my configuration file is okay, and I need to look
> somewhere else to find out why pacemaker 2.0.1 is throwing wobblies with exactly
> the
> same resources that pacemaker 1.1.16 can manage quite happily and stably...
>
> > > Either agent does not run as root or something blocks chown. Usual
> > > suspects are apparmor or SELinux.
> >
> > Pacemaker itself can also return this error in certain cases, such as
> > not having permissions to execute the agent. Check the pacemaker detail
> > log (usually /var/log/pacemaker/pacemaker.log) and the system log
> > around these times to see if there is more detail.
>
> I've turned on debug logging, but I'm still not sure I'm seeing *exactly*
> what
> the resource agent checker is doing when it gets this failure.
>
> > It is definitely weird that a privileges error would be sporadic.
> > Hopefully the logs can shed some more light.
>
> I've captured a bunch of them this afternoon and will go through them on
> Monday - it's pretty verbose!
>
> > Another possibility would be to set trace_ra=1 on the actions that are
> > failing to get line-by-line info from the agents.
>
> So, that would be an extra parameter to the resource definition in
> cluster.cib?
>
> Change:
>
> primitive Asterisk asterisk meta migration-threshold=3 op monitor
> interval=5
> timeout=30 on-fail=restart failure-timeout=10s
>
> to:
>
> primitive Asterisk asterisk meta migration-threshold=3 op monitor
> interval=5
> timeout=30 on-fail=restart failure-timeout=10s trace_ra=1
>
> ?
>

It's an instance attribute, not a meta attribute. I'm not familiar with
crmsh syntax but trace_ra=1 would go wherever you would configure a
"normal" option, like `ip=x.x.x.x` for an IPaddr2 resource. It will save a
shell trace of each operation to a file in
/var/lib/heartbeat/trace_ra/asterisk. You would then wait for an operation
to fail, find the file containing that operation's trace, and see what it
tells you about the error.
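
As an untested sketch (crm shell configuration syntax, so double-check it
before applying), the primitive would end up looking something like:

~~~
# trace_ra goes under "params" (instance attributes), not "meta"
primitive Asterisk asterisk \
    params trace_ra=1 \
    meta migration-threshold=3 \
    op monitor interval=5 timeout=30 on-fail=restart failure-timeout=10s
~~~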

You might already have some more detail about the error in
/var/log/messages and/or /var/log/pacemaker/pacemaker.log. Look in
/var/log/messages around Fri Mar 26 13:37:08 2021 on the node where the
failure occurred. See if there are any additional messages from the
resource agent, or any stdout or stderr logged by lrmd/pacemaker-execd for
the Asterisk resource.


>
> Antony.
>
> --
> "It is easy to be blinded to the essential uselessness of them by the
> sense of
> achievement you get from getting them to work at all. In other words - and
> this is the rock solid principle on which the whole of the Corporation's
> Galaxy-wide success is founded - their fundamental design flaws are
> completely
> hidden by their superficial design f

Re: [ClusterLabs] Antw: [EXT] Re: Order set troubles

2021-03-25 Thread Reid Wahl
FWIW we have this KB article (I seem to remember Strahil is a Red Hat
customer):
  - How do I configure SAP HANA Scale-Up System Replication in a Pacemaker
cluster when the HANA filesystems are on NFS shares?(
https://access.redhat.com/solutions/5156571)

I can't remember if there was some valid reason why we had to use an
attribute resource, or if we simply didn't think about the sequential=false
require-all=false constraint set approach when planning this out.

On Thu, Mar 25, 2021 at 3:39 AM Strahil Nikolov 
wrote:

> OCF_CHECK_LEVEL 20
> NFS sometimes fails to start (systemd racing condition with dnsmasq)
>
> Best Regards,
> Strahil Nikolov
>
> On Thu, Mar 25, 2021 at 12:18, Andrei Borzenkov
>  wrote:
> On Thu, Mar 25, 2021 at 10:31 AM Strahil Nikolov 
> wrote:
> >
> > Use Case:
> >
> > nfsA is shared filesystem for HANA running in site A
> > nfsB is shared filesystem for HANA running  in site B
> >
> > clusterized resource of type SAPHanaTopology must run on all systems if
> the FS for the HANA is running
> >
>
> And the reason you put NFS under pacemaker control in the first place?
> It is not going to switch over, just put it in /etc/fstab.
>
> > Yet, if siteA dies for some reason, I want to make SAPHanaTopology to
> still start on the nodes in site B.
> >
> > I think that it's a valid use case.
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Thu, Mar 25, 2021 at 8:59, Ulrich Windl
> >  wrote:
> > >>> Ken Gaillot  schrieb am 24.03.2021 um 18:56 in
> > Nachricht
> > <5bffded9c6e614919981dcc7d0b2903220bae19d.ca...@redhat.com>:
> > > On Wed, 2021‑03‑24 at 09:27 +, Strahil Nikolov wrote:
> > >> Hello All,
> > >>
> > >> I have a trouble creating an order set .
> > >> The end goal is to create a 2 node cluster where nodeA will mount
> > >> nfsA , while nodeB will mount nfsB.On top of that a depended cloned
> > >> resource should start on the node only if nfsA or nfsB has started
> > >> locally.
> >
> > This looks like ad odd design to me, and I wonder: What is the use case?
> > (We are using "NFS loop-mounts" for many years, where the cluster needs
> the
> > NFS service it provides, but that's a different design)
> >
> > Regards,
> > Ulrich
> >
> >
> >
> > >>
> > >> A prototype code would be something like:
> > >> pcs constraint order start (nfsA or nfsB) then start resource‑clone
> > >>
> > >> I tried to create a set like this, but it works only on nodeB:
> > >> pcs constraint order set nfsA nfsB resource‑clone
> > >>
> > >> Any idea how to implement that order constraint ?
> > >> Thanks in advance.
> > >>
> > >> Best Regards,
> > >> Strahil Nikolov
> > >
> > > Basically you want two sets, one with nfsA and nfsB with no ordering
> > > between them, and a second set with just resource‑clone, ordered after
> > > the first set.
> > >
> > > I believe the pcs syntax is:
> > >
> > > pcs constraint order set nfsA nfsB sequential=false require‑all=false
> > > set resource‑clone
> > >
> > > sequential=false says nfsA and nfsB have no ordering between them, and
> > > require‑all=false says that resource‑clone only needs one of them.
> > >
> > > (I don't remember for sure the order of the sets in the command, i.e.
> > > whether it's the primary set first or the dependent set first, but I
> > > think that's right.)
> > > ‑‑
> > > Ken Gaillot  > >
> > >
> > > ___
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > ClusterLabs home: https://www.clusterlabs.org/
>
> >
> >
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Order set troubles

2021-03-26 Thread Reid Wahl
On Thu, Mar 25, 2021 at 10:20 PM Andrei Borzenkov 
wrote:

> On 25.03.2021 21:45, Reid Wahl wrote:
> > FWIW we have this KB article (I seem to remember Strahil is a Red Hat
> > customer):
> >   - How do I configure SAP HANA Scale-Up System Replication in a
> Pacemaker
> > cluster when the HANA filesystems are on NFS shares?(
> > https://access.redhat.com/solutions/5156571)
> >
>
> "How do I make the cluster resources recover when one node loses access
> to the NFS server?"
>
> If node loses access to NFS server then monitor operations for resources
> that depend on NFS availability will fail or timeout and pacemaker will
> recover (likely by rebooting this node). That's how similar
> configurations have been handled for the past 20 years in other HA
> managers. I am genuinely interested, have you encountered the case where
> it was not enough?
>

Yes, and I was perplexed by this at the time too.

I just went back and checked the notes from the support case that led to
this article, since it's been nearly a year now. Apparently there were
situations in which the SAPHana resource wasn't failing over when
connectivity was lost with the NFS share that contained the hdb* binaries
and the HANA data. I don't remember the exact details (whether demotion was
failing, or whether it wasn't even trying to demote on the primary and
promote on the secondary, or what). Either way, I was surprised that this
procedure was necessary, but it seemed to be.

The whole situation is a bit of a corner case in the first place. IIRC this
procedure only makes a difference if the primary loses contact with the NFS
server but the secondary can still access the NFS server. I expect that to
be relatively rare. If neither node can access the NFS server, then we're
stuck.


>
> > I can't remember if there was some valid reason why we had to use an
> > attribute resource, or if we simply didn't think about the
> sequential=false
> > require-all=false constraint set approach when planning this out.
> >
>
> Because as I already replied, this has different semantic - it will
> start HANA on both nodes if NFS comes up on any one node.
>

Ah yes, that sounds right.

But thank you for the pointer, it demonstrates really interesting
> technique. It also confirms that pacemaker does not have native means to
> express such ordering dependency/constraints. May be it should.
>

I occasionally find that I have to use hacks like this to achieve certain
complex constraint behavior -- especially when it comes to colocation. I
don't know how many of these complex cases would be feasible to make
possible natively via RFE. Sometimes the way colocation is currently
implemented is incompatible with what users want to do. Probably requires
considerable effort to change it, though such requests are worth
documenting in RFEs.

/me makes a note to do that and annoy Ken


> > On Thu, Mar 25, 2021 at 3:39 AM Strahil Nikolov 
> > wrote:
> >
> >> OCF_CHECK_LEVEL 20
> >> NFS sometimes fails to start (systemd racing condition with dnsmasq)
> >>
> >> Best Regards,
> >> Strahil Nikolov
> >>
> >> On Thu, Mar 25, 2021 at 12:18, Andrei Borzenkov
> >>  wrote:
> >> On Thu, Mar 25, 2021 at 10:31 AM Strahil Nikolov  >
> >> wrote:
> >>>
> >>> Use Case:
> >>>
> >>> nfsA is shared filesystem for HANA running in site A
> >>> nfsB is shared filesystem for HANA running  in site B
> >>>
> >>> clusterized resource of type SAPHanaTopology must run on all systems if
> >> the FS for the HANA is running
> >>>
> >>
> >> And the reason you put NFS under pacemaker control in the first place?
> >> It is not going to switch over, just put it in /etc/fstab.
> >>
> >>> Yet, if siteA dies for some reason, I want to make SAPHanaTopology to
> >> still start on the nodes in site B.
> >>>
> >>> I think that it's a valid use case.
> >>>
> >>> Best Regards,
> >>> Strahil Nikolov
> >>>
> >>> On Thu, Mar 25, 2021 at 8:59, Ulrich Windl
> >>>  wrote:
> >>>>>> Ken Gaillot  wrote on 24.03.2021 at 18:56 in
> >>> message
> >>> <5bffded9c6e614919981dcc7d0b2903220bae19d.ca...@redhat.com>:
> >>>> On Wed, 2021‑03‑24 at 09:27 +, Strahil Nikolov wrote:
> >>>>> Hello All,
> >>>>>
> >>>>> I have a trouble creating an order set .
> >>>>> The end goal is to create a 2 node cluster where nodeA will mount
> >>>>> nfsA, while nodeB will mount nfsB. On top of that a

Re: [ClusterLabs] Antw: [EXT] Re: Order set troubles

2021-03-26 Thread Reid Wahl
On Thu, Mar 25, 2021 at 11:35 PM Reid Wahl  wrote:

>
>
> On Thu, Mar 25, 2021 at 10:20 PM Andrei Borzenkov 
> wrote:
>
>> On 25.03.2021 21:45, Reid Wahl wrote:
>> > FWIW we have this KB article (I seem to remember Strahil is a Red Hat
>> > customer):
>> >   - How do I configure SAP HANA Scale-Up System Replication in a
>> Pacemaker
>> > cluster when the HANA filesystems are on NFS shares?(
>> > https://access.redhat.com/solutions/5156571)
>> >
>>
>> "How do I make the cluster resources recover when one node loses access
>> to the NFS server?"
>>
>> If node loses access to NFS server then monitor operations for resources
>> that depend on NFS availability will fail or timeout and pacemaker will
>> recover (likely by rebooting this node). That's how similar
>> configurations have been handled for the past 20 years in other HA
>> managers. I am genuinely interested, have you encountered the case where
>> it was not enough?
>>
>
> Yes, and I was perplexed by this at the time too.
>
> I just went back and checked the notes from the support case that led to
> this article, since it's been nearly a year now. Apparently there were
> situations in which the SAPHana resource wasn't failing over when
> connectivity was lost with the NFS share that contained the hdb* binaries
> and the HANA data. I don't remember the exact details (whether demotion was
> failing, or whether it wasn't even trying to demote on the primary and
> promote on the secondary, or what). Either way, I was surprised that this
> procedure was necessary, but it seemed to be.
>
> The whole situation is a bit of a corner case in the first place. IIRC
> this procedure only makes a difference if the primary loses contact with
> the NFS server but the secondary can still access the NFS server. I expect
> that to be relatively rare. If neither node can access the NFS server, then
> we're stuck.
>
>
>>
>> > I can't remember if there was some valid reason why we had to use an
>> > attribute resource, or if we simply didn't think about the
>> sequential=false
>> > require-all=false constraint set approach when planning this out.
>> >
>>
>> Because as I already replied, this has different semantic - it will
>> start HANA on both nodes if NFS comes up on any one node.
>>
>
> Ah yes, that sounds right.
>
> But thank you for the pointer, it demonstrates really interesting
>> technique. It also confirms that pacemaker does not have native means to
>> express such ordering dependency/constraints. May be it should.
>>
>
> I occasionally find that I have to use hacks like this to achieve certain
> complex constraint behavior -- especially when it comes to colocation. I
> don't know how many of these complex cases would be feasible to make
> possible natively via RFE. Sometimes the way colocation is currently
> implemented is incompatible with what users want to do. Probably requires
> considerable effort to change it, though such requests are worth
> documenting in RFEs.
>
> /me makes a note to do that and annoy Ken
>

(Not for this use case though, at least not right now)


>
>> > On Thu, Mar 25, 2021 at 3:39 AM Strahil Nikolov 
>> > wrote:
>> >
>> >> OCF_CHECK_LEVEL 20
>> >> NFS sometimes fails to start (systemd racing condition with dnsmasq)
>> >>
>> >> Best Regards,
>> >> Strahil Nikolov
>> >>
>> >> On Thu, Mar 25, 2021 at 12:18, Andrei Borzenkov
>> >>  wrote:
>> >> On Thu, Mar 25, 2021 at 10:31 AM Strahil Nikolov <
>> hunter86...@yahoo.com>
>> >> wrote:
>> >>>
>> >>> Use Case:
>> >>>
>> >>> nfsA is shared filesystem for HANA running in site A
>> >>> nfsB is shared filesystem for HANA running  in site B
>> >>>
>> >>> clusterized resource of type SAPHanaTopology must run on all systems
>> if
>> >> the FS for the HANA is running
>> >>>
>> >>
>> >> And the reason you put NFS under pacemaker control in the first place?
>> >> It is not going to switch over, just put it in /etc/fstab.
>> >>
>> >>> Yet, if siteA dies for some reason, I want to make SAPHanaTopology to
>> >> still start on the nodes in site B.
>> >>>
>> >>> I think that it's a valid use case.
>> >>>
>> >>> Best Regards,
>> >>> Strahil Nikolov
>> >>>
>> >>> On Thu, Mar 25, 2021 at 8:59, U

Re: [ClusterLabs] staggered resource start/stop

2021-03-30 Thread Reid Wahl
On Mon, Mar 29, 2021 at 10:32 PM Klaus Wenninger 
wrote:

> On 3/29/21 8:44 AM, d tbsky wrote:
> > Reid Wahl 
> >> An order constraint set with kind=Serialize (which is mentioned in the
> first reply to the thread you linked) seems like the most logical option to
> me. You could serialize a set of resource sets, where each inner set
> contains a VirtualDomain resource and an ocf:heartbeat:Delay resource.
> >>
> >>   5.3.1. Ordering Properties (
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#idm46061192464416
> )
> >>   5.6. Ordering Sets of Resources (
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#s-resource-sets-ordering
> )
> >   thanks a lot! I don't know there is an official RA acting as
> > delay. that's interesting and useful to me.
> In this case it might be useful not to wait some defined time
> hoping startup of the VM would have gone far enough that
> the IO load has already decayed enough.
>

Agreed.

What about a resource that checks for something running
> inside the VM that indicates that startup has completed?
> Don't remember if the VirtualDomain RA might already
> have such a probe possibility.
>

Interestingly, you can add whatever you want to the monitor operation:

  monitor_scripts: To additionally monitor services within the virtual
  domain, add this parameter with a list of scripts to monitor. Note: when
  monitor scripts are used, the start and migrate_from operations will
  complete only when all monitor scripts have completed successfully. Be
  sure to set the timeout of these operations to accommodate this delay.
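
As a rough sketch (the resource name and script path here are made up),
setting that parameter could look something like:

  # pcs resource update my_vm monitor_scripts="/usr/local/bin/vm_ready_check.sh"

where the script only returns 0 once whatever you consider "startup
complete" inside the guest is actually true.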


> Klaus
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Colocation per site ?

2021-03-30 Thread Reid Wahl
You can try the following and see if it works, replacing the items in angle
brackets (<>).

# pcs constraint colocation add <resource id> with Master
<SAPHana clone id> INFINITY node-attribute=hana_<sid>_site

However, `pcs constraint colocation add --help` gives no information about
what options it accepts. It just says "[options]".

Usage: pcs constraint [constraints]...
    colocation add [<role>] <source resource id> with [<role>] <target resource id>
    [score] [options] [id=constraint-id]
        Request <source resource> to run on the same node where pacemaker has
        determined <target resource> should run.  Positive values of score
        mean the resources should be run on the same node, negative values
        mean the resources should not be run on the same node.  Specifying
        'INFINITY' (or '-INFINITY') for the score forces <source resource> to
        run (or not run) with <target resource> (score defaults to "INFINITY").
        A role can be: 'Master', 'Slave', 'Started', 'Stopped' (if no role is
        specified, it defaults to 'Started').

So it's entirely possible that pcs doesn't support creating colocation
constraints with node attributes. If not, then you could edit the CIB
manually and add a constraint like this:

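For example, something along these lines (every ID and resource name below
is a placeholder, and hana_sid_site stands in for the hana_<sid>_site node
attribute of your SID):

  <rsc_colocation id="col_rsc_with_hana_master" rsc="my_resource"
      with-rsc="SAPHana_clone" with-rsc-role="Master"
      score="INFINITY" node-attribute="hana_sid_site"/>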


On Mon, Mar 29, 2021 at 9:07 PM Strahil Nikolov 
wrote:

> Hi Ken, can you provide a prototype code example?
>
> Currently, I'm making a script that will be used in a systemd service
> managed by the cluster.
> Yet, I would like to avoid non-pacemaker solutions.
>
> Best Regards,
> Strahil Nikolov
>
> On Mon, Mar 29, 2021 at 20:12, Ken Gaillot
>  wrote:
> On Sun, 2021-03-28 at 09:20 +0300, Andrei Borzenkov wrote:
> > On 28.03.2021 07:16, Strahil Nikolov wrote:
> > > I didn't mean DC as a designated coordinator, but as a physical
> > > datacenter location.
> > > Last time I checked, the node attributes for all nodes seemed the
> > > same. I will verify that tomorrow (Monday).
> > >
> >
> > Yes, I was probably mistaken. It is different with scale-out; the agent
> > puts information in the global property section of the CIB.
> >
> > Ideally we'd need expression that says "on node where site attribute
> > is
> > the same as on node where clone master is active" but I guess there
> > is
> > no way to express it in pacemaker.
>
> Yep, colocation by node attribute (combined with colocation with
> promoted role)
>
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_colocation_properties
>
>
>
> >
> > I do not see any easy way to implement it without essentially
> > duplicating SAPHanaTopology. There are some attributes that are
> > defined
> > but never set so far, you may try to open service request to
> > implement
> > consistent attribute for all nodes on current primary site.
> >
> > ...
> >
> > Hmm ... agent sets (at least, should set) hana_${SID}_vhost attribute
> > for each node and this attribute must be unique and different between
> > two sites. May be worth to look into it.
> >
> >
> > > Best Regards,Strahil Nikolov
> > >
> > >
> > >  On Fri, Feb 19, 2021 at 16:51, Andrei Borzenkov<
> > > arvidj...@gmail.com> wrote:  On Fri, Feb 19, 2021 at 2:44 PM
> > > Strahil Nikolov  wrote:
> > > >
> > > >
> > > > > Do you have a fixed relation between node >pairs and VIPs? I.e.
> > > > > must
> > > > > A/D always get VIP1, B/E - VIP2 etc?
> > > >
> > > > I have to verify it again, but generally speaking - yes , VIP1 is
> > > > always on nodeA/D (master), VIP2 on nodeB/E (worker1) , etc.
> > > >
> > > > I guess I can set negative constraints (-inf) -> VIP1 on node B/E
> > > > + nodeC/F, but the stuff with the 'same DC as master' is the
> > > > tricky part.
> > > >
> > >
> > > I am not sure I understand what DC has to do with it. You have two
> > > scale-out SAP HANA instances, one is primary, another is secondary.
> > > If
> > > I understand correctly your requirements, your backup application
> > > needs to contact the primary instance which may failover to another
> > > site. You must be using some resource agent for it, to manage
> > > failover. The only one I am aware of is SAPHanaSR-ScaleOut. It
> > > already
> > > sets different node properties for primary and secondary sites.
> > > Just
> > > use them. If you use something else, just look at what attributes
> > > your
> > > RA sets. Otherwise you will be essentially duplicating your RA
> > > functionality because you will somehow need to find ou

Re: [ClusterLabs] staggered resource start/stop

2021-03-29 Thread Reid Wahl
An order constraint set with kind=Serialize (which is mentioned in the
first reply to the thread you linked) seems like the most logical option to
me. You could serialize a set of resource sets, where each inner set
contains a VirtualDomain resource and an ocf:heartbeat:Delay resource.

  5.3.1. Ordering Properties (
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#idm46061192464416
)
  5.6. Ordering Sets of Resources (
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#s-resource-sets-ordering
)
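
A rough sketch of that idea, with invented VM and Delay resource names (I
haven't tested this exact syntax, so treat it as a starting point):

  # pcs constraint order set vm1 vm1-delay set vm2 vm2-delay \
      set vm3 vm3-delay setoptions kind=Serialize

Each "set" pairs a VirtualDomain resource with its Delay resource, and
kind=Serialize keeps the pairs from starting in parallel.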


On Sun, Mar 28, 2021 at 7:02 PM d tbsky  wrote:

> Hi:
>since the vm start/stop at once will consume disk IO, I want to
> start/stop the vm
> one-by-one with delay.
>
> search the email-list I found the discussion
> https://oss.clusterlabs.org/pipermail/pacemaker/2013-August/043128.html
>
> now I am testing rhel8 with pacemaker 2.0.4. I wonder if there are
> new methods to solve the problem. I search the document but didn't
> find new parameters for the job.
>
> if possible I don't want to modify VirtualDomain RA which comes
> with standard rpm package. maybe I should write a new RA which stagger
> the node utilization. but if I reset the node utilization when cluster
> restart, there maybe a race condition.
>
>  thanks for help!
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] staggered resource start/stop

2021-03-29 Thread Reid Wahl
On Mon, Mar 29, 2021 at 3:35 AM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> d tbsky  wrote on 29.03.2021 at 04:01 in message
> :
> > Hi:
> >since the vm start/stop at once will consume disk IO, I want to
> > start/stop the vm
> > one-by-one with delay.
>
> I'm surprised that in these days of fast disks and SSDs this is still an
> issue.
> Maybe don't delay the start, but limit concurrent starts.
> Or maybe add some weak ordering between the VMs.
>

kind=Serialize does this. It makes the resources start consecutively, in no
particular order. I added the comment about ocf:heartbeat:Delay because D
mentioned wanting a delay... but I don't see why it would be necessary, if
Serialize is used.
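
With made-up VM resource names, the simplest form of that would be
something like:

  # pcs constraint order set vm1 vm2 vm3 setoptions kind=Serialize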


> >
> > search the email-list I found the discussion
> > https://oss.clusterlabs.org/pipermail/pacemaker/2013-August/043128.html
> >
> > now I am testing rhel8 with pacemaker 2.0.4. I wonder if there are
> > new methods to solve the problem. I search the document but didn't
> > find new parameters for the job.
> >
> > if possible I don't want to modify VirtualDomain RA which comes
> > with standard rpm package. maybe I should write a new RA which stagger
> > the node utilization. but if I reset the node utilization when cluster
> > restart, there maybe a race condition.
> >
> >  thanks for help!
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
>
>
> _______
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] failure-timeout not working in corosync 2.0.1

2021-03-31 Thread Reid Wahl
Hi, Antony. failure-timeout should be a resource meta attribute, not an
attribute of the monitor operation. At least I'm not aware of it being
configurable per-operation -- maybe it is. Can't check at the moment :)
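
In other words, taking the first primitive from your config below, the
layout would be roughly this (untested, just to show where the option
goes):

  primitive IP-float4 IPaddr2 \
      params ip=10.1.0.5 cidr_netmask=24 \
      meta migration-threshold=3 failure-timeout=180 \
      op monitor interval=10 timeout=30 on-fail=restart

i.e. failure-timeout sits under meta rather than under the monitor op.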

On Wednesday, March 31, 2021, Antony Stone 
wrote:
> Hi.
>
> I've pared my configuration down to almost a bare minimum to demonstrate
the
> problem I'm having.
>
> I have two questions:
>
> 1. What command can I use to find out what pacemaker thinks my
cluster.cib file
> really means?
>
> I know what I put in it, but I want to see what pacemaker has understood
from
> it, to make sure that pacemaker has the same idea about how to manage my
> resources as I do.
>
>
> 2. Can anyone tell me what the problem is with the following cluster.cib
> (lines split on spaces to make things more readable, the actual file
consists
> of four lines of text):
>
> primitive IP-float4
> IPaddr2
> params
> ip=10.1.0.5
> cidr_netmask=24
> meta
> migration-threshold=3
> op
> monitor
> interval=10
> timeout=30
> on-fail=restart
> failure-timeout=180
> primitive IPsecVPN
> lsb:ipsecwrapper
> meta
> migration-threshold=3
> op
> monitor
> interval=10
> timeout=30
> on-fail=restart
> failure-timeout=180
> group Everything
> IP-float4
> IPsecVPN
> resource-stickiness=100
> property cib-bootstrap-options:
> stonith-enabled=no
> no-quorum-policy=stop
> start-failure-is-fatal=false
> cluster-recheck-interval=60s
>
> My problem is that "failure-timeout" is not being honoured.  A resource
> failure simply never times out, and 3 failures (over a fortnight, if
that's
> how long it takes to get 3 failures) mean that the resources move.
>
> I want a failure to be forgotten about after 180 seconds (or at least,
soon
> after that - 240 seconds would be fine, if cluster-recheck-interval means
that
> 180 can't quite be achieved).
>
> Somehow or other, _far_ more than 180 seconds go by, and I *still* have:
>
> fail-count=1 last-failure='Wed Mar 31 21:23:11 2021'
>
> as part of the output of "crm status -f" (the above timestamp is BST, so
> that's 70 minutes ago now).
>
>
> Thanks for any help,
>
>
> Antony.
>
> --
> Don't procrastinate - put it off until tomorrow.
>
> Please reply to the list; please *don't* CC me.
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] failure-timeout not working in corosync 2.0.1

2021-03-31 Thread Reid Wahl
Maybe Pacemaker-1 was looser in its handling of resource meta attributes vs
operation meta attributes. Good question.

On Wednesday, March 31, 2021, Antony Stone 
wrote:
> On Wednesday 31 March 2021 at 22:53:53, Reid Wahl wrote:
>
>> Hi, Antony. failure-timeout should be a resource meta attribute, not an
>> attribute of the monitor operation. At least I'm not aware of it being
>> configurable per-operation -- maybe it is. Can't check at the moment :)
>
> Okay, I'll try moving it - but that still leaves me wondering why it
works fine
> in pacemaker 1.1.16 and not in 2.0.1.
>
>
> Antony.
>
> --
> Python is executable pseudocode.
> Perl is executable line noise.
>
> Please reply to the list; please *don't* CC me.
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] SAPHanaController & SAPHanaTopology question

2021-04-02 Thread Reid Wahl
On Fri, Apr 2, 2021 at 2:04 PM Strahil Nikolov 
wrote:

> Hi Reid,
>
> I will check it out in Monday, but I'm pretty sure I created an order set
> that first stops the topology and only then it stops the nfs-active.
>
> Yet, I made the stupid decision to prevent ocf:heartbeat:Filesystem (and
> setting a huge timeout for the stop operation) from killing those 2 SAP
> processes which led to 'I can't umount, giving up'-like notification and of
> course fenced the entire cluster :D .
>
> Note taken, stonith has now different delays , and Filesystem can kill the
> processes.
>
> As per the SAP note from Andrei, these could really be 'fast restart'
> mechanisms in HANA 2.0 and it looks safe to be killed (will check with SAP
> about that).
>
>
> P.S: Is there a way to remove a whole set in pcs , cause it's really
> irritating when the stupid command wipes the resource from multiple order
> constraints?
>

If you mean a whole constraint set, then yes -- run `pcs constraint --full`
to get a list of all constraints with their constraint IDs. Then run `pcs
constraint remove <constraint id>` to remove a particular constraint. This
can include set constraints.
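
For example (the constraint ID below is invented -- use whatever ID `pcs
constraint --full` prints for the set you want to drop):

  # pcs constraint --full
  # pcs constraint remove order_set_nfs_topology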


>
> Best Regards,
> Strahil Nikolov
>
>
>
> On Fri, Apr 2, 2021 at 23:44, Reid Wahl
>  wrote:
> Hi, Strahil.
>
> Based on the constraints documented in the article you're following (RH KB
> solution 5423971), I think I see what's happening.
>
> The SAPHanaTopology resource requires the appropriate nfs-active attribute
> in order to run. That means that if the nfs-active attribute is set to
> false, the SAPHanaTopology resource must stop.
>
> However, there's no rule saying SAPHanaTopology must finish stopping
> before the nfs-active attribute resource stops. In fact, it's quite the
> opposite: the SAPHanaTopology resource stops only after the nfs-active
> resource stops.
>
> At the same time, the NFS resources are allowed to stop after the
> nfs-active attribute resource has stopped. So the NFS resources are
> stopping while the SAPHana* resources are likely still active.
>
> Try something like this:
> # pcs constraint order hana_nfs1_active-clone then
> SAPHanaTopology_<SID>_<InstanceNumber>-clone kind=Optional
> # pcs constraint order hana_nfs2_active-clone then
> SAPHanaTopology_<SID>_<InstanceNumber>-clone kind=Optional
>
> This says "if both hana_nfs1_active and SAPHanaTopology are scheduled to
> start, then make hana_nfs1_active start first. If both are scheduled to
> stop, then make SAPHanaTopology stop first."
>
> "kind=Optional" means there's no order dependency unless both resources
> are already going to be scheduled for the action. I'm using kind=Optional
> here even though kind=Mandatory (the default) would make sense, because
> IIRC there were some unexpected interactions with ordering constraints for
> clones, where events on one node had unwanted effects on other nodes.
>
> I'm not able to test right now since setting up an environment for this
> even with dummy resources is non-trivial -- but you're welcome to try this
> both with and without kind=Optional if you'd like.
>
> Please let us know how this goes.
>
> On Fri, Apr 2, 2021 at 2:20 AM Strahil Nikolov 
> wrote:
>
> Hello All,
>
> I am testing the newly built HANA (Scale-out) cluster and it seems that:
> Neither SAPHanaController, nor SAPHanaTopology are stopping the HANA when
> I put the nodes (same DC = same HANA) in standby. This of course leads to a
> situation where the NFS cannot be unmounted and, despite the stop timeout,
> leads to fencing (on-fail=fence).
>
> I thought that the Controller resource agent is stopping the HANA and the
> slave role should not be 'stopped' before that .
>
> Maybe my expectations are wrong ?
>
> Best Regards,
> Strahil Nikolov
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>
>
> --
> Regards,
>
>
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
>
>

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

