Re: [ClusterLabs] Fence agent definition under Centos7.6

2019-05-31 Thread Ken Gaillot
On Fri, 2019-05-31 at 22:32 +, Michael Powell wrote:
> Although I am personally a novice wrt cluster operation, several
> years ago my company developed a product that used Pacemaker.  I’ve
> been charged with porting that product to a platform running Centos
> 7.6.  The old product ran Pacemaker 1.1.13 and heartbeat.  For the
> most part, the transition to Pacemaker 1.1.19 and Corosync has gone
> pretty well, but there’s one aspect that I’m struggling with: fence-
> agents.
>  
> The old product used a fence agent developed in-house, named
> mgpstonith, to implement STONITH.  While it was no trouble to compile
> and install the code, I see lots of messages like the following in the
> system log:
>  
> stonith-ng[31120]: error: Unknown fence agent: external/mgpstonith

Support for the "external" fence agents (a.k.a. Linux-HA-style agents)
is a compile-time option in Pacemaker because it requires the
third-party cluster-glue library, and the CentOS packages are not built
with that support.

Your options are either to build cluster-glue and Pacemaker yourself
instead of using the CentOS Pacemaker packages, or to rewrite the agent
as an RHCS-style agent:

https://github.com/ClusterLabs/fence-agents/blob/master/doc/FenceAgentAPI.md
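
For reference, stonith-ng expects RHCS-style agents to be executables
named fence_<something>, typically installed in /usr/sbin, that read
their parameters as key=value lines on stdin. Below is a minimal,
illustrative skeleton in Python; the agent name (fence_mgp), the
"plug" parameter, and the placeholder actions are examples only, not
mgpstonith's real logic. See the document above for the full set of
expected actions and the metadata format.

#!/usr/bin/env python3
# Minimal RHCS-style fence agent skeleton (illustrative only).
# stonith-ng passes parameters on stdin as key=value lines, one per line.

import sys

METADATA = """<?xml version="1.0" ?>
<resource-agent name="fence_mgp" shortdesc="Example fence agent skeleton">
  <longdesc>Skeleton only; replace the placeholder actions with real
power-control logic.</longdesc>
  <parameters>
    <parameter name="plug" unique="0" required="0">
      <content type="string"/>
      <shortdesc lang="en">Name of the node to fence</shortdesc>
    </parameter>
  </parameters>
  <actions>
    <action name="on"/>
    <action name="off"/>
    <action name="reboot"/>
    <action name="monitor"/>
    <action name="metadata"/>
  </actions>
</resource-agent>"""


def read_options():
    """Parse key=value pairs from stdin, one per line."""
    opts = {}
    for line in sys.stdin:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


def main():
    opts = read_options()
    # "action" is the current parameter name; "option" is the older spelling.
    action = opts.get("action", opts.get("option", "reboot"))

    if action == "metadata":
        print(METADATA)
        return 0
    if action in ("monitor", "status"):
        return 0  # placeholder: verify the fence device is reachable
    if action in ("off", "on", "reboot"):
        return 0  # placeholder: perform the real power operation here
    sys.stderr.write("unsupported action: %s\n" % action)
    return 1


if __name__ == "__main__":
    sys.exit(main())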
  
> stonith-ng[31120]: error: Agent external/mgpstonith not found or does
> not support meta-data: Invalid argument (22)
> stonith-ng[31120]: error: Could not retrieve metadata for fencing
> agent external/mgpstonith
>  
> I’ve put debug messages in mgpstonith, and as they do not appear in
> the system log, I’ve inferred that it is in fact never executed.
>  
> Initially, I installed mgpstonith on /lib64/stonith/plugins/external,
> which is where it was located on the old product.  I’ve copied it to
> other locations, e.g. /usr/sbin, with no better luck.  I’ve searched
> the web and while I’ve found lots of information about using the
> available fence agents, I’ve not turned up any information on how to
> create one “from scratch”.
>  
> Specifically, I need to know where to put mgpstonith on the target
> system(s).  Generally, I’d appreciate a pointer to any
> documentation/specification relevant to writing code for a fence
> agent.
>  
> Thanks,
>   Michael
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Pacemaker not reacting as I would expect when two resources fail at the same time

2019-05-31 Thread Ken Gaillot
On Thu, 2019-05-30 at 23:39 +, Harvey Shepherd wrote:
> Hi All,
> 
> I'm running Pacemaker 2.0.1 on a cluster containing two nodes: one
> master and one slave. I have a main master/slave resource
> (m_main_system), a group of resources that run in active-active mode
> (active_active - i.e. run on both nodes), and a group that runs in
> active-disabled mode (snmp_active_disabled - resources only run on
> the current promoted master). The snmp_active_disabled group is
> configured to be co-located with the master of m_main_system, so only
> a failure of the master m_main_system resource can trigger a
> failover. The constraints specify that m_main_system must be started
> before snmp_active_disabled.
> 
> The problem I'm having is this: when a resource in the
> snmp_active_disabled group fails and gets into a constant restart
> cycle, and I then kill m_main_system on the master, Pacemaker keeps
> trying to restart the failed snmp_active_disabled resource and ignores
> the more important m_main_system failure, which should be triggering a
> failover. If I stabilise the snmp_active_disabled resource, Pacemaker
> finally acts on the m_main_system failure. I hope I've described this
> well enough, but I've included a cut-down form of my CIB config below
> in case it helps!
> 
> Is this a bug or an error in my config? Perhaps the order in which
> the groups are defined in the CIB matters despite the constraints?
> Any help would be gratefully received.
> 
> Thanks,
> Harvey
> 
> [The CIB snippet quoted here was garbled by the list archive, which
> stripped most of the XML tags. The recoverable pieces are the
> m_main_system master/slave resource (type main-system-ocf, with
> start/stop/promote/demote/notify operations plus a 10-second monitor
> in the Master role and an 11-second monitor in the Slave role), the
> snmp_active_disabled and clone_active_active groups, and the
> following constraints:]
>
>   <rsc_colocation ... score="INFINITY" rsc="snmp_active_disabled"
>       with-rsc="m_main_system" with-rsc-role="Master"/>
>   <rsc_order ... kind="Mandatory" first="m_main_system"
>       then="snmp_active_disabled"/>

You want first-action="promote" in the above constraint, otherwise the
slave being started (or the master being started but not yet promoted)
is sufficient to start snmp_active_disabled (even though the colocation
ensures it will only be started on the same node where the master will
be).

I'm not sure if that's related to your issue, but it's worth trying
first.
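
For example, the order constraint would then look something like this
(the id shown is only illustrative, since the original ids were lost
when the archive garbled the XML):

  <rsc_order id="order-promote-main-then-snmp" kind="Mandatory"
             first="m_main_system" first-action="promote"
             then="snmp_active_disabled"/>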

>   <rsc_order ... first="m_main_system" then="clone_active_active"/>

You may also want to set interleave to true on clone_active_active, if
you want it to depend only on the local instance of m_main_system, and
not both instances.
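
For example, with something like the following inside the clone
definition (the ids are illustrative and the cloned group's contents
are omitted):

  <clone id="clone_active_active">
    <meta_attributes id="clone_active_active-meta_attributes">
      <nvpair id="clone_active_active-interleave"
              name="interleave" value="true"/>
    </meta_attributes>
    ...
  </clone>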

> [remainder of the CIB snippet (closing tags) garbled in the archive]
-- 
Ken Gaillot 
