Re: [ClusterLabs] Pacemaker resources are not scheduled

2018-04-16 Thread Ken Gaillot
On Mon, 2018-04-16 at 23:52 +0800, lkxjtu wrote:
> > Lkxjtu,
> 
> > On 14/04/18 00:16 +0800, lkxjtu wrote:
> >> My cluster version:
> >> Corosync 2.4.0
> >> Pacemaker 1.1.16
> >> 
> >> There are many resource anomalies. Some resources are only
> >> monitored and not recovered. Some resources are not monitored or
> >> recovered. Only one resource of vnm is scheduled normally, but this
> >> resource cannot be started because other resources in the cluster
> >> are abnormal. Just like a deadlock. I have been plagued by this
> >> problem for a long time. I just want every resource to be stable
> >> and highly available, with unlimited recovery. Is my resource
> >> configuration correct?
> 
> > see below
> 
> >> $ cat /etc/corosync/corosync.conf
> >> [co]mpatibility: whitetank
> >> 
> >> [...]
> >> 
> >> logging {
> >>   fileline:off
> >>   to_stderr:   no
> >>   to_logfile:  yes
> >>   logfile: /root/info/logs/pacemaker_cluster/corosync.log
> >>   to_syslog:   yes
> >>   syslog_facility: daemon
> >>   syslog_priority: info
> >>   debug:   off
> >>   function_name:   on
> >>   timestamp:   on
> >>   logger_subsys {
> >> subsys: AMF
> >> debug:  off
> >> tags:   enter|leave|trace1|trace2|trace3|trace4|trace6
> >>   }
> >> }
> >> 
> >> amf {
> >>   mode: disabled
> >> }
> >> 
> >> aisexec {
> >>   user:  root
> >>   group: root
> >> }
> 
> > You are apparently mixing configuration directives for older major
> > version(s) of corosync than you claim to be using.
> > See corosync_conf(5) + votequorum(5) man pages for what you are
> > supposed to configure with the actual version.
> 
> 
> Thank you for your detailed answer!
> The corosync.conf comes from our Ansible scripts, while corosync and
> pacemaker are updated from the yum repository, which is what caused
> the mismatch. I will carefully compare the directives between the new
> and old versions.
> 
> 
> > Regarding your pacemaker configuration:
> 
> >> $ crm configure show
> >> 
> >> [... reordered ... ]
> >> 
> >> property cib-bootstrap-options: \
> >> have-watchdog=false \
> >> dc-version=1.1.16-12.el7-94ff4df \
> >> cluster-infrastructure=corosync \
> >> stonith-enabled=false \
> >> start-failure-is-fatal=false \
> >> load-threshold="3200%"
> 
> > You are urged to configure fencing; otherwise asking for sane
> > cluster behaviour (which you do) is out of the question, unless
> > you know precisely why you are not configuring it.
> 
> 
> My environment is a virtual machine environment. There is no physical
> fencing device. Can I configure fencing? How do I do it?

If you have access to the physical host, see fence_virtd and the
fence_xvm fence agent. They implement fencing by having the hypervisor
kill the VM.
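
If you go that route, the cluster side could look roughly like the sketch
below (the crm syntax matches the configuration quoted elsewhere in this
thread; the VM domain name, node address and key file path are placeholders
that have to match the fence_virtd setup on the host, typically created with
"fence_virtd -c"):

  # one stonith resource per VM; fence_xvm's "port" is the libvirt domain name
  primitive fence-node9 stonith:fence_xvm \
      params port=vm-122-0-1-9 pcmk_host_list=122.0.1.9 \
             key_file=/etc/cluster/fence_xvm.key \
      op monitor interval=60s
  property stonith-enabled=true

Alternatively a single fence_xvm device with a pcmk_host_map covering all
nodes can be used.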

There are some limitations to that approach. If the host itself is
dead, then the fencing will fail. So it makes the most sense when all
the VMs are on a single host -- but that introduces a single point of
failure.

If you don't have access to the physical host, see if you have access to
some sort of VM management API. If so, you can write a fence agent that
calls the API to kill the VM (fence agents already exist for some
public cloud providers).
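
Such an agent can be quite small. Here is a purely hypothetical sketch
("my_vm_api" stands in for whatever management CLI or API you actually have);
the fencer hands the agent its options as key=value lines on standard input:

  #!/bin/sh
  # Hypothetical fence agent that kills a VM through a management API.
  # Options arrive one per line on stdin, e.g. "action=reboot", "port=<vm name>".
  action=reboot
  port=""
  while read line; do
      case "$line" in
          action=*) action=${line#action=} ;;
          port=*)   port=${line#port=} ;;
      esac
  done
  case "$action" in
      off|reboot)     my_vm_api power-off "$port" ;;   # placeholder API call
      on)             my_vm_api power-on  "$port" ;;
      monitor|status) my_vm_api status    "$port" ;;
      metadata)       printf '<resource-agent name="fence_myapi"/>\n' ;;  # a real agent prints full metadata here
      *)              exit 1 ;;
  esac

The exit status of the API call becomes the agent's result: 0 means the
fencing action succeeded, anything else means it failed.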

> >> 
> >> [... reordered ... ]
> >> 
> 
> > Furthermore you are using custom resource agents of undisclosed
> > quality and compatibility with the requirements:
> > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf
> > https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc
> 
> > Since your resources come in isolated groups, I would go one
> > by one, trying to figure out why the group won't run as expected.
> 
> > For instance:
> 
> >> primitive inetmanager inetmanager \
> >> op monitor interval=10s timeout=160 \
> >> op stop interval=0 timeout=60s on-fail=restart \
> >> op start interval=0 timeout=60s on-fail=restart \
> >> meta migration-threshold=2 failure-timeout=60s resource-stickiness=100
> >> primitive inetmanager_vip IPaddr2 \
> >> params ip=122.0.1.201 cidr_netmask=24 \
> >> op start interval=0 timeout=20 \
> >> op stop interval=0 timeout=20 \
> >> op monitor timeout=20s interval=10s depth=0 \
> >> meta migration-threshold=3 failure-timeout=60s
> >> [...]
> >> colocation inetmanager_col +inf: inetmanager_vip inetmanager
> >> order inetmanager_order Mandatory: inetmanager inetmanager_vip
> >> 
> >> [...]
> >> 
> >> $ crm status
> >> [...]
> >> Full list of resources:
> >> [...]
> >>  inetmanager_vip(ocf::heartbeat:IPaddr2):   Stopped
> >>  inetmanager(ocf::heartbeat:inetmanager):   Stopped
> >> 
> >> [...]
> >> 
> >> corosync.log of node 122.0.1.10
> >> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd:  warning:
> >> status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed
> >> (target: 7 vs. rc: 1): Error
> >> Apr 

Re: [ClusterLabs] Pacemaker resources are not scheduled

2018-04-16 Thread lkxjtu
> Lkxjtu,

> On 14/04/18 00:16 +0800, lkxjtu wrote:
>> My cluster version:
>> Corosync 2.4.0
>> Pacemaker 1.1.16
>> There are many resource anomalies. Some resources are only monitored
>> and not recovered. Some resources are not monitored or recovered.
>> Only one resource of vnm is scheduled normally, but this resource
>> cannot be started because other resources in the cluster are
>> abnormal. Just like a deadlock. I have been plagued by this problem
>> for a long time. I just want every resource to be stable and highly
>> available, with unlimited recovery. Is my resource configuration
>> correct?

> see below

>> $ cat /etc/corosync/corosync.conf
>> [co]mpatibility: whitetank
>> 
>> [...]
>> 
>> logging {
>>   fileline:off
>>   to_stderr:   no
>>   to_logfile:  yes
>>   logfile: /root/info/logs/pacemaker_cluster/corosync.log
>>   to_syslog:   yes
>>   syslog_facility: daemon
>>   syslog_priority: info
>>   debug:   off
>>   function_name:   on
>>   timestamp:   on
>>   logger_subsys {
>> subsys: AMF
>> debug:  off
>> tags:   enter|leave|trace1|trace2|trace3|trace4|trace6
>>   }
>> }
>> 
>> amf {
>>   mode: disabled
>> }
>> 
>> aisexec {
>>   user:  root
>>   group: root
>> }

> You are apparently mixing configuration directives for older major
> version(s) of corosync than you claim to be using.
> See corosync_conf(5) + votequorum(5) man pages for what you are
> supposed to configure with the actual version.


Thank you for your detailed answer!
The corosync.conf comes from our Ansible scripts, while corosync and pacemaker
are updated from the yum repository, which is what caused the mismatch. I will
carefully compare the directives between the new and old versions.


> Regarding your pacemaker configuration:

>> $ crm configure show
>> 
>> [... reordered ... ]
>> 
>> property cib-bootstrap-options: \
>> have-watchdog=false \
>> dc-version=1.1.16-12.el7-94ff4df \
>> cluster-infrastructure=corosync \
>> stonith-enabled=false \
>> start-failure-is-fatal=false \
>> load-threshold="3200%"

> You are urged to configure fencing; otherwise asking for sane
> cluster behaviour (which you do) is out of the question, unless
> you know precisely why you are not configuring it.


My environment is a virtual machine environment. There is no physical fencing
device. Can I configure fencing? How do I do it?


>> [... reordered ... ]
>>
> Furthermore you are using custom resource agents of undisclosed
> quality and compatibility with the requirements:
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf
> https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc

> Since your resources come in isolated groups, I would go one
> by one, trying to figure out why the group won't run as expected.

> For instance:

>> primitive inetmanager inetmanager \
>> op monitor interval=10s timeout=160 \
>> op stop interval=0 timeout=60s on-fail=restart \
>> op start interval=0 timeout=60s on-fail=restart \
>> meta migration-threshold=2 failure-timeout=60s 
>> resource-stickiness=100
>> primitive inetmanager_vip IPaddr2 \
>> params ip=122.0.1.201 cidr_netmask=24 \
>> op start interval=0 timeout=20 \
>> op stop interval=0 timeout=20 \
>> op monitor timeout=20s interval=10s depth=0 \
>> meta migration-threshold=3 failure-timeout=60s
>> [...]
>> colocation inetmanager_col +inf: inetmanager_vip inetmanager
>> order inetmanager_order Mandatory: inetmanager inetmanager_vip
>> 
>> [...]
>> 
>> $ crm status
>> [...]
>> Full list of resources:
>> [...]
>>  inetmanager_vip(ocf::heartbeat:IPaddr2):   Stopped
>>  inetmanager(ocf::heartbeat:inetmanager):   Stopped
>> 
>> [...]
>> 
>> corosync.log of node 122.0.1.10
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd:  warning: 
>> status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed 
>> (target: 7 vs. rc: 1): Error
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd: info: 
>> abort_transition_graph: Transition aborted by operation 
>> inetmanager_monitor_0 'modify' on 122.0.1.9: Event failed | 
>> magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400 
>> source=match_graph_event:310 complete=false
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd: info: 
>> match_graph_event:  Action inetmanager_monitor_0 (24) confirmed on 
>> 122.0.1.9 (rc=1)
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd: info: 
>> process_graph_event:Detected action (360.24) 
>> inetmanager_monitor_0.2152=unknown error: failed
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd:  warning: 
>> status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed 
>> (target: 7 vs. rc: 1): Error
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd: info: 
>> 

Re: [ClusterLabs] Pacemaker resources are not scheduled

2018-04-16 Thread Jan Pokorný
Lkxjtu,

On 14/04/18 00:16 +0800, lkxjtu wrote:
> My cluster version:
> Corosync 2.4.0
> Pacemaker 1.1.16
> 
> There are many resource anomalies. Some resources are only monitored
> and not recovered. Some resources are not monitored or recovered.
> Only one resource of vnm is scheduled normally, but this resource
> cannot be started because other resources in the cluster are
> abnormal. Just like a deadlock. I have been plagued by this problem
> for a long time. I just want every resource to be stable and highly
> available, with unlimited recovery. Is my resource configuration
> correct?

see below

> $ cat /etc/corosync/corosync.conf
> [co]mpatibility: whitetank
> 
> [...]
> 
> logging {
>   fileline:off
>   to_stderr:   no
>   to_logfile:  yes
>   logfile: /root/info/logs/pacemaker_cluster/corosync.log
>   to_syslog:   yes
>   syslog_facility: daemon
>   syslog_priority: info
>   debug:   off
>   function_name:   on
>   timestamp:   on
>   logger_subsys {
> subsys: AMF
> debug:  off
> tags:   enter|leave|trace1|trace2|trace3|trace4|trace6
>   }
> }
> 
> amf {
>   mode: disabled
> }
> 
> aisexec {
>   user:  root
>   group: root
> }

You are apparently mixing configuration directives for older major
version(s) of corosync than you claim to be using.
See corosync_conf(5) + votequorum(5) man pages for what you are
supposed to configure with the actual version.
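
For comparison, a minimal corosync 2.x style configuration has roughly the
shape sketched below (a sketch only: the node addresses are taken from the
logs in this thread, the cluster name is a placeholder, and corosync.conf(5)
plus votequorum(5) remain the authoritative reference):

  totem {
      version: 2
      cluster_name: paascluster
      transport: udpu
  }

  nodelist {
      node {
          ring0_addr: 122.0.1.9
          nodeid: 1
      }
      node {
          ring0_addr: 122.0.1.10
          nodeid: 2
      }
  }

  quorum {
      provider: corosync_votequorum
  }

  logging {
      to_logfile: yes
      logfile: /var/log/cluster/corosync.log
      to_syslog: yes
  }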

Regarding your pacemaker configuration:

> $ crm configure show
> 
> [... reordered ... ]
> 
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.16-12.el7-94ff4df \
> cluster-infrastructure=corosync \
> stonith-enabled=false \
> start-failure-is-fatal=false \
> load-threshold="3200%"

You are urged to configure fencing; otherwise asking for sane
cluster behaviour (which you do) is out of the question, unless
you know precisely why you are not configuring it.

> 
> [... reordered ... ]
> 

Furthermore you are using custom resource agents of undisclosed
quality and compatibility with the requirements:
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf
https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc
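
To make that contract concrete, a bare-bones OCF agent has roughly this shape
(an illustration only; the daemon path is hypothetical and this is not the
poster's actual inetmanager agent):

  #!/bin/sh
  # Bare-bones OCF agent sketch (illustration only).
  : ${OCF_ROOT:=/usr/lib/ocf}
  . ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

  DAEMON=/usr/local/bin/inetmanagerd     # hypothetical service binary

  agent_monitor() {
      # A probe on a node where the service is not running must return
      # OCF_NOT_RUNNING (7).  Returning OCF_ERR_GENERIC (1) instead is exactly
      # what produces "(target: 7 vs. rc: 1)" failures like those in the logs.
      pgrep -f "$DAEMON" >/dev/null && return $OCF_SUCCESS
      return $OCF_NOT_RUNNING
  }

  agent_start() {
      agent_monitor && return $OCF_SUCCESS
      "$DAEMON" && return $OCF_SUCCESS
      return $OCF_ERR_GENERIC
  }

  agent_stop() {
      # A real agent must verify the process is actually gone before returning.
      pkill -f "$DAEMON"
      return $OCF_SUCCESS
  }

  case "$1" in
      start)        agent_start ;;
      stop)         agent_stop ;;
      monitor)      agent_monitor ;;
      meta-data)    exit $OCF_SUCCESS ;;   # a real agent must print its metadata XML here
      validate-all) exit $OCF_SUCCESS ;;
      *)            exit $OCF_ERR_UNIMPLEMENTED ;;
  esac
  exit $?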

Since your resources come in isolated groups, I would go one
by one, trying to figure out why the group won't run as expected.
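
As a concrete starting point, the standard tools can be driven like this (the
agent path is an assumption based on the ocf::heartbeat provider shown in the
crm status output quoted below):

  # exercise the agent outside the cluster
  ocf-tester -n inetmanager /usr/lib/ocf/resource.d/heartbeat/inetmanager

  # show stopped resources and recorded fail counts in one shot
  crm_mon --one-shot --inactive --failcounts

  # clear old failures once the agent behaves correctly
  crm_resource --cleanup --resource inetmanager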

For instance:

> primitive inetmanager inetmanager \
> op monitor interval=10s timeout=160 \
> op stop interval=0 timeout=60s on-fail=restart \
> op start interval=0 timeout=60s on-fail=restart \
> meta migration-threshold=2 failure-timeout=60s resource-stickiness=100
> primitive inetmanager_vip IPaddr2 \
> params ip=122.0.1.201 cidr_netmask=24 \
> op start interval=0 timeout=20 \
> op stop interval=0 timeout=20 \
> op monitor timeout=20s interval=10s depth=0 \
> meta migration-threshold=3 failure-timeout=60s
> [...]
> colocation inetmanager_col +inf: inetmanager_vip inetmanager
> order inetmanager_order Mandatory: inetmanager inetmanager_vip
> 
> [...]
> 
> $ crm status
> [...]
> Full list of resources:
> [...]
>  inetmanager_vip(ocf::heartbeat:IPaddr2):   Stopped
>  inetmanager(ocf::heartbeat:inetmanager):   Stopped
> 
> [...]
> 
> corosync.log of node 122.0.1.10
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd:  warning: 
> status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed 
> (target: 7 vs. rc: 1): Error
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd: info: 
> abort_transition_graph: Transition aborted by operation inetmanager_monitor_0 
> 'modify' on 122.0.1.9: Event failed | 
> magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400 
> source=match_graph_event:310 complete=false
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd: info: 
> match_graph_event:  Action inetmanager_monitor_0 (24) confirmed on 
> 122.0.1.9 (rc=1)
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd: info: 
> process_graph_event:Detected action (360.24) 
> inetmanager_monitor_0.2152=unknown error: failed
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd:  warning: 
> status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed 
> (target: 7 vs. rc: 1): Error
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd: info: 
> abort_transition_graph: Transition aborted by operation inetmanager_monitor_0 
> 'modify' on 122.0.1.9: Event failed | 
> magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400 
> source=match_graph_event:310 complete=false
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd: info: 
> match_graph_event:  Action inetmanager_monitor_0 (24) confirmed on 
> 122.0.1.9 (rc=1)
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd: info: 
>