Thanks for the reply, Andreas.
On Fri, Aug 5, 2016 at 1:48 AM, Andreas Kurz <andreas.k...@gmail.com> wrote:
> Hi,
>
> On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov <koshi...@gmail.com> wrote:
>>
>> Hello list,
>>
>> Can you please help me debug one resource that is not being started after
>> node failover?
>>
>> Here is the configuration that I'm testing: a 3-node (KVM VM) cluster:
>>
>> node 10: aic-controller-58055.test.domain.local
>> node 6: aic-controller-50186.test.domain.local
>> node 9: aic-controller-12993.test.domain.local
>> primitive cmha cmha \
>>   params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad" \
>>     pidfile="/var/run/cmha/cmha.pid" user=cmha \
>>   meta failure-timeout=30 resource-stickiness=1 target-role=Started migration-threshold=3 \
>>   op monitor interval=10 on-fail=restart timeout=20 \
>>   op start interval=0 on-fail=restart timeout=60 \
>>   op stop interval=0 on-fail=block timeout=90
>
> What is the output of crm_mon -1frA once a node is down ... any failed
> actions?

No errors/failed actions. This is a slightly different lab (the names changed), but it shows the same effect:

root@aic-controller-57150:~# crm_mon -1frA
Last updated: Fri Aug  5 20:14:05 2016
Last change: Fri Aug  5 19:38:34 2016 by root via crm_attribute on aic-controller-44151.test.domain.local
Stack: corosync
Current DC: aic-controller-57150.test.domain.local (version 1.1.14-70404b0) - partition with quorum
3 nodes and 7 resources configured

Online: [ aic-controller-57150.test.domain.local aic-controller-58381.test.domain.local ]
OFFLINE: [ aic-controller-44151.test.domain.local ]

Full list of resources:

 sysinfo_aic-controller-44151.test.domain.local  (ocf::pacemaker:SysInfo):  Stopped
 sysinfo_aic-controller-57150.test.domain.local  (ocf::pacemaker:SysInfo):  Started aic-controller-57150.test.domain.local
 sysinfo_aic-controller-58381.test.domain.local  (ocf::pacemaker:SysInfo):  Started aic-controller-58381.test.domain.local
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ aic-controller-57150.test.domain.local aic-controller-58381.test.domain.local ]
 cmha  (ocf::heartbeat:cmha):  Stopped

Node Attributes:
* Node aic-controller-57150.test.domain.local:
    + arch         : x86_64
    + cpu_cores    : 3
    + cpu_info     : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
    + cpu_load     : 1.04
    + cpu_speed    : 4994.21
    + free_swap    : 5150
    + os           : Linux-3.13.0-85-generic
    + ram_free     : 750
    + ram_total    : 5000
    + root_free    : 45932
    + var_log_free : 431543
* Node aic-controller-58381.test.domain.local:
    + arch         : x86_64
    + cpu_cores    : 3
    + cpu_info     : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
    + cpu_load     : 1.16
    + cpu_speed    : 4994.21
    + free_swap    : 5150
    + os           : Linux-3.13.0-85-generic
    + ram_free     : 750
    + ram_total    : 5000
    + root_free    : 45932
    + var_log_free : 431542

Migration Summary:
* Node aic-controller-57150.test.domain.local:
* Node aic-controller-58381.test.domain.local:

>> primitive sysinfo_aic-controller-12993.test.domain.local ocf:pacemaker:SysInfo \
>>   params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>>   op monitor interval=15s
>> primitive sysinfo_aic-controller-50186.test.domain.local ocf:pacemaker:SysInfo \
>>   params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>>   op monitor interval=15s
>> primitive sysinfo_aic-controller-58055.test.domain.local ocf:pacemaker:SysInfo \
>>   params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>>   op monitor interval=15s
>
> You can use a clone for this sysinfo resource and a symmetric cluster for
> a more compact configuration ... then you can skip all these location
> constraints.
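[Editor's note: for readers following along, the compact variant Andreas describes could look roughly like the sketch below. It is untested against this cluster; the resource and clone names p_sysinfo/cl_sysinfo are made up here, and the SysInfo parameters are copied from the configuration above.

  # one SysInfo primitive, cloned to every node (names are examples)
  primitive p_sysinfo ocf:pacemaker:SysInfo \
    params disk_unit=M disks="/ /var/log" min_disk_free=512M \
    op monitor interval=15s
  clone cl_sysinfo p_sysinfo
  # allow resources on all nodes by default
  property symmetric-cluster=true

With symmetric-cluster=true, resources may run on every node by default, so the per-node inf: location constraints for sysinfo become unnecessary; the 100-score location preferences for cmha can stay if that placement bias is still wanted.]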
>> location cmha-on-aic-controller-12993.test.domain.local cmha 100: aic-controller-12993.test.domain.local
>> location cmha-on-aic-controller-50186.test.domain.local cmha 100: aic-controller-50186.test.domain.local
>> location cmha-on-aic-controller-58055.test.domain.local cmha 100: aic-controller-58055.test.domain.local
>> location sysinfo-on-aic-controller-12993.test.domain.local sysinfo_aic-controller-12993.test.domain.local inf: aic-controller-12993.test.domain.local
>> location sysinfo-on-aic-controller-50186.test.domain.local sysinfo_aic-controller-50186.test.domain.local inf: aic-controller-50186.test.domain.local
>> location sysinfo-on-aic-controller-58055.test.domain.local sysinfo_aic-controller-58055.test.domain.local inf: aic-controller-58055.test.domain.local
>> property cib-bootstrap-options: \
>>   have-watchdog=false \
>>   dc-version=1.1.14-70404b0 \
>>   cluster-infrastructure=corosync \
>>   cluster-recheck-interval=15s \
>
> Never tried such a low cluster-recheck-interval ... wouldn't do that. I
> saw setups with low intervals burning a lot of CPU cycles in bigger cluster
> setups, and side effects from aborted transitions. If you do this to
> "clean up" the cluster state because you see resource-agent errors, you
> should rather fix the resource agent.

This small interval is the result of debugging the cmha resource issue. In general the whole cluster uses 190s, and since 15s didn't help, it will be rolled back.

> Regards,
> Andreas

>>   no-quorum-policy=stop \
>>   stonith-enabled=false \
>>   start-failure-is-fatal=false \
>>   symmetric-cluster=false \
>>   node-health-strategy=migrate-on-red \
>>   last-lrm-refresh=1470334410
>>
>> When all 3 nodes are online everything looks OK; this is the output of showscores.sh:
>> Resource                                        Score      Node                                    Stickiness  #Fail  Migration-Threshold
>> cmha                                            -INFINITY  aic-controller-12993.test.domain.local  1           0
>> cmha                                            101        aic-controller-50186.test.domain.local  1           0
>> cmha                                            -INFINITY  aic-controller-58055.test.domain.local  1           0
>> sysinfo_aic-controller-12993.test.domain.local  INFINITY   aic-controller-12993.test.domain.local  0           0
>> sysinfo_aic-controller-50186.test.domain.local  -INFINITY  aic-controller-50186.test.domain.local  0           0
>> sysinfo_aic-controller-58055.test.domain.local  INFINITY   aic-controller-58055.test.domain.local  0           0
>>
>> The problem starts when one node (aic-controller-50186) goes offline: the
>> cmha resource gets stuck in the Stopped state.
>> Here are the showscores:
>> Resource  Score      Node                                    Stickiness  #Fail  Migration-Threshold
>> cmha      -INFINITY  aic-controller-12993.test.domain.local  1           0
>> cmha      -INFINITY  aic-controller-50186.test.domain.local  1           0
>> cmha      -INFINITY  aic-controller-58055.test.domain.local  1           0
>>
>> Even though it has target-role=Started, pacemaker skips this resource.
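[Editor's note: the per-node allocation scores in the tables above can also be pulled straight from the policy engine, without the showscores.sh wrapper; this is handy for confirming where the -INFINITY entries come from. A minimal check, assuming the stock pacemaker 1.1 CLI on a cluster node:

  # print the pengine's allocation scores computed against the live CIB
  crm_simulate -sL | grep -w cmha
]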
>> And in the logs I see:
>> pengine:     info: native_print:  cmha  (ocf::heartbeat:cmha):  Stopped
>> pengine:     info: native_color:  Resource cmha cannot run anywhere
>> pengine:     info: LogActions:    Leave  cmha  (Stopped)
>>
>> To recover the cmha resource I need to run either:
>> 1) crm resource cleanup cmha
>> 2) crm resource reprobe
>>
>> After either of the above commands the resource is picked up by pacemaker
>> again and I see valid scores:
>> Resource  Score      Node                                    Stickiness  #Fail  Migration-Threshold
>> cmha      100        aic-controller-58055.test.domain.local  1           0      3
>> cmha      101        aic-controller-12993.test.domain.local  1           0      3
>> cmha      -INFINITY  aic-controller-50186.test.domain.local  1           0      3
>>
>> So the questions here: why doesn't the cluster recheck work, and should
>> it do reprobing?
>> How can I make migration work, or what did I miss in the configuration
>> that prevents migration?
>>
>> corosync 2.3.4
>> pacemaker 1.1.14
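[Editor's note: the recheck question above hinges on the interplay between failure-timeout=30 on cmha and cluster-recheck-interval, so it can help to watch whether the fail counts actually expire. A sketch in the same crmsh syntax as the cleanup/reprobe commands above; the node name is just one of the nodes from this cluster, and the same check would be repeated per node:

  # per-node fail count for cmha, before and after a recheck interval elapses
  crm resource failcount cmha show aic-controller-58055.test.domain.local
  # the same value, queried as the raw status attribute
  crm_attribute -t status -N aic-controller-58055.test.domain.local -n fail-count-cmha -G
]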
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org