Re: [ClusterLabs] Re: No slave is promoted to be master
Sent from iPhone

> On 17 Apr 2018, at 7:16, 范国腾 wrote:
>
> I checked the status again. It is not that it is never promoted; it is promoted about 15 minutes after the cluster starts.
>
> I tried this in three labs and the results are the same: the promotion happens 15 minutes after the cluster starts.
>
> Why is there an approximately 15-minute delay every time?

That rings a bell. 15 minutes is the default interval for re-evaluation of time-based rules; my understanding so far is that this timer also picks up other configuration changes (basically, it runs the policy engine to make decisions). I had a similar effect when I attempted to change quorum state directly, without going via external node events. So it looks like whatever sets the master scores does not trigger the policy engine.

> Apr 16 22:08:32 node1 attrd[16618]: notice: Node sds1 state is now member
> Apr 16 22:08:32 node1 attrd[16618]: notice: Node sds2 state is now member
> ..
> Apr 16 22:21:36 node1 pgsqlms(pgsqld)[18230]: INFO: Execute action monitor and the result 0
> Apr 16 22:21:52 node1 pgsqlms(pgsqld)[18257]: INFO: Execute action monitor and the result 0
> Apr 16 22:22:09 node1 pgsqlms(pgsqld)[18296]: INFO: Execute action monitor and the result 0
> Apr 16 22:22:25 node1 pgsqlms(pgsqld)[18315]: INFO: Execute action monitor and the result 0
> Apr 16 22:22:41 node1 pgsqlms(pgsqld)[18343]: INFO: Execute action monitor and the result 0
> Apr 16 22:22:57 node1 pgsqlms(pgsqld)[18362]: INFO: Execute action monitor and the result 0
> Apr 16 22:23:13 node1 pgsqlms(pgsqld)[18402]: INFO: Execute action monitor and the result 0
> Apr 16 22:23:29 node1 pgsqlms(pgsqld)[18421]: INFO: Execute action monitor and the result 0
> Apr 16 22:23:45 node1 pgsqlms(pgsqld)[18449]: INFO: Execute action monitor and the result 0
> Apr 16 22:23:57 node1 crmd[16620]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> Apr 16 22:23:57 node1 pengine[16619]: notice: Promote pgsqld:0#011(Slave -> Master sds1)
> Apr 16 22:23:57 node1 pengine[16619]: notice: Start master-vip#011(sds1)
> Apr 16 22:23:57 node1 pengine[16619]: notice: Start pgsql-master-ip#011(sds1)
> Apr 16 22:23:57 node1 pengine[16619]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-18.bz2
> Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating cancel operation pgsqld_monitor_16000 locally on sds1
> Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 locally on sds1
> Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 on sds2
> Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Promoting instance on node "sds1"
> Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Current node TL#LSN: 4#117440512
> Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Execute action notify and the result 0
> Apr 16 22:23:58 node1 crmd[16620]: notice: Result of notify operation for pgsqld on sds1: 0 (ok)
> Apr 16 22:23:58 node1 crmd[16620]: notice: Initiating promote operation pgsqld_promote_0 locally on sds1
> Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18499]: INFO: Waiting for the promote to complete
> Apr 16 22:23:59 node1 pgsqlms(pgsqld)[18499]: INFO: Promote complete
>
> [root@node1 ~]# crm_simulate -sL
>
> Current cluster status:
> Online: [ sds1 sds2 ]
>
> Master/Slave Set: pgsql-ha [pgsqld]
>     Masters: [ sds1 ]
>     Slaves: [ sds2 ]
> Resource Group: mastergroup
>     master-vip (ocf::heartbeat:IPaddr2): Started sds1
>     pgsql-master-ip (ocf::heartbeat:IPaddr2): Started sds1
>
> Allocation scores:
> clone_color: pgsql-ha allocation score on sds1: 1
> clone_color: pgsql-ha allocation score on sds2: 1
> clone_color: pgsqld:0 allocation score on sds1: 1003
> clone_color: pgsqld:0 allocation score on sds2: 1
> clone_color: pgsqld:1 allocation score on sds1: 1
> clone_color: pgsqld:1 allocation score on sds2: 1002
> native_color: pgsqld:0 allocation score on sds1: 1003
> native_color: pgsqld:0 allocation score on sds2: 1
> native_color: pgsqld:1 allocation score on sds1: -INFINITY
> native_color: pgsqld:1 allocation score on sds2: 1002
> pgsqld:0 promotion score on sds1: 1002
> pgsqld:1 promotion score on sds2: 1001
> group_color: mastergroup allocation score on sds1: 0
> group_color: mastergroup allocation score on sds2: 0
> group_color: master-vip allocation score on sds1: 0
> group_color: master-vip allocation score on sds2: 0
> native_color: master-vip allocation score on sds1: 1003
> native_color: master-vip allocation score on sds2: -INFINITY
> native_color: pgsql-master-ip allocation score on sds1: 1003
> native_color: pgsql-master-ip allocation score on sds2: -INFINITY
>
> Transition Summary:
> [root@node1 ~]#
>
> You could reproduce the issue on two nodes by executing the following commands, then running "pcs cluster stop --all" and "pcs cluster start --all".
>
> pcs
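If the periodic re-check timer is indeed what eventually triggers the promotion, one way to confirm (and work around) the delay is to shorten Pacemaker's cluster-recheck-interval property, which defaults to 15 minutes. A sketch using the same pcs shell as in this thread; the 60s value is only illustrative, not a recommendation:

```shell
# Lower the policy-engine re-check interval from the default 15min
# so time-based re-evaluation happens sooner (illustrative value).
pcs property set cluster-recheck-interval=60s

# Verify the property is now set in the CIB.
pcs property list | grep cluster-recheck-interval
```

This only masks the underlying issue that setting the master score does not itself trigger a transition, but it is a quick way to test the diagnosis.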
[ClusterLabs] Re: No slave is promoted to be master
I checked the status again. It is not that it is never promoted; it is promoted about 15 minutes after the cluster starts.

I tried this in three labs and the results are the same: the promotion happens 15 minutes after the cluster starts.

Why is there an approximately 15-minute delay every time?

Apr 16 22:08:32 node1 attrd[16618]: notice: Node sds1 state is now member
Apr 16 22:08:32 node1 attrd[16618]: notice: Node sds2 state is now member
..
Apr 16 22:21:36 node1 pgsqlms(pgsqld)[18230]: INFO: Execute action monitor and the result 0
Apr 16 22:21:52 node1 pgsqlms(pgsqld)[18257]: INFO: Execute action monitor and the result 0
Apr 16 22:22:09 node1 pgsqlms(pgsqld)[18296]: INFO: Execute action monitor and the result 0
Apr 16 22:22:25 node1 pgsqlms(pgsqld)[18315]: INFO: Execute action monitor and the result 0
Apr 16 22:22:41 node1 pgsqlms(pgsqld)[18343]: INFO: Execute action monitor and the result 0
Apr 16 22:22:57 node1 pgsqlms(pgsqld)[18362]: INFO: Execute action monitor and the result 0
Apr 16 22:23:13 node1 pgsqlms(pgsqld)[18402]: INFO: Execute action monitor and the result 0
Apr 16 22:23:29 node1 pgsqlms(pgsqld)[18421]: INFO: Execute action monitor and the result 0
Apr 16 22:23:45 node1 pgsqlms(pgsqld)[18449]: INFO: Execute action monitor and the result 0
Apr 16 22:23:57 node1 crmd[16620]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 16 22:23:57 node1 pengine[16619]: notice: Promote pgsqld:0#011(Slave -> Master sds1)
Apr 16 22:23:57 node1 pengine[16619]: notice: Start master-vip#011(sds1)
Apr 16 22:23:57 node1 pengine[16619]: notice: Start pgsql-master-ip#011(sds1)
Apr 16 22:23:57 node1 pengine[16619]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-18.bz2
Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating cancel operation pgsqld_monitor_16000 locally on sds1
Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 locally on sds1
Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 on sds2
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Promoting instance on node "sds1"
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Current node TL#LSN: 4#117440512
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Execute action notify and the result 0
Apr 16 22:23:58 node1 crmd[16620]: notice: Result of notify operation for pgsqld on sds1: 0 (ok)
Apr 16 22:23:58 node1 crmd[16620]: notice: Initiating promote operation pgsqld_promote_0 locally on sds1
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18499]: INFO: Waiting for the promote to complete
Apr 16 22:23:59 node1 pgsqlms(pgsqld)[18499]: INFO: Promote complete

[root@node1 ~]# crm_simulate -sL

Current cluster status:
Online: [ sds1 sds2 ]

Master/Slave Set: pgsql-ha [pgsqld]
    Masters: [ sds1 ]
    Slaves: [ sds2 ]
Resource Group: mastergroup
    master-vip (ocf::heartbeat:IPaddr2): Started sds1
    pgsql-master-ip (ocf::heartbeat:IPaddr2): Started sds1

Allocation scores:
clone_color: pgsql-ha allocation score on sds1: 1
clone_color: pgsql-ha allocation score on sds2: 1
clone_color: pgsqld:0 allocation score on sds1: 1003
clone_color: pgsqld:0 allocation score on sds2: 1
clone_color: pgsqld:1 allocation score on sds1: 1
clone_color: pgsqld:1 allocation score on sds2: 1002
native_color: pgsqld:0 allocation score on sds1: 1003
native_color: pgsqld:0 allocation score on sds2: 1
native_color: pgsqld:1 allocation score on sds1: -INFINITY
native_color: pgsqld:1 allocation score on sds2: 1002
pgsqld:0 promotion score on sds1: 1002
pgsqld:1 promotion score on sds2: 1001
group_color: mastergroup allocation score on sds1: 0
group_color: mastergroup allocation score on sds2: 0
group_color: master-vip allocation score on sds1: 0
group_color: master-vip allocation score on sds2: 0
native_color: master-vip allocation score on sds1: 1003
native_color: master-vip allocation score on sds2: -INFINITY
native_color: pgsql-master-ip allocation score on sds1: 1003
native_color: pgsql-master-ip allocation score on sds2: -INFINITY

Transition Summary:
[root@node1 ~]#

You could reproduce the issue on two nodes by executing the following commands, then running "pcs cluster stop --all" and "pcs cluster start --all".

pcs resource create pgsqld ocf:heartbeat:pgsqlms \
    bindir=/home/highgo/highgo/database/4.3.1/bin \
    pgdata=/home/highgo/highgo/database/4.3.1/data \
    op start timeout=600s op stop timeout=60s \
    op promote timeout=300s op demote timeout=120s \
    op monitor interval=10s timeout=100s role="Master" \
    op monitor interval=16s timeout=100s role="Slave" \
    op notify timeout=60s

pcs resource master pgsql-ha pgsqld notify=true interleave=true

-----Original Message-----
From: 范国腾
Sent: 17 April 2018 10:25
To: 'Jehan-Guillaume de Rorthais'
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: [ClusterLabs] No slave is promoted to be master

Hi,

We installed a new lab which
[ClusterLabs] No slave is promoted to be master
Hi,

We installed a new lab which has only the postgres resource and the VIP resource. After the cluster is installed, the status is OK: one node is master and the other is slave.

Then I ran "pcs cluster stop --all" to shut down the cluster, and then "pcs cluster start --all" to start it again. All of the pgsql instances are in slave status and can no longer be promoted to master, like this:

Master/Slave Set: pgsql-ha [pgsqld]
    Slaves: [ sds1 sds2 ]

There is no error in the log, and "crm_simulate -sL" shows the following; the scores seem OK too. The detailed log and config are in the attachment.

[root@node1 ~]# crm_simulate -sL

Current cluster status:
Online: [ sds1 sds2 ]

Master/Slave Set: pgsql-ha [pgsqld]
    Slaves: [ sds1 sds2 ]
Resource Group: mastergroup
    master-vip (ocf::heartbeat:IPaddr2): Stopped
    pgsql-master-ip (ocf::heartbeat:IPaddr2): Stopped

Allocation scores:
clone_color: pgsql-ha allocation score on sds1: 1
clone_color: pgsql-ha allocation score on sds2: 1
clone_color: pgsqld:0 allocation score on sds1: 1003
clone_color: pgsqld:0 allocation score on sds2: 1
clone_color: pgsqld:1 allocation score on sds1: 1
clone_color: pgsqld:1 allocation score on sds2: 1002
native_color: pgsqld:0 allocation score on sds1: 1003
native_color: pgsqld:0 allocation score on sds2: 1
native_color: pgsqld:1 allocation score on sds1: -INFINITY
native_color: pgsqld:1 allocation score on sds2: 1002
pgsqld:0 promotion score on sds1: 1002
pgsqld:1 promotion score on sds2: 1001
group_color: mastergroup allocation score on sds1: 0
group_color: mastergroup allocation score on sds2: 0
group_color: master-vip allocation score on sds1: 0
group_color: master-vip allocation score on sds2: 0
native_color: master-vip allocation score on sds1: 1003
native_color: master-vip allocation score on sds2: -INFINITY
native_color: pgsql-master-ip allocation score on sds1: 1003
native_color: pgsql-master-ip allocation score on sds2: -INFINITY

Transition Summary:
* Promote pgsqld:0 (Slave -> Master sds1)
* Start master-vip (sds1)
* Start pgsql-master-ip (sds1)

Attachment: log.rar

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] attrd/attrd_updater asynchronous behavior
I got an answer on IRC from Ken Gaillot. Below is his answer, for tracking purposes.

On Mon, 16 Apr 2018 23:28:39 +0200, Jehan-Guillaume de Rorthais wrote:
[...]
> * Is looping until the value becomes available enough to conclude that all
> other nodes have the same value? Or is it available only locally on the
> action's node and not yet "replicated" to other nodes?

kgaillot: « That issue has come up recently in a different context. You are correct: currently there is no guarantee that the value has been set anywhere else, and looping until the query comes back only ensures that the new value is in the local attrd. The solution will probably be to offer something like a --wait option that doesn't return until the value is available (maybe locally, maybe everywhere, or maybe that's part of the option). »

I'll file a bz to track this feature, as discussed on IRC.

> * Any other suggestions about how we could share values synchronously with
> all other nodes?

Any suggestion is very welcome...
[ClusterLabs] attrd/attrd_updater asynchronous behavior
Hi,

I have a question regarding attrd's asynchronous behavior.

In PAF, during the election process to pick the best PostgreSQL master, we use private attributes to publish the status (LSN) of each pgsql instance during the pre-promote action. Because we need these LSNs from every node during the promote action, each time we call

  attrd_updater --name blah --update x

we run a loop of

  attrd_updater --name blah --query

until the fetched value is the same as the one we set. We basically tried to force a synchronous behavior. See:
https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L310

But we have an issue on GitHub that makes me think this might not be enough to make sure all the private attributes become available across the cluster during the pre-promote action, before the promote action is triggered. See:
https://github.com/ClusterLabs/PAF/issues/131

In this issue, a simple switchover ("pcs move") fails during the promotion of the designated slave, because it could not get the other nodes' LSNs:

  ocf-exit-reason:Can not get LSN location for "pg1-dev"

* Is looping until the value becomes available enough to conclude that all other nodes have the same value? Or is the value available only locally on the action's node and not yet "replicated" to other nodes?
* Any other suggestions about how we could share values synchronously with all other nodes?

Thanks for your help,
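For reference, the set-then-poll pattern described above can be sketched in plain shell, outside the PAF Perl code (a simplified illustration: the attribute name and value are placeholders, the loop bounds are arbitrary, and the exact --query output format varies by Pacemaker version, so the grep only checks that the value string appears):

```shell
lsn="4/07000000"   # placeholder value to publish

# Publish a private, non-persistent attribute (flags as used by PAF).
attrd_updater --name lsn_location --private --lifetime reboot --update "$lsn"

# Poll until the value is visible. Note: as discussed in this thread,
# this only proves the LOCAL attrd has the value, not that it has
# reached the other nodes.
for i in $(seq 1 50); do
    attrd_updater --name lsn_location --private --lifetime reboot --query \
        | grep -qF "$lsn" && break
    sleep 0.1
done
```

These commands require a running Pacemaker cluster, so treat this purely as an ops sketch of the loop, not something runnable standalone.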
Re: [ClusterLabs] Pacemaker resources are not scheduled
On Mon, 2018-04-16 at 23:52 +0800, lkxjtu wrote:
> > Lkxjtu,
>
> > On 14/04/18 00:16 +0800, lkxjtu wrote:
> >> My cluster version:
> >> Corosync 2.4.0
> >> Pacemaker 1.1.16
> >>
> >> There are many resource anomalies. Some resources are only monitored
> >> and not recovered. Some resources are not monitored or recovered.
> >> Only one resource of vnm is scheduled normally, but this resource
> >> cannot be started because other resources in the cluster are
> >> abnormal. Just like a deadlock. I have been plagued by this problem
> >> for a long time. I just want a stable and highly available resource
> >> with infinite recovery for everyone. Is my resource configure
> >> correct?
>
> > see below
>
> >> $ cat /etc/corosync/corosync.conf
> >> [co]mpatibility: whitetank
> >>
> >> [...]
> >>
> >> logging {
> >>     fileline: off
> >>     to_stderr: no
> >>     to_logfile: yes
> >>     logfile: /root/info/logs/pacemaker_cluster/corosync.log
> >>     to_syslog: yes
> >>     syslog_facility: daemon
> >>     syslog_priority: info
> >>     debug: off
> >>     function_name: on
> >>     timestamp: on
> >>     logger_subsys {
> >>         subsys: AMF
> >>         debug: off
> >>         tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> >>     }
> >> }
> >>
> >> amf {
> >>     mode: disabled
> >> }
> >>
> >> aisexec {
> >>     user: root
> >>     group: root
> >> }
>
> > You are apparently mixing configuration directives for older major
> > version(s) of corosync than you claim to be using.
> > See corosync_conf(5) + votequorum(5) man pages for what you are
> > supposed to configure with the actual version.
>
> Thank you for your detailed answer!
> Corosync.conf is part of the ansible scripts, but corosync and
> pacemaker are updated with the yum source. So it has caused the
> current gap. I will carefully compare the gap between the new and old
> versions
>
> > Regarding your pacemaker configuration:
>
> >> $ crm configure show
> >>
> >> [... reordered ...]
> >>
> >> property cib-bootstrap-options: \
> >>     have-watchdog=false \
> >>     dc-version=1.1.16-12.el7-94ff4df \
> >>     cluster-infrastructure=corosync \
> >>     stonith-enabled=false \
> >>     start-failure-is-fatal=false \
> >>     load-threshold="3200%"
>
> > You are urged to configure fencing, otherwise asking for sane
> > cluster's behaviour (which you do) is out of question, unless
> > you precisely know why you are not configuring it.
>
> My environment is a virtual machine environment. There is no fencing
> device. Can I configure fencing? How to do it?

If you have access to the physical host, see fence_virtd and the fence_xvm fence agent. They implement fencing by having the hypervisor kill the VM.

There are some limitations to that approach. If the host itself is dead, then the fencing will fail. So it makes the most sense when all the VMs are on a single host -- but that introduces a single point of failure.

If you don't have access to the physical host, see if you have access to some sort of VM management API. If so, you can write a fence agent that calls the API to kill the VM (fence agents already exist for some public cloud providers).

> >> [... reordered ...]
>
> > Furthermore you are using custom resource agents of undisclosed
> > quality and compatibility with the requirements:
> > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf
> > https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc
>
> > Since your resources come in isolated groups, I would go one
> > by one, trying to figure out why the group won't run as expected.
>
> > For instance:
>
> >> primitive inetmanager inetmanager \
> >>     op monitor interval=10s timeout=160 \
> >>     op stop interval=0 timeout=60s on-fail=restart \
> >>     op start interval=0 timeout=60s on-fail=restart \
> >>     meta migration-threshold=2 failure-timeout=60s resource-stickiness=100
> >> primitive inetmanager_vip IPaddr2 \
> >>     params ip=122.0.1.201 cidr_netmask=24 \
> >>     op start interval=0 timeout=20 \
> >>     op stop interval=0 timeout=20 \
> >>     op monitor timeout=20s interval=10s depth=0 \
> >>     meta migration-threshold=3 failure-timeout=60s
> >> [...]
> >> colocation inetmanager_col +inf: inetmanager_vip inetmanager
> >> order inetmanager_order Mandatory: inetmanager inetmanager_vip
> >>
> >> [...]
> >>
> >> $ crm status
> >> [...]
> >> Full list of resources:
> >> [...]
> >> inetmanager_vip (ocf::heartbeat:IPaddr2): Stopped
> >> inetmanager (ocf::heartbeat:inetmanager): Stopped
> >>
> >> [...]
> >>
> >> corosync.log of node 122.0.1.10
> >> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: warning: status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed (target: 7 vs. rc: 1): Error
> >> Apr
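For the fence_virtd/fence_xvm approach mentioned above, the cluster-side setup might look roughly like this (a sketch, not a tested recipe: the resource name, host map, and key path are assumptions for illustration, and fence_virtd must already be configured on the hypervisor with the shared key distributed to every cluster node):

```shell
# Create a stonith resource using fence_xvm; pcmk_host_map maps
# cluster node names to the VM names known to the hypervisor.
pcs stonith create fence-vms fence_xvm \
    pcmk_host_map="node1:vm-node1;node2:vm-node2" \
    key_file=/etc/cluster/fence_xvm.key

# Re-enable fencing once the device is in place (the quoted config
# above has stonith-enabled=false).
pcs property set stonith-enabled=true

# Sanity check from a node: list VMs reachable via fence_virtd.
fence_xvm -o list
```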
Re: [ClusterLabs] Pacemaker resources are not scheduled
> Lkxjtu,

> On 14/04/18 00:16 +0800, lkxjtu wrote:
>> My cluster version:
>> Corosync 2.4.0
>> Pacemaker 1.1.16
>>
>> There are many resource anomalies. Some resources are only monitored
>> and not recovered. Some resources are not monitored or recovered.
>> Only one resource of vnm is scheduled normally, but this resource
>> cannot be started because other resources in the cluster are
>> abnormal. Just like a deadlock. I have been plagued by this problem
>> for a long time. I just want a stable and highly available resource
>> with infinite recovery for everyone. Is my resource configure
>> correct?

> see below

>> $ cat /etc/corosync/corosync.conf
>> [co]mpatibility: whitetank
>>
>> [...]
>>
>> logging {
>>     fileline: off
>>     to_stderr: no
>>     to_logfile: yes
>>     logfile: /root/info/logs/pacemaker_cluster/corosync.log
>>     to_syslog: yes
>>     syslog_facility: daemon
>>     syslog_priority: info
>>     debug: off
>>     function_name: on
>>     timestamp: on
>>     logger_subsys {
>>         subsys: AMF
>>         debug: off
>>         tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>>     }
>> }
>>
>> amf {
>>     mode: disabled
>> }
>>
>> aisexec {
>>     user: root
>>     group: root
>> }

> You are apparently mixing configuration directives for older major
> version(s) of corosync than you claim to be using.
> See corosync_conf(5) + votequorum(5) man pages for what you are
> supposed to configure with the actual version.

Thank you for your detailed answer!
Corosync.conf is part of the ansible scripts, but corosync and pacemaker are updated with the yum source. So it has caused the current gap. I will carefully compare the gap between the new and old versions.

> Regarding your pacemaker configuration:

>> $ crm configure show
>>
>> [... reordered ...]
>>
>> property cib-bootstrap-options: \
>>     have-watchdog=false \
>>     dc-version=1.1.16-12.el7-94ff4df \
>>     cluster-infrastructure=corosync \
>>     stonith-enabled=false \
>>     start-failure-is-fatal=false \
>>     load-threshold="3200%"

> You are urged to configure fencing, otherwise asking for sane
> cluster's behaviour (which you do) is out of question, unless
> you precisely know why you are not configuring it.

My environment is a virtual machine environment. There is no fencing device. Can I configure fencing? How to do it?

>> [... reordered ...]

> Furthermore you are using custom resource agents of undisclosed
> quality and compatibility with the requirements:
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf
> https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc

> Since your resources come in isolated groups, I would go one
> by one, trying to figure out why the group won't run as expected.

> For instance:

>> primitive inetmanager inetmanager \
>>     op monitor interval=10s timeout=160 \
>>     op stop interval=0 timeout=60s on-fail=restart \
>>     op start interval=0 timeout=60s on-fail=restart \
>>     meta migration-threshold=2 failure-timeout=60s resource-stickiness=100
>> primitive inetmanager_vip IPaddr2 \
>>     params ip=122.0.1.201 cidr_netmask=24 \
>>     op start interval=0 timeout=20 \
>>     op stop interval=0 timeout=20 \
>>     op monitor timeout=20s interval=10s depth=0 \
>>     meta migration-threshold=3 failure-timeout=60s
>> [...]
>> colocation inetmanager_col +inf: inetmanager_vip inetmanager
>> order inetmanager_order Mandatory: inetmanager inetmanager_vip
>>
>> [...]
>>
>> $ crm status
>> [...]
>> Full list of resources:
>> [...]
>> inetmanager_vip (ocf::heartbeat:IPaddr2): Stopped
>> inetmanager (ocf::heartbeat:inetmanager): Stopped
>>
>> [...]
>>
>> corosync.log of node 122.0.1.10
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: warning: status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed (target: 7 vs. rc: 1): Error
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info: abort_transition_graph: Transition aborted by operation inetmanager_monitor_0 'modify' on 122.0.1.9: Event failed | magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400 source=match_graph_event:310 complete=false
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info: match_graph_event: Action inetmanager_monitor_0 (24) confirmed on 122.0.1.9 (rc=1)
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info: process_graph_event: Detected action (360.24) inetmanager_monitor_0.2152=unknown error: failed
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: warning: status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed (target: 7 vs. rc: 1): Error
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info:
Re: [ClusterLabs] Booth fail-over conditions
Zach Anderson writes:
> Hey all,
>
> new user to pacemaker/booth and I'm fumbling my way through my first proof
> of concept. I have a 2-site configuration with local pacemaker clusters at
> each site (running rabbitmq) and a booth arbitrator. I've successfully
> validated the basic failover when the "granted" site has failed. My
> question is whether there are any other ways to configure failover,
> i.e. using resource health checks or the like?

Hi Zach,

Do you mean that a resource health check should trigger site failover? That's actually something I'm not sure comes built in... though making a resource agent which revokes a ticket on failure should be fairly straightforward. You could then group your resource with the ticket resource to enable this functionality.

The logic in the ticket resource ought to be something like "if monitor fails and the current site is granted, then revoke the ticket, else do nothing". You would probably want to handle probe monitor invocations differently; there is an ocf_is_probe function provided to help with this.

Cheers,
Kristoffer

> Thanks!

--
// Kristoffer Grönlund
// kgronl...@suse.com
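The decision logic Kristoffer describes ("if monitor fails and the current site is granted, then revoke the ticket, else do nothing; handle probes differently") can be sketched as a plain shell function. This is only an illustration, not a real resource agent: the function name and arguments are invented, the actual health check and the booth revoke call are stubbed out as a printed decision.

```shell
#!/bin/sh
# Decide what a hypothetical ticket-watchdog agent's monitor would do.
#   $1 = health-check exit code (0 = healthy)
#   $2 = 1 if this site currently holds the booth ticket, else 0
#   $3 = 1 if this invocation is a probe (cf. ocf_is_probe), else 0
# Prints "revoke" (the agent would then run the booth revoke) or "noop".
decide() {
    health_rc=$1; granted=$2; probe=$3
    if [ "$probe" -eq 1 ]; then
        # Probes only report state; never act on them.
        echo noop
    elif [ "$health_rc" -ne 0 ] && [ "$granted" -eq 1 ]; then
        echo revoke
    else
        echo noop
    fi
}

decide 0 1 0   # healthy, granted          -> noop
decide 1 1 0   # failed, granted           -> revoke
decide 1 0 0   # failed, ticket elsewhere  -> noop
decide 1 1 1   # probe                     -> noop
```

In a real agent the "revoke" branch would shell out to booth to drop the ticket, and the grant state would be read from the cluster rather than passed in.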
Re: [ClusterLabs] Pacemaker resources are not scheduled
Lkxjtu,

On 14/04/18 00:16 +0800, lkxjtu wrote:
> My cluster version:
> Corosync 2.4.0
> Pacemaker 1.1.16
>
> There are many resource anomalies. Some resources are only monitored
> and not recovered. Some resources are not monitored or recovered.
> Only one resource of vnm is scheduled normally, but this resource
> cannot be started because other resources in the cluster are
> abnormal. Just like a deadlock. I have been plagued by this problem
> for a long time. I just want a stable and highly available resource
> with infinite recovery for everyone. Is my resource configure
> correct?

see below

> $ cat /etc/corosync/corosync.conf
> [co]mpatibility: whitetank
>
> [...]
>
> logging {
>     fileline: off
>     to_stderr: no
>     to_logfile: yes
>     logfile: /root/info/logs/pacemaker_cluster/corosync.log
>     to_syslog: yes
>     syslog_facility: daemon
>     syslog_priority: info
>     debug: off
>     function_name: on
>     timestamp: on
>     logger_subsys {
>         subsys: AMF
>         debug: off
>         tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>     }
> }
>
> amf {
>     mode: disabled
> }
>
> aisexec {
>     user: root
>     group: root
> }

You are apparently mixing configuration directives for older major version(s) of corosync than you claim to be using. See corosync_conf(5) + votequorum(5) man pages for what you are supposed to configure with the actual version.

Regarding your pacemaker configuration:

> $ crm configure show
>
> [... reordered ...]
>
> property cib-bootstrap-options: \
>     have-watchdog=false \
>     dc-version=1.1.16-12.el7-94ff4df \
>     cluster-infrastructure=corosync \
>     stonith-enabled=false \
>     start-failure-is-fatal=false \
>     load-threshold="3200%"

You are urged to configure fencing, otherwise asking for sane cluster's behaviour (which you do) is out of the question, unless you precisely know why you are not configuring it.

> [... reordered ...]

Furthermore you are using custom resource agents of undisclosed quality and compatibility with the requirements:
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf
https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc

Since your resources come in isolated groups, I would go one by one, trying to figure out why the group won't run as expected. For instance:

> primitive inetmanager inetmanager \
>     op monitor interval=10s timeout=160 \
>     op stop interval=0 timeout=60s on-fail=restart \
>     op start interval=0 timeout=60s on-fail=restart \
>     meta migration-threshold=2 failure-timeout=60s resource-stickiness=100
> primitive inetmanager_vip IPaddr2 \
>     params ip=122.0.1.201 cidr_netmask=24 \
>     op start interval=0 timeout=20 \
>     op stop interval=0 timeout=20 \
>     op monitor timeout=20s interval=10s depth=0 \
>     meta migration-threshold=3 failure-timeout=60s
> [...]
> colocation inetmanager_col +inf: inetmanager_vip inetmanager
> order inetmanager_order Mandatory: inetmanager inetmanager_vip
>
> [...]
>
> $ crm status
> [...]
> Full list of resources:
> [...]
> inetmanager_vip (ocf::heartbeat:IPaddr2): Stopped
> inetmanager (ocf::heartbeat:inetmanager): Stopped
>
> [...]
>
> corosync.log of node 122.0.1.10
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: warning: status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed (target: 7 vs. rc: 1): Error
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info: abort_transition_graph: Transition aborted by operation inetmanager_monitor_0 'modify' on 122.0.1.9: Event failed | magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400 source=match_graph_event:310 complete=false
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info: match_graph_event: Action inetmanager_monitor_0 (24) confirmed on 122.0.1.9 (rc=1)
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info:
Re: [ClusterLabs] HALVM monitor action fail on slave node. Possible bug?
Hi Emmanuel, thank you for you support. I did a lot of checks during the WE and there are some updates: - Main problem is that ocf:heartbeat:LVM is old. The current version on centos 7 is 3.9.5 (package resource-agents). More precisely, in 3.9.5 the monitor function has one important assumption: the underlying storage is shared between all nodes in the cluster. So the monitor function checks the presence of the volume group on all nodes. From version 3.9.6 this is not the normal behavior and the monitor function (LVM_status) returns $OCF_NOT_RUNNING from slaves nodes without errors. You can check this in the file /usr/lib/ocf/resource.d/heartbeat/LVM in lines 340-351 that disappears in version 3.9.6. Obviously this is not error, but an important change in the cluster architecture because I need to use drbd in dual primary mode when version 3.9.5 is used. My personal idea is that drbd in dual primary mode with lvm is not a good idea due to the fact that I don't need an active/active cluster. Anyway, thank you for your time again Marco 2018-04-13 15:54 GMT+02:00 emmanuel segura: > the first thing that you need to configure is the stonith, because you > have this constraint "constraint order promote DrbdResClone then start > HALVM" > > To recover and promote drbd to master when you crash a node, configurare > the drbd fencing handler. > > pacemaker execute monitor in both nodes, so this is normal, to test why > monitor fail, use ocf-tester > > 2018-04-13 15:29 GMT+02:00 Marco Marino : > >> Hello, I'm trying to configure a simple 2 node cluster with drbd and >> HALVM (ocf:heartbeat:LVM) but I have a problem that I'm not able to solve, >> to I decided to write this long post. I need to really understand what I'm >> doing and where I'm doing wrong. >> More precisely, I'm configuring a pacemaker cluster with 2 nodes and only >> one drbd resource. 
Here are all the operations:
>>
>> - System configuration
>>   hostnamectl set-hostname pcmk[12]
>>   yum update -y
>>   yum install vim wget git -y
>>   vim /etc/sysconfig/selinux -> permissive mode
>>   systemctl disable firewalld
>>   reboot
>>
>> - Network configuration
>>   [pcmk1]
>>   nmcli connection modify corosync ipv4.method manual ipv4.addresses 192.168.198.201/24 ipv6.method ignore connection.autoconnect yes
>>   nmcli connection modify replication ipv4.method manual ipv4.addresses 192.168.199.201/24 ipv6.method ignore connection.autoconnect yes
>>   [pcmk2]
>>   nmcli connection modify corosync ipv4.method manual ipv4.addresses 192.168.198.202/24 ipv6.method ignore connection.autoconnect yes
>>   nmcli connection modify replication ipv4.method manual ipv4.addresses 192.168.199.202/24 ipv6.method ignore connection.autoconnect yes
>>
>>   ssh-keygen -t rsa
>>   ssh-copy-id root@pcmk[12]
>>   scp /etc/hosts root@pcmk2:/etc/hosts
>>
>> - Drbd repo configuration and drbd installation
>>   rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
>>   rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
>>   yum update -y
>>   yum install drbd84-utils kmod-drbd84 -y
>>
>> - Drbd configuration:
>>   Creating a new partition on top of /dev/vdb -> /dev/vdb1 of type "Linux" (83)
>>   [/etc/drbd.d/global_common.conf]
>>   usage-count no;
>>   [/etc/drbd.d/myres.res]
>>   resource myres {
>>       on pcmk1 {
>>           device /dev/drbd0;
>>           disk /dev/vdb1;
>>           address 192.168.199.201:7789;
>>           meta-disk internal;
>>       }
>>       on pcmk2 {
>>           device /dev/drbd0;
>>           disk /dev/vdb1;
>>           address 192.168.199.202:7789;
>>           meta-disk internal;
>>       }
>>   }
>>
>>   scp /etc/drbd.d/myres.res root@pcmk2:/etc/drbd.d/myres.res
>>   systemctl start drbd <-- only for test. The service is disabled at boot!
>>   drbdadm create-md myres
>>   drbdadm up myres
>>   drbdadm primary --force myres
>>
>> - LVM configuration
>>   [root@pcmk1 ~]# lsblk
>>   NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
>>   sr0            11:0    1 1024M  0 rom
>>   vda           252:0    0   20G  0 disk
>>   ├─vda1        252:1    0    1G  0 part /boot
>>   └─vda2        252:2    0   19G  0 part
>>     ├─cl-root   253:0    0   17G  0 lvm  /
>>     └─cl-swap   253:1    0    2G  0 lvm  [SWAP]
>>   vdb           252:16   0    8G  0 disk
>>   └─vdb1        252:17   0    8G  0 part  <--- /dev/vdb1 is the partition I'd like to use as the backing device for drbd
>>     └─drbd0     147:0    0    8G  0 disk
>>
>>   [/etc/lvm/lvm.conf]
>>   write_cache_state = 0
>>   use_lvmetad = 0
>>   filter = [ "a|drbd.*|", "a|vda.*|", "r|.*|" ]
>>
>>   Disabling the lvmetad service:
>>   systemctl disable lvm2-lvmetad.service
>>   systemctl disable lvm2-lvmetad.socket
>>   reboot
>>
>> - Creating volume group and logical volume
>>
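For reference, the Pacemaker side that Emmanuel's quoted constraint ("promote DrbdResClone then start HALVM") implies could look roughly like the following pcs sketch. The resource names DrbdRes/DrbdResClone/HALVM come from the thread; the volume group name `havolumegroup` and the specific options shown are assumptions for illustration, not commands taken from this thread:

```shell
# Hypothetical sketch only: names/options not confirmed by the thread.
pcs resource create DrbdRes ocf:linbit:drbd drbd_resource=myres \
    op monitor interval=30s
pcs resource master DrbdResClone DrbdRes \
    master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
pcs resource create HALVM ocf:heartbeat:LVM volgrpname=havolumegroup \
    exclusive=true op monitor interval=30s
pcs constraint order promote DrbdResClone then start HALVM
pcs constraint colocation add HALVM with master DrbdResClone INFINITY
```

The colocation rule keeps the VG activation on the drbd master; as Emmanuel notes, stonith and the drbd fencing handler still need to be configured before any failover testing is meaningful.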
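The LVM_status change Marco describes can be sketched as a pair of toy monitor functions. This is a simplified illustration of the behavioural difference, not the actual resource-agent code: in 3.9.5 a missing volume group on a slave is reported as an error, while from 3.9.6 it is reported as "not running", which is exactly what Pacemaker expects from a probe on a node where the resource is inactive:

```shell
# Standard OCF return codes (these values are OCF conventions):
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# 3.9.5-style monitor sketch: assumes shared storage, so a VG that is
# not active on this node (including a slave) is treated as an error.
monitor_395() {  # $1 = "yes" if the VG is active on this node
    [ "$1" = yes ] && return $OCF_SUCCESS
    return $OCF_ERR_GENERIC
}

# 3.9.6-style monitor sketch: a missing VG simply means "not running
# here", which is the expected probe answer on a slave node.
monitor_396() {
    [ "$1" = yes ] && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}
```

Under 3.9.5 semantics, a probe on the slave therefore surfaces as a failed monitor (as in the logs above), while under 3.9.6 semantics the same probe is a clean "not running".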
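The filter line in the lvm.conf above relies on first-match-wins semantics: patterns are tried in order, so /dev/drbd0 and the vda devices are accepted, and the raw drbd backing device /dev/vdb1 falls through to the final reject-all rule, keeping LVM from scanning it directly. A simplified shell emulation of that evaluation (not LVM's own code; real LVM matches against full device paths and uses unanchored search):

```shell
# Emulate lvm.conf filter = [ "a|drbd.*|", "a|vda.*|", "r|.*|" ]:
# try each rule in order, and the first matching pattern decides.
lvm_filter() {  # $1 = device name, e.g. "drbd0"
    for rule in "a:drbd.*" "a:vda.*" "r:.*"; do
        action=${rule%%:*}     # "a" = accept, "r" = reject
        pattern=${rule#*:}
        if printf '%s\n' "$1" | grep -Eq "^${pattern}\$"; then
            if [ "$action" = a ]; then echo accept; else echo reject; fi
            return
        fi
    done
}
```

With this ordering, `lvm_filter drbd0` and `lvm_filter vda2` print "accept" while `lvm_filter vdb1` prints "reject", matching the intent of hiding the backing partition from LVM.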
Re: [ClusterLabs] Fwd: Re: [Cluster-devel] [PATCH] dlm: prompt the user SCTP is experimental
Hi David and Mark,

I compiled my own DLM kernel module, getting rid of the "return -EINVAL;" line, and then ran some tests with the new module on a two-ring cluster with "protocol=tcp" set in /etc/dlm/dlm.conf:

1) If both networks were OK, all the tests passed.
2) If I broke the second ring's network, all the tests still passed (no effect, since the tcp protocol only uses the first ring's IP address).
3) If I broke the first ring's network (e.g. ifconfig eth0 down on node3), the tests hung on the other nodes (node1 and node2) until node3 was rebooted manually or node3's network came back (e.g. ifconfig eth0 up on node3).
4) When I switched the two-ring cluster to a one-ring cluster (by editing /etc/corosync/corosync.conf) and broke the network on one node, that node was fenced immediately.
5) So why was node3 not fenced in case 3)? It looks like a bug, since the tests hung and we had to reboot that node manually.

Thanks
Gang

>>> On Thu, Apr 12, 2018 at 09:31:49PM -0600, Gang He wrote:
>> During this period, could we allow the tcp protocol to work (rather than return an error directly) on a two-ring cluster?
>> If the user selects the TCP protocol on the command line or in the dlm configuration file, could we use the first ring's IP address?
>> I don't know why we return an error directly in this case; was there some concern before?
>
> You're talking about this:
>
>     /* We don't support multi-homed hosts */
>     if (dlm_local_addr[1] != NULL) {
>         log_print("TCP protocol can't handle multi-homed hosts, "
>                   "try SCTP");
>         return -EINVAL;
>     }
>
> I think that should be ok to remove, and just use the first addr.
> Mark, do you see any reason to avoid that?
>
> Dave

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org