Thanks very much, Ken, for your helpful information. I am now stuck: I can't see the Pacemaker DC take any further action (start, promote, etc.) on my resource agent, and I have found no helpful logs.
So my first question is: in what situation will the DC decide to call the start action? Does the monitor operation need to return OCF_SUCCESS? In my case it returns OCF_NOT_RUNNING, and after that the monitor operation is not called any more, which seems wrong to me, as I expected it to be called at every interval.

The resource agent's monitor logic: the xx_monitor function calls xx_update, which always hits the "$CRM_MASTER -D;;" branch. What does that usually mean? Will it stop the start operation from being called?

    ovsdb_server_master_update() {
        ocf_log info "ovsdb_server_master_update: $1}"
        case $1 in
            $OCF_SUCCESS)
                $CRM_MASTER -v ${slave_score};;
            $OCF_RUNNING_MASTER)
                $CRM_MASTER -v ${master_score};;
            #*) $CRM_MASTER -D;;
        esac
        ocf_log info "ovsdb_server_master_update end}"
    }

    ovsdb_server_monitor() {
        ocf_log info "ovsdb_server_monitor"
        ovsdb_server_check_status
        rc=$?
        ovsdb_server_master_update $rc
        ocf_log info "monitor is going to return $rc"
        return $rc
    }

Below is my cluster configuration:

1. First, I have a VIP set up.

    [root@node-1 ~]# pcs resource show
     vip__management_old (ocf::es:ns_IPaddr2): Started node-1.domain.tld

2. Use pcs to create ovndb-servers and the constraint.

    [root@node-1 ~]# pcs resource create tst-ovndb ocf:ovn:ovndb-servers manage_northd=yes master_ip=192.168.0.2 nb_master_port=6641 sb_master_port=6642 master
    ([root@node-1 ~]# pcs resource meta tst-ovndb-master notify=true
    Error: unable to find a resource/clone/master/group: tst-ovndb-master)
    ## This returned an error, so I changed to the command below.
    [root@node-1 ~]# pcs resource master tst-ovndb-master tst-ovndb notify=true
    [root@node-1 ~]# pcs constraint colocation add master tst-ovndb-master with vip__management_old

3. pcs status

    [root@node-1 ~]# pcs status
     vip__management_old (ocf::es:ns_IPaddr2): Started node-1.domain.tld
     Master/Slave Set: tst-ovndb-master [tst-ovndb]
         Stopped: [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]

4. pcs resource show XXX

    [root@node-1 ~]# pcs resource show vip__management_old
     Resource: vip__management_old (class=ocf provider=es type=ns_IPaddr2)
      Attributes: nic=br-mgmt base_veth=br-mgmt-hapr ns_veth=hapr-m ip=192.168.0.2 iflabel=ka cidr_netmask=24 ns=haproxy gateway=none gateway_metric=0 iptables_start_rules=false iptables_stop_rules=false iptables_comment=default-comment
      Meta Attrs: migration-threshold=3 failure-timeout=60 resource-stickiness=1
      Operations: monitor interval=3 timeout=30 (vip__management_old-monitor-3)
                  start interval=0 timeout=30 (vip__management_old-start-0)
                  stop interval=0 timeout=30 (vip__management_old-stop-0)

    [root@node-1 ~]# pcs resource show tst-ovndb-master
     Master: tst-ovndb-master
      Meta Attrs: notify=true
      Resource: tst-ovndb (class=ocf provider=ovn type=ovndb-servers)
       Attributes: manage_northd=yes master_ip=192.168.0.2 nb_master_port=6641 sb_master_port=6642
       Operations: start interval=0s timeout=30s (tst-ovndb-start-timeout-30s)
                   stop interval=0s timeout=20s (tst-ovndb-stop-timeout-20s)
                   promote interval=0s timeout=50s (tst-ovndb-promote-timeout-50s)
                   demote interval=0s timeout=50s (tst-ovndb-demote-timeout-50s)
                   monitor interval=30s timeout=20s (tst-ovndb-monitor-interval-30s)
                   monitor interval=10s role=Master timeout=20s (tst-ovndb-monitor-interval-10s-role-Master)
                   monitor interval=30s role=Slave timeout=20s (tst-ovndb-monitor-interval-30s-role-Slave)

    colocation colocation-tst-ovndb-master-vip__management_old-INFINITY inf: tst-ovndb-master:Master vip__management_old:Started

5. I have added logging to every ovndb-servers operation. It seems only the monitor op is being called; there is no promote from the Pacemaker DC:

    <30>Nov 30 15:22:19 node-1 ovndb-servers(tst-ovndb)[2980860]: INFO: ovsdb_server_monitor
    <30>Nov 30 15:22:19 node-1 ovndb-servers(tst-ovndb)[2980860]: INFO: ovsdb_server_check_status
    <30>Nov 30 15:22:19 node-1 ovndb-servers(tst-ovndb)[2980860]: INFO: return OCFOCF_NOT_RUNNINGG
    <30>Nov 30 15:22:20 node-1 ovndb-servers(tst-ovndb)[2980860]: INFO: ovsdb_server_master_update: 7}
    <30>Nov 30 15:22:20 node-1 ovndb-servers(tst-ovndb)[2980860]: INFO: ovsdb_server_master_update end}
    <30>Nov 30 15:22:20 node-1 ovndb-servers(tst-ovndb)[2980860]: INFO: monitor is going to return 7
    <30>Nov 30 15:22:20 node-1 ovndb-servers(undef)[2980970]: INFO: metadata exit OCF_SUCCESS}

6. The cluster properties:

    property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.12-a14efad \
        cluster-infrastructure=corosync \
        no-quorum-policy=ignore \
        stonith-enabled=false \
        symmetric-cluster=false \
        last-lrm-refresh=1511802933

Thank you very much for any help.

Hui.

Date: Mon, 27 Nov 2017 12:07:57 -0600
From: Ken Gaillot <kgail...@redhat.com>
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>, jpoko...@redhat.com
Subject: Re: [ClusterLabs] pcs create master/slave resource doesn't work
Message-ID: <1511806077.5194.6.ca...@redhat.com>
Content-Type: text/plain; charset="UTF-8"

On Fri, 2017-11-24 at 18:00 +0800, Hui Xiang wrote:
> Jan,
>
>   I very much appreciate your help. I am getting further, but it still
> looks very strange.
>
> 1. To use "debug-promote", I upgraded pacemaker from 1.12 to 1.16, and pcs
> to 0.9.160.
>
> 2. Recreated the resources with the commands below:
> pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
>   master_ip=192.168.0.99 \
>   op monitor interval="10s" \
>   op monitor interval="11s" role=Master
> pcs resource master ovndb_servers-master ovndb_servers \
>   meta notify="true" master-max="1" master-node-max="1" clone-max="3" clone-node-max="1"
> pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=192.168.0.99 \
>     op monitor interval=10s
> pcs constraint colocation add VirtualIP with master ovndb_servers-master \
>   score=INFINITY
>
> 3. pcs status
>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>      Stopped: [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]
>  VirtualIP (ocf::heartbeat:IPaddr2): Stopped
>
> 4. Manually ran 'debug-start' on all 3 nodes and 'debug-promote' on one
> of the nodes.
> Ran the command below on [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]:
> # pcs resource debug-start ovndb_servers --full
> Ran the command below on [ node-1.domain.tld ]:
> # pcs resource debug-promote ovndb_servers --full

Before running debug-* commands, I'd unmanage the resource or put the cluster in maintenance mode, so Pacemaker doesn't try to "correct" your actions.

>
> 5. pcs status
>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>      Stopped: [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]
>  VirtualIP (ocf::heartbeat:IPaddr2): Stopped
>
> 6. However, I have seen that one of the ovndb_servers has indeed been
> promoted to master, but pcs status still shows all 'stopped'.
> What am I missing?

It's hard to tell from these logs. It's possible the resource agent's monitor command is not exiting with the expected status values:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_requirements_for_multi_state_resource_agents

One of the nodes will be elected the DC, meaning it coordinates the cluster's actions. The DC's logs will have more "pengine:" messages, with each action that needs to be taken (e.g. "* Start <rsc> <node>"). You can look through those actions to see what the cluster decided to do -- whether the resources were ever started, whether any was promoted, and whether any were explicitly stopped.

>  >  stderr: + 17:45:59: ocf_log:327: __OCF_MSG='ovndb_servers: Promoting node-1.domain.tld as the master'
>  >  stderr: + 17:45:59: ocf_log:329: case "${__OCF_PRIO}" in
>  >  stderr: + 17:45:59: ocf_log:333: __OCF_PRIO=INFO
>  >  stderr: + 17:45:59: ocf_log:338: '[' INFO = DEBUG ']'
>  >  stderr: + 17:45:59: ocf_log:341: ha_log 'INFO: ovndb_servers: Promoting node-1.domain.tld as the master'
>  >  stderr: + 17:45:59: ha_log:253: __ha_log 'INFO: ovndb_servers: Promoting node-1.domain.tld as the master'
>  >  stderr: + 17:45:59: __ha_log:185: local ignore_stderr=false
>  >  stderr: + 17:45:59: __ha_log:186: local loglevel
>  >  stderr: + 17:45:59: __ha_log:188: '[' 'xINFO: ovndb_servers: Promoting node-1.domain.tld as the master' = x--ignore-stderr ']'
>  >  stderr: + 17:45:59: __ha_log:190: '[' none = '' ']'
>  >  stderr: + 17:45:59: __ha_log:192: tty
>  >  stderr: + 17:45:59: __ha_log:193: '[' x = x0 -a x = xdebug ']'
>  >  stderr: + 17:45:59: __ha_log:195: '[' false = true ']'
>  >  stderr: + 17:45:59: __ha_log:199: '[' '' ']'
>  >  stderr: + 17:45:59: __ha_log:202: echo 'INFO: ovndb_servers: Promoting node-1.domain.tld as the master'
>  >  stderr: INFO: ovndb_servers: Promoting node-1.domain.tld as the master
>  >  stderr: + 17:45:59: __ha_log:204: return 0
>  >  stderr: + 17:45:59: ovsdb_server_promote:378: /usr/sbin/crm_attribute --type crm_config --name OVN_REPL_INFO -s ovn_ovsdb_master_server -v node-1.domain.tld
>  >  stderr: + 17:45:59: ovsdb_server_promote:379: ovsdb_server_master_update 8
>  >  stderr: + 17:45:59: ovsdb_server_master_update:214: case $1 in
>  >  stderr: + 17:45:59: ovsdb_server_master_update:218: /usr/sbin/crm_master -l reboot -v 10
>  >  stderr: + 17:45:59: ovsdb_server_promote:380: return 0
>  >  stderr: + 17:45:59: 458: rc=0
>  >  stderr: + 17:45:59: 459: exit 0
>
>
> On 23/11/17 23:52 +0800, Hui Xiang wrote:
> > I am working on HA with 3 nodes, with the configuration below:
> >
> > """
> > pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
> >   master_ip=168.254.101.2 \
> >   op monitor interval="10s" \
> >   op monitor interval="11s" role=Master
> > pcs resource master ovndb_servers-master ovndb_servers \
> >   meta notify="true" master-max="1" master-node-max="1" clone-max="3" clone-node-max="1"
> > pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=168.254.101.2 \
> >     op monitor interval=10s
> > pcs constraint order promote ovndb_servers-master then VirtualIP
> > pcs constraint colocation add VirtualIP with master ovndb_servers-master \
> >   score=INFINITY
> > """
>
> (Out of curiosity, this looks like a mix of output from
> pcs config export pcs-commands [or clufter cib2pcscmd -s]
> and manual editing.  Is this a good guess?)
> It's the output of "pcs status".
>
> >   However, after setting it up as above, no master is selected and all
> > instances are stopped. From the pacemaker log, node-1 has been chosen as
> > the master. I am confused about what is wrong. Can anybody help? It would
> > be very much appreciated.
> >
> >
> >  Master/Slave Set: ovndb_servers-master [ovndb_servers]
> >      Stopped: [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]
> >  VirtualIP (ocf::heartbeat:IPaddr2): Stopped
> >
> >
> > # pacemaker log
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++ /cib/configuration/resources:  <primitive class="ocf" id="ovndb_servers" provider="ovn" type="ovndb-servers"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++                                  <instance_attributes id="ovndb_servers-instance_attributes">
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++                                    <nvpair id="ovndb_servers-instance_attributes-master_ip" name="master_ip" value="168.254.101.2"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++                                  </instance_attributes>
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++                                  <operations>
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++                                    <op id="ovndb_servers-start-timeout-30s" interval="0s" name="start" timeout="30s"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++                                    <op id="ovndb_servers-stop-timeout-20s" interval="0s" name="stop" timeout="20s"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++                                    <op id="ovndb_servers-promote-timeout-50s" interval="0s" name="promote" timeout="50s"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++                                    <op id="ovndb_servers-demote-timeout-50s" interval="0s" name="demote" timeout="50s"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++                                    <op id="ovndb_servers-monitor-interval-10s" interval="10s" name="monitor"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++                                    <op id="ovndb_servers-monitor-interval-11s-role-Master" interval="11s" name="monitor" role="Master"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++                                  </operations>
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++                                </primitive>
> >
> > Nov 23 23:06:03 [665249] node-1.domain.tld      attrd:     info: attrd_peer_update: Setting master-ovndb_servers[node-1.domain.tld]: (null) -> 5 from node-1.domain.tld
>
> If it's probable that your ocf:ovn:ovndb-servers agent in master mode can
> run something like "attrd_updater -n master-ovndb_servers -U 5", then
> it was indeed launched OK, and if it does not continue to run as
> expected, there may be a problem with the agent itself.
>
> no change.
> You can try running "pcs resource debug-promote ovndb_servers --full"
> to examine the execution details (assuming the agent responds to the
> OCF_TRACE_RA=1 environment variable, which is what shell-based
> agents built on top of the ocf-shellfuncs sourceable shell library from
> the resource-agents project, hence also the agents it ships,
> customarily do).
> Yes, thanks, it's helpful.
>
> > Nov 23 23:06:03 [665251] node-1.domain.tld       crmd:   notice: process_lrm_event: Operation ovndb_servers_monitor_0: ok (node=node-1.domain.tld, call=185, rc=0, cib-update=88, confirmed=true)
> > <29>Nov 23 23:06:03 node-1 crmd[665251]:   notice: process_lrm_event: Operation ovndb_servers_monitor_0: ok (node=node-1.domain.tld, call=185, rc=0, cib-update=88, confirmed=true)
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: Diff: --- 0.630.2 2
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: Diff: +++ 0.630.3 (null)
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: +  /cib:  @num_updates=3
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_perform_op: ++ /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']:  <nvpair id="status-1-master-ovndb_servers" name="master-ovndb_servers" value="5"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node-3.domain.tld/attrd/80, version=0.630.3)
>
> Also depends if there's anything interesting after this point...
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-- 
Ken Gaillot <kgail...@redhat.com>
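For anyone following the monitor question at the top of this message, here is a minimal, self-contained sketch of the score-update logic in ovsdb_server_master_update. It is an illustration under assumptions, not the agent's actual code: crm_master is replaced by an echo stub so the script runs without a cluster, the function name master_update is hypothetical, and the scores 5 and 10 are the values that appear in the attrd and crm_master lines quoted in this thread. It shows which crm_master call each monitor return code maps to when the catch-all branch is enabled.

```shell
#!/bin/sh
# OCF return codes used by the monitor action.
OCF_SUCCESS=0          # running as slave
OCF_NOT_RUNNING=7      # cleanly stopped
OCF_RUNNING_MASTER=8   # running as master

# Assumed scores, taken from the values seen in the quoted logs.
slave_score=5
master_score=10

# Stub standing in for the real crm_master command (assumption: lets the
# sketch run on a machine without Pacemaker installed).
CRM_MASTER="echo crm_master"

# Hypothetical stand-in for ovsdb_server_master_update, with the
# catch-all branch enabled instead of commented out.
master_update() {
    case $1 in
        "$OCF_SUCCESS")        $CRM_MASTER -v "$slave_score" ;;
        "$OCF_RUNNING_MASTER") $CRM_MASTER -v "$master_score" ;;
        *)                     $CRM_MASTER -D ;;
    esac
}

# Show the call made for each monitor result.
for rc in "$OCF_SUCCESS" "$OCF_RUNNING_MASTER" "$OCF_NOT_RUNNING"; do
    printf 'monitor rc=%s -> ' "$rc"
    master_update "$rc"
done
```

With the catch-all enabled, a monitor result of 7 (OCF_NOT_RUNNING) maps to crm_master -D, which deletes this node's promotion score for the resource; results 0 and 8 set the slave and master scores respectively.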