On 03/23/2016 12:01 PM, Vladislav Bogdanov wrote:
> 23.03.2016 19:52, Vladislav Bogdanov wrote:
>> 23.03.2016 19:39, Ken Gaillot wrote:
>>> On 03/23/2016 07:35 AM, Vladislav Bogdanov wrote:
>>>> Hi!
>>>>
>>>> It seems like atomic attrd in post-1.1.14 (eb89393) does not
>>>> fully clean the node cache after a node is removed.
I haven't forgotten; this was a tricky one :-)

I believe this regression was introduced in da17fd0, which clears the
node's attribute *values* when purging the node, but not the value
*structures* that contain the node name and ID. That was intended as a
fix for when nodes leave and rejoin, but the same function is also used
to handle "crm_node -R" requests, which should cause complete removal.
I hope to have a fix soon.

Note that the behavior may still occur if "crm_node -R" is not called
after reloading corosync.

>>> Is this a regression? Or have you only tried it with this version?
>>
>> Only with this one.
>>
>>>
>>>> After our QA guys remove node wa-test-server-ha-03 from a two-node
>>>> cluster:
>>>> * stop pacemaker and corosync on wa-test-server-ha-03
>>>> * remove node wa-test-server-ha-03 from the corosync nodelist on
>>>>   wa-test-server-ha-04
>>>> * tune votequorum settings
>>>> * reload corosync on wa-test-server-ha-04
>>>> * remove the node from pacemaker on wa-test-server-ha-04
>>>> * delete everything from /var/lib/pacemaker/cib on wa-test-server-ha-03
>>>> and then join it with a different corosync ID (but the same node
>>>> name), we see the following in the logs:
>>>>
>>>> Leave node 1 (wa-test-server-ha-03):
>>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice:
>>>> crm_update_peer_proc: Node wa-test-server-ha-03[1] - state is now
>>>> lost (was member)
>>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: Removing
>>>> all wa-test-server-ha-03 (1) attributes for attrd_peer_change_cb
>>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: Lost
>>>> attribute writer wa-test-server-ha-03
>>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: Removing
>>>> wa-test-server-ha-03/1 from the membership list
>>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: Purged 1
>>>> peers with id=1 and/or uname=wa-test-server-ha-03 from the membership
>>>> cache
>>>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]: notice:
>>>> Processing peer-remove from wa-test-server-ha-04: wa-test-server-ha-03 0
>>>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]: notice: Removing
>>>> all wa-test-server-ha-03 (0) attributes for wa-test-server-ha-04
>>>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]: notice: Removing
>>>> wa-test-server-ha-03/1 from the membership list
>>>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]: notice: Purged 1
>>>> peers with id=0 and/or uname=wa-test-server-ha-03 from the membership
>>>> cache
>>>>
>>>> Join node 3 (the same one, wa-test-server-ha-03, but the ID differs):
>>>> Mar 23 04:21:23 wa-test-server-ha-04 attrd[25962]: notice:
>>>> crm_update_peer_proc: Node wa-test-server-ha-03[3] - state is now
>>>> member (was (null))
>>>> Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]: warning:
>>>> crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 -
>>>> a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
>>>> Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]: warning:
>>>> crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 -
>>>> 6c18faa1-f8c2-4b0c-907c-20db450e2e79
>>>> Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]: crit: Node 1
>>>> and 3 share the same name 'wa-test-server-ha-03'
>>>
>>> It took me a while to understand the above combination of messages.
>>> This is not node 3 joining. This is node 1 joining after node 3 has
>>> already been seen.
>>
>> Hmmm...
>> corosync.conf and corosync-cmapctl both say it is 3.
>> Also, the cib lists it as 3, and lrmd puts its status records under 3.
>
> I mean:
>
> <node_state id="3" uname="wa-test-server-ha-03" crmd="online"
>     crm-debug-origin="do_update_resource" in_ccm="true" join="member"
>     expected="member">
>   <lrm id="3">
>     <lrm_resources>
>     ...
>     </lrm_resources>
>   </lrm>
> </node_state>
> <node_state id="1">
>   <transient_attributes id="1">
>     <instance_attributes id="status-1">
>       <nvpair id="status-1-shutdown" name="shutdown" value="0"/>
>       <nvpair id="status-1-master-rabbitmq-local"
>           name="master-rabbitmq-local" value="1"/>
>       <nvpair id="status-1-master-meta-0-0-drbd"
>           name="master-meta-0-0-drbd" value="10000"/>
>       <nvpair id="status-1-master-staging-0-0-drbd"
>           name="master-staging-0-0-drbd" value="10000"/>
>       <nvpair id="status-1-rabbit-start-time" name="rabbit-start-time"
>           value="1458732136"/>
>     </instance_attributes>
>   </transient_attributes>
> </node_state>
>
>> Actually, the issue is that drbd resources are not promoted because
>> their master attributes go to the section with node-id 1, and that is
>> the only reason we found this. Everything not related to volatile
>> attributes works well.
>>
>>> The warnings are a complete dump of the peer cache. So you can see
>>> that wa-test-server-ha-03 is listed only once, with id 3.
>>>
>>> The critical message ("Node 1 and 3") lists the new id first and the
>>> found ID second. So id 1 is what it's trying to add to the cache.
>>
>> But there is also 'Node 'wa-test-server-ha-03' has changed its ID from
>> 1 to 3' - it goes first. Does that matter?
>>
>>> Did you update the node ID in corosync.conf on *both* nodes?
>>
>> Sure.
>> It is automatically copied to a node being joined.
>>
>>>
>>>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]: notice: Node
>>>> 'wa-test-server-ha-03' has changed its ID from 1 to 3
>>>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]: warning:
>>>> crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 -
>>>> a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
>>>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]: warning:
>>>> crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 -
>>>> 6c18faa1-f8c2-4b0c-907c-20db450e2e79
>>>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]: crit: Node 1
>>>> and 3 share the same name 'wa-test-server-ha-03'
>>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: notice: Node
>>>> 'wa-test-server-ha-03' has changed its ID from 1 to 3
>>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: warning:
>>>> crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 -
>>>> a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
>>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: warning:
>>>> crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 -
>>>> 6c18faa1-f8c2-4b0c-907c-20db450e2e79
>>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: crit: Node 1
>>>> and 3 share the same name 'wa-test-server-ha-03'
>>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: notice: Node
>>>> 'wa-test-server-ha-03' has changed its ID from 3 to 1
>>>> ...
>>>>
>>>> On the node being joined:
>>>> Mar 23 04:21:23 wa-test-server-ha-03 attrd[15260]: notice:
>>>> Connecting to cluster infrastructure: corosync
>>>> Mar 23 04:21:23 wa-test-server-ha-03 attrd[15260]: notice:
>>>> crm_update_peer_proc: Node wa-test-server-ha-03[3] - state is now
>>>> member (was (null))
>>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: notice:
>>>> crm_update_peer_proc: Node wa-test-server-ha-04[2] - state is now
>>>> member (was (null))
>>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: notice: Recorded
>>>> attribute writer: wa-test-server-ha-04
>>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: notice:
>>>> Processing sync-response from wa-test-server-ha-04
>>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: warning:
>>>> crm_find_peer: Node 2/wa-test-server-ha-04 = 0xdfe620 -
>>>> ad08ca96-295a-4fa4-99f9-8c8a2d0b6ac0
>>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: warning:
>>>> crm_find_peer: Node 3/wa-test-server-ha-03 = 0xd7ae20 -
>>>> f85bdc4b-a3ee-47ff-bdd5-7c1dcf9fe97c
>>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: crit: Node 1
>>>> and 3 share the same name 'wa-test-server-ha-03'
>>>> Mar 23 04:21:26 wa-test-server-ha-03 attrd[15260]: notice: Node
>>>> 'wa-test-server-ha-03' has changed its ID from 1 to 3
>>>> Mar 23 04:21:26 wa-test-server-ha-03 attrd[15260]: notice: Updating
>>>> all attributes after cib_refresh_notify event
>>>> Mar 23 04:21:26 wa-test-server-ha-03 attrd[15260]: notice: Updating
>>>> all attributes after cib_refresh_notify event
>>>>
>>>> The CIB status section after that contains entries for three nodes,
>>>> with IDs 1, 2 and 3:
>>>> For node 2 (the one that remained) there are both transient
>>>> attributes and lrm statuses.
>>>> For node 1 (the one that was removed) - only transient attributes.
>>>> For node 3 (the newly joined one) - only lrm statuses.
>>>>
>>>> That makes me think that not everything is removed (stale caches?)
>>>> from attrd during node leave.
>>>>
>>>> Is there some other information I can provide to solve this issue?
>>>>
>>>> Best,
>>>> Vladislav

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org