On Wed, 2018-09-05 at 13:31 +0200, Cesar Hernandez wrote:
> Hi
> 
> > The first fencing is legitimate -- the node hasn't been seen at
> > start-up, and so needs to be fenced. The second fencing will be the
> > one of interest. Also, look for the result of the first fencing.
> 
> The first fencing finished OK, as did the other two fencing
> operations.
> 
> Aug 31 10:59:31 [30612] node1 stonith-ng: notice: log_operation: Operation 'reboot' [31075] (call 2 from crmd.30616) for host 'node2' with device 'st-fence_propio:0' returned: 0 (OK)
> 
> And the next log entries are:
> 
> Aug 31 10:59:31 [30612] node1 stonith-ng: notice: remote_op_done: Operation reboot of node2 by node1 for crmd.30616@node1.2d7857e7: OK
> Aug 31 10:59:31 [30616] node1 crmd: notice: tengine_stonith_callback: Stonith operation 2/81:0:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7: OK (0)
> Aug 31 10:59:31 [30616] node1 crmd: info: crm_get_peer: Created entry 48db3347-5bbe-4cd4-b7ba-db4c697c3146/0x55adbeb587f0 for node node2/0 (2 total)
> Aug 31 10:59:31 [30616] node1 crmd: info: peer_update_callback: node2 is now in unknown state
> Aug 31 10:59:31 [30616] node1 crmd: info: crm_get_peer: Node 0 has uuid node2
> Aug 31 10:59:31 [30616] node1 crmd: notice: crm_update_peer_state_iter: crmd_peer_down: Node node2[0] - state is now lost (was (null))
> Aug 31 10:59:31 [30616] node1 crmd: info: peer_update_callback: node2 is now lost (was in unknown state)
> Aug 31 10:59:31 [30616] node1 crmd: info: crm_update_peer_proc: crmd_peer_down: Node node2[0] - all processes are now offline
> Aug 31 10:59:31 [30616] node1 crmd: info: peer_update_callback: Client node2/peer now has status [offline] (DC=true, changed=1)
> Aug 31 10:59:31 [30616] node1 crmd: info: crm_update_peer_expected: crmd_peer_down: Node node2[0] - expected state is now down (was (null))
> Aug 31 10:59:31 [30616] node1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='node2']/lrm
> Aug 31 10:59:31 [30616] node1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='node2']/transient_attributes
> Aug 31 10:59:31 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=2d7857e7-7e88-482a-812f-b343218974dc) by client crmd.30616
> 
> After some other entries I see:
> 
> Aug 31 10:59:37 [30615] node1 pengine: warning: pe_fence_node: Node node2 will be fenced because the peer has not been seen by the cluster
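As a side note, stonith-ng's record of these fencing results can also be queried after the fact with stonith_admin. A sketch, using the node name from the logs (requires a running cluster, so output will vary):

```shell
# Ask stonith-ng for the last recorded fencing operation against node2
# (pass '*' instead of a node name to show history for all nodes)
stonith_admin --history node2 --verbose
```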
Ah, this rings a bell. Despite having fenced the node, the cluster
still considers the node unseen. That was a regression in 1.1.14 that
was fixed in 1.1.15. :-(

> Aug 31 10:59:37 [30615] node1 pengine: warning: determine_online_status: Node node2 is unclean
> 
> The server takes approximately 2 minutes to reboot, so it's normal
> that it hasn't been seen after just 6 seconds. But I don't know why
> the server is rebooted three times:
> 
> Aug 31 10:59:31 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=2d7857e7-7e88-482a-812f-b343218974dc) by client crmd.30616
> ....
> Aug 31 10:59:53 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=2835cb08-362d-4d39-9133-3a7dcefb913c) by client crmd.30616
> ....
> Aug 31 11:00:05 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=17931f5b-76ea-4e3a-a792-535cea50afca) by client crmd.30616
> 
> After the last message, I only see that it stops fencing and starts
> resources:
> 
> Aug 31 11:00:05 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=17931f5b-76ea-4e3a-a792-535cea50afca) by client crmd.30616
> Aug 31 11:00:05 [30611] node1 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/68, version=0.60.30)
> Aug 31 11:00:05 [30616] node1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='node2']/lrm
> Aug 31 11:00:05 [30616] node1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='node2']/transient_attributes
> Aug 31 11:00:05 [30616] node1 crmd: info: cib_fencing_updated: Fencing update 68 for node2: complete
> Aug 31 11:00:05 [30616] node1 crmd: notice: te_rsc_command: Initiating action 66: start p_fs_database_start_0 on node1 (local)
> Aug 31 11:00:05 [30616] node1 crmd: info: do_lrm_rsc_op: Performing key=66:2:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7 op=p_fs_database_start_0
> Aug 31 11:00:05 [30613] node1 lrmd: info: log_execute: executing - rsc:p_fs_database action:start call_id:70
> Aug 31 11:00:05 [30616] node1 crmd: notice: te_rsc_command: Initiating action 77: start p_fs_datosweb_start_0 on node1 (local)
> Aug 31 11:00:05 [30616] node1 crmd: info: do_lrm_rsc_op: Performing key=77:2:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7 op=p_fs_datosweb_start_0
> Aug 31 11:00:05 [30613] node1 lrmd: info: log_execute: executing - rsc:p_fs_datosweb action:start call_id:71
> Aug 31 11:00:05 [30611] node1 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='node2']/lrm: OK (rc=0, origin=local/crmd/69, version=0.60.30)
> Aug 31 11:00:05 [30611] node1 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='node2']/transient_attributes: OK (rc=0, origin=local/crmd/70, version=0.60.30)
> Aug 31 11:00:05 [30611] node1 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/71, version=0.60.30)
> Aug 31 11:00:05 [30616] node1 crmd: info: cib_fencing_updated: Fencing update 71 for node2: complete
> Aug 31 11:00:05 [30611] node1 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='node2']/lrm: OK (rc=0, origin=local/crmd/72, version=0.60.30)
> Aug 31 11:00:05 [30611] node1 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='node2']/transient_attributes: OK (rc=0, origin=local/crmd/73, version=0.60.30)
> Filesystem(p_fs_datosweb)[962]: 2018/08/31_11:00:05 INFO: Running start for /dev/drbd/by-res/datoswebstorage on /mnt/datosweb
> Filesystem(p_fs_database)[961]: 2018/08/31_11:00:05 INFO: Running start for /dev/drbd/by-res/databasestorage on /mnt/database
> 
> ......
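Incidentally, the cib_delete operations in that log are the DC erasing node2's status section after fencing; the same cleanup can be done by hand, and checking the installed version will tell you whether you have the 1.1.15 fix. A sketch (commands assume a live cluster; the xpaths are copied from your logs):

```shell
# Print the installed Pacemaker version, to check for the 1.1.15 fix
pacemakerd --version

# Manual equivalent of the cib_delete operations in the log:
# clear node2's resource history and transient attributes from the CIB
cibadmin --delete --xpath "//node_state[@uname='node2']/lrm"
cibadmin --delete --xpath "//node_state[@uname='node2']/transient_attributes"
```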
> 
> Can you give me more clues?
> 
> Thanks a lot
> Cesar
> 
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-- 
Ken Gaillot <kgail...@redhat.com>