Hi

> 
> The first fencing is legitimate -- the node hasn't been seen at start-
> up, and so needs to be fenced. The second fencing will be the one of
> interest. Also, look for the result of the first fencing.

The first fencing finished OK, as did the other two fencing
operations.

Aug 31 10:59:31 [30612] node1 stonith-ng:   notice: log_operation:     
Operation 'reboot' [31075] (call 2 from crmd.30616) for host 'node2' with 
device 'st-fence_propio:0' returned: 0 (OK)
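
If it helps, I believe the same results can also be queried from the
fencing history (I'm assuming here that my stonith_admin supports the
--history option):

# show the last fencing operations recorded for node2
stonith_admin --history node2 --verbose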

And the next log entries are:

Aug 31 10:59:31 [30612] node1 stonith-ng:   notice: remote_op_done:    
Operation reboot of node2 by node1 for crmd.30616@node1.2d7857e7: OK
Aug 31 10:59:31 [30616] node1       crmd:   notice: tengine_stonith_callback:  
Stonith operation 2/81:0:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7: OK (0)
Aug 31 10:59:31 [30616] node1       crmd:     info: crm_get_peer:      Created 
entry 48db3347-5bbe-4cd4-b7ba-db4c697c3146/0x55adbeb587f0 for node node2/0 (2 
total)
Aug 31 10:59:31 [30616] node1       crmd:     info: peer_update_callback:      
node2 is now in unknown state
Aug 31 10:59:31 [30616] node1       crmd:     info: crm_get_peer:      Node 0 
has uuid node2
Aug 31 10:59:31 [30616] node1       crmd:   notice: crm_update_peer_state_iter: 
       crmd_peer_down: Node node2[0] - state is now lost (was (null))
Aug 31 10:59:31 [30616] node1       crmd:     info: peer_update_callback:      
node2 is now lost (was in unknown state)
Aug 31 10:59:31 [30616] node1       crmd:     info: crm_update_peer_proc:      
crmd_peer_down: Node node2[0] - all processes are now offline
Aug 31 10:59:31 [30616] node1       crmd:     info: peer_update_callback:      
Client node2/peer now has status [offline] (DC=true, changed=     1)
Aug 31 10:59:31 [30616] node1       crmd:     info: crm_update_peer_expected:  
crmd_peer_down: Node node2[0] - expected state is now down (was (null))
Aug 31 10:59:31 [30616] node1       crmd:     info: erase_status_tag:  Deleting 
xpath: //node_state[@uname='node2']/lrm
Aug 31 10:59:31 [30616] node1       crmd:     info: erase_status_tag:  Deleting 
xpath: //node_state[@uname='node2']/transient_attributes
Aug 31 10:59:31 [30616] node1       crmd:   notice: tengine_stonith_notify:    
Peer node2 was terminated (reboot) by node1 for node1: OK 
(ref=2d7857e7-7e88-482a-812f-b343218974dc) by client crmd.30616

After some other entries I see:

Aug 31 10:59:37 [30615] node1    pengine:  warning: pe_fence_node:     Node 
node2 will be fenced because the peer has not been seen by the cluster
Aug 31 10:59:37 [30615] node1    pengine:  warning: determine_online_status:   
Node node2 is unclean
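
If I understand correctly, these decisions can be replayed offline with
crm_simulate against the transition file that pengine saved (the file name
below is just an example; the real path shows up in the pengine log lines):

# replay a saved pengine input; the file name is illustrative only
crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-warn-0.bz2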

The server takes approximately 2 minutes to reboot, so it's normal that it 
hasn't been seen after just 6 seconds. But I don't know why the server is 
rebooted three times:

Aug 31 10:59:31 [30616] node1       crmd:   notice: tengine_stonith_notify:     
Peer node2 was terminated (reboot) by node1 for node1: OK 
(ref=2d7857e7-7e88-482a-812f-b343218974dc) by client crmd.30616
....
Aug 31 10:59:53 [30616] node1       crmd:   notice: tengine_stonith_notify:     
Peer node2 was terminated (reboot) by node1 for node1: OK 
(ref=2835cb08-362d-4d39-9133-3a7dcefb913c) by client crmd.30616
....
Aug 31 11:00:05 [30616] node1       crmd:   notice: tengine_stonith_notify:     
Peer node2 was terminated (reboot) by node1 for node1: OK 
(ref=17931f5b-76ea-4e3a-a792-535cea50afca) by client crmd.30616
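
One guess: if the fence device returns OK before node2 is really powered
off, the cluster would still not have seen the node and could schedule
another reboot. Standard fence agents have a power_wait option for that; I
don't know whether st-fence_propio accepts anything similar, so the command
below is only hypothetical (and assumes a pcs-managed cluster):

# hypothetical: only valid if the agent supports a power_wait-style option
pcs stonith update st-fence_propio power_wait=120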


After the last message, all I see is that it stops fencing and starts resources:


Aug 31 11:00:05 [30616] node1       crmd:   notice: tengine_stonith_notify:    
Peer node2 was terminated (reboot) by node1 for node1: OK 
(ref=17931f5b-76ea-4e3a-a792-535cea50afca) by client crmd.30616
Aug 31 11:00:05 [30611] node1        cib:     info: cib_process_request:       
Completed cib_modify operation for section status: OK (rc=0, 
origin=local/crmd/68, version=0.60.30)
Aug 31 11:00:05 [30616] node1       crmd:     info: erase_status_tag:  Deleting 
xpath: //node_state[@uname='node2']/lrm
Aug 31 11:00:05 [30616] node1       crmd:     info: erase_status_tag:  Deleting 
xpath: //node_state[@uname='node2']/transient_attributes
Aug 31 11:00:05 [30616] node1       crmd:     info: cib_fencing_updated:       
Fencing update 68 for node2: complete
Aug 31 11:00:05 [30616] node1       crmd:   notice: te_rsc_command:    
Initiating action 66: start p_fs_database_start_0 on node1 (local)
Aug 31 11:00:05 [30616] node1       crmd:     info: do_lrm_rsc_op:     
Performing key=66:2:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7 
op=p_fs_database_start_0
Aug 31 11:00:05 [30613] node1       lrmd:     info: log_execute:       
executing - rsc:p_fs_database action:start call_id:70
Aug 31 11:00:05 [30616] node1       crmd:   notice: te_rsc_command:    
Initiating action 77: start p_fs_datosweb_start_0 on node1 (local)
Aug 31 11:00:05 [30616] node1       crmd:     info: do_lrm_rsc_op:     
Performing key=77:2:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7 
op=p_fs_datosweb_start_0
Aug 31 11:00:05 [30613] node1       lrmd:     info: log_execute:       
executing - rsc:p_fs_datosweb action:start call_id:71
Aug 31 11:00:05 [30611] node1        cib:     info: cib_process_request:       
Completed cib_delete operation for section //node_state[@uname='node2']/lrm: OK 
(rc=0, origin=local/crmd/69, version=0.60.30)
Aug 31 11:00:05 [30611] node1        cib:     info: cib_process_request:       
Completed cib_delete operation for section 
//node_state[@uname='node2']/transient_attributes: OK (rc=0, 
origin=local/crmd/70, version=0.60.30)
Aug 31 11:00:05 [30611] node1        cib:     info: cib_process_request:       
Completed cib_modify operation for section status: OK (rc=0, 
origin=local/crmd/71, version=0.60.30)
Aug 31 11:00:05 [30616] node1       crmd:     info: cib_fencing_updated:       
Fencing update 71 for node2: complete
Aug 31 11:00:05 [30611] node1        cib:     info: cib_process_request:       
Completed cib_delete operation for section //node_state[@uname='node2']/lrm: OK 
(rc=0, origin=local/crmd/72, version=0.60.30)
Aug 31 11:00:05 [30611] node1        cib:     info: cib_process_request:       
Completed cib_delete operation for section 
//node_state[@uname='node2']/transient_attributes: OK (rc=0, 
origin=local/crmd/73, version=0.60.30)
Filesystem(p_fs_datosweb)[962]: 2018/08/31_11:00:05 INFO: Running start for 
/dev/drbd/by-res/datoswebstorage on /mnt/datosweb
Filesystem(p_fs_database)[961]: 2018/08/31_11:00:05 INFO: Running start for 
/dev/drbd/by-res/databasestorage on /mnt/database


......


Can you give me more clues?

Thanks a lot

Cesar
