On 03/03/17 12:59, Ulrich Windl wrote:
> Hello!
> 
> After update and reboot of the 2nd of three nodes (SLES11 SP4) I see a 
> "cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error retrying" message 
> when I expected the node to join the cluster. What can be the reasons for 
> this?
> In fact this seems to have killed cluster communication, because I saw that 
> "DLM start" timed out. The other nodes were unable to use DLM during that 
> time (while the node could not join).
> 
> I saw that corosync starts before the firewall in SLES11 SP4; maybe that's a 
> problem.
> 

Could be. It sounds like something hasn't started properly, and that's
usually caused by either the network being down or ports being
unavailable. This can cause corosync to not know its local node name (or
to find one that matches what's in the config file), or DLM to fail to
start.
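
If the firewall turns out to be the issue, it's worth making sure the
corosync ports are open before corosync starts. A rough sketch for
SuSEfirewall2, assuming the default mcastport of 5405 (corosync also
receives on mcastport - 1; check the mcastport value in
/etc/corosync/corosync.conf before copying this):

    # /etc/sysconfig/SuSEfirewall2
    FW_SERVICES_EXT_UDP="5404 5405"

    # then reload the firewall
    rcSuSEfirewall2 restart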

> I tried an "rcopenais stop" on the problem node, which in turn caused a node 
> fence (the DLM stop timed out, too), and then the other nodes were able to 
> communicate again. During boot the problem node was able to join the cluster 
> as before. In the meantime I had also updated the third node without a 
> problem, so it looks like a rare race condition to me.
> Any insights?
> 
> Could the problem be related to one of these messages?
> crmd[3656]:   notice: get_node_name: Could not obtain a node name for classic 
> openais (with plugin) nodeid 739512321
> corosync[3646]:  [pcmk  ] info: update_member: 0x64bc90 Node 739512325 
> ((null)) born on: 3352
> stonith-ng[3652]:   notice: get_node_name: Could not obtain a node name for 
> classic openais (with plugin) nodeid 739512321
> crmd[3656]:   notice: get_node_name: Could not obtain a node name for classic 
> openais (with plugin) nodeid 739512330
> cib[3651]:   notice: get_node_name: Could not obtain a node name for classic 
> openais (with plugin) nodeid 739512321
> cib[3651]:   notice: crm_update_peer_state: plugin_handle_membership: Node 
> (null)[739512321] - state is now member (was (null))
> 
> crmd:     info: crm_get_peer:     Created entry 
> 8a7d6859-5ab1-404b-95a0-ba28064763fb/0x7a81f0 for node (null)/739512321 (2 
> total)
> crmd:     info: crm_get_peer:     Cannot obtain a UUID for node 
> 739512321/(null)
> crmd:     info: crm_update_peer:  plugin_handle_membership: Node (null): 
> id=739512321 state=member addr=r(0) ip(172.20.16.1) r(1) ip(10.2.2.1)  (new) 
> votes=0 born=0 seen=3352 proc=00000000000000000000000000000000
> 


Those messages are all effect rather than cause, so it's hard to say.
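
For what it's worth, you can ask pacemaker which names it currently has
for each node id; crm_node ships with pacemaker, so something like this
should work (purely a diagnostic sketch):

    # List the node ids and names pacemaker currently knows about;
    # an id with no name attached matches the (null) entries above.
    crm_node -l

    # Print the local node's name as the cluster layer resolves it
    crm_node -n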

If the cluster starts up when you attempt it manually after the system
has booted, then it's probably a startup race with something.
NetworkManager is often a culprit here, though I don't know SLES.
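
To rule out an init-script ordering race on SLES11 (sysvinit/insserv),
you could compare what the openais script declares as dependencies with
the actual start order, something like the following (script names from
memory, adjust to whatever is in /etc/init.d on your box):

    # What does the openais init script require before it starts?
    grep -A3 'Required-Start' /etc/init.d/openais

    # Actual start order in runlevel 3: lower Snn numbers start first
    ls /etc/init.d/rc3.d/ | egrep 'network|SuSEfirewall2|openais'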


Chrissie

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
