On 25/09/15 00:09, Digimer wrote:
> I had a RHEL 6.7, cman + rgmanager cluster that I've built many times
> before. Oddly, I just hit this error:
>
> ====
> [root@node2 ~]# /etc/init.d/clvmd start
> Starting clvmd: clvmd could not connect to cluster manager
> Consult syslog for more information
> ====
>
> syslog:
> ====
> Sep 24 23:00:30 node2 kernel: dlm: Using SCTP for communications
> Sep 24 23:00:30 node2 clvmd: Unable to create DLM lockspace for CLVM:
> Address already in use
> Sep 24 23:00:30 node2 kernel: dlm: Can't bind to port 21064 addr number 1
This seems to be the key to it. I can't imagine what else would be using
port 21064 (apart from DLM using TCP as well as SCTP, but I don't think
that's possible!). Have a look in netstat and see what else is using that
port; a couple of example commands are at the bottom of this mail. It
could also be that the socket was recently in use and is taking a while
to shut down, in which case it might go away on its own.

Chrissie

> Sep 24 23:00:30 node2 kernel: dlm: cannot start dlm lowcomms -98
> ====
>
> There are no iptables rules:
>
> ====
> [root@node2 ~]# iptables-save
> ====
>
> And there are no DLM lockspaces, either:
>
> ====
> [root@node2 ~]# dlm_tool ls
> [root@node2 ~]#
> ====
>
> I tried withdrawing the node from the cluster entirely, then started cman
> alone and tried to start clvmd, same issue.
>
> Pinging between the two nodes seems OK:
>
> ====
> [root@node1 ~]# uname -n
> node1.ccrs.bcn
> [root@node1 ~]# ping -c 2 node1.ccrs.bcn
> PING node1.bcn (10.20.10.1) 56(84) bytes of data.
> 64 bytes from node1.bcn (10.20.10.1): icmp_seq=1 ttl=64 time=0.015 ms
> 64 bytes from node1.bcn (10.20.10.1): icmp_seq=2 ttl=64 time=0.017 ms
>
> --- node1.bcn ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> rtt min/avg/max/mdev = 0.015/0.016/0.017/0.001 ms
> ====
> [root@node2 ~]# uname -n
> node2.ccrs.bcn
> [root@node2 ~]# ping -c 2 node1.ccrs.bcn
> PING node1.bcn (10.20.10.1) 56(84) bytes of data.
> 64 bytes from node1.bcn (10.20.10.1): icmp_seq=1 ttl=64 time=0.079 ms
> 64 bytes from node1.bcn (10.20.10.1): icmp_seq=2 ttl=64 time=0.076 ms
>
> --- node1.bcn ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
> rtt min/avg/max/mdev = 0.076/0.077/0.079/0.008 ms
> ====
>
> I have RRP configured and pings work on the second network, too:
>
> ====
> [root@node1 ~]# corosync-objctl |grep ring -A 5
> totem.interface.ringnumber=0
> totem.interface.bindnetaddr=10.20.10.1
> totem.interface.mcastaddr=239.192.100.163
> totem.interface.mcastport=5405
> totem.interface.member.memberaddr=node1.ccrs.bcn
> totem.interface.member.memberaddr=node2.ccrs.bcn
> totem.interface.ringnumber=1
> totem.interface.bindnetaddr=10.10.10.1
> totem.interface.mcastaddr=239.192.100.164
> totem.interface.mcastport=5405
> totem.interface.member.memberaddr=node1.sn
> totem.interface.member.memberaddr=node2.sn
>
> [root@node1 ~]# ping -c 2 node2.sn
> PING node2.sn (10.10.10.2) 56(84) bytes of data.
> 64 bytes from node2.sn (10.10.10.2): icmp_seq=1 ttl=64 time=0.111 ms
> 64 bytes from node2.sn (10.10.10.2): icmp_seq=2 ttl=64 time=0.120 ms
>
> --- node2.sn ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
> rtt min/avg/max/mdev = 0.111/0.115/0.120/0.011 ms
> ====
> [root@node2 ~]# ping -c 2 node1.sn
> PING node1.sn (10.10.10.1) 56(84) bytes of data.
> 64 bytes from node1.sn (10.10.10.1): icmp_seq=1 ttl=64 time=0.079 ms
> 64 bytes from node1.sn (10.10.10.1): icmp_seq=2 ttl=64 time=0.171 ms
>
> --- node1.sn ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> rtt min/avg/max/mdev = 0.079/0.125/0.171/0.046 ms
> ====
>
> Here is the cluster.conf:
>
> ====
> [root@node1 ~]# cat /etc/cluster/cluster.conf
> <?xml version="1.0"?>
> <cluster name="ccrs" config_version="1">
>   <cman expected_votes="1" two_node="1" transport="udpu" />
>   <clusternodes>
>     <clusternode name="node1.ccrs.bcn" nodeid="1">
>       <altname name="node1.sn" />
>       <fence>
>         <method name="ipmi">
>           <device name="ipmi_n01" ipaddr="10.250.199.15" login="admin"
>                   passwd="secret" delay="15" action="reboot" />
>         </method>
>         <method name="pdu">
>           <device name="pdu01" port="1" action="reboot" />
>           <device name="pdu02" port="1" action="reboot" />
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="node2.ccrs.bcn" nodeid="2">
>       <altname name="node2.sn" />
>       <fence>
>         <method name="ipmi">
>           <device name="ipmi_n02" ipaddr="10.250.199.17" login="admin"
>                   passwd="secret" action="reboot" />
>         </method>
>         <method name="pdu">
>           <device name="pdu01" port="2" action="reboot" />
>           <device name="pdu02" port="2" action="reboot" />
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <fencedevices>
>     <fencedevice name="ipmi_n01" agent="fence_ipmilan" />
>     <fencedevice name="ipmi_n02" agent="fence_ipmilan" />
>     <fencedevice name="pdu01" agent="fence_raritan_snmp" ipaddr="pdu1A" />
>     <fencedevice name="pdu02" agent="fence_raritan_snmp" ipaddr="pdu1B" />
>     <fencedevice name="pdu03" agent="fence_raritan_snmp" ipaddr="pdu2A" />
>     <fencedevice name="pdu04" agent="fence_raritan_snmp" ipaddr="pdu2B" />
>   </fencedevices>
>   <fence_daemon post_join_delay="30" />
>   <totem rrp_mode="passive" secauth="off"/>
>   <rm log_level="5">
>     <resources>
>       <script file="/etc/init.d/drbd" name="drbd"/>
>       <script file="/etc/init.d/wait-for-drbd" name="wait-for-drbd"/>
>       <script file="/etc/init.d/clvmd" name="clvmd"/>
>       <clusterfs device="/dev/node1_vg0/shared" force_unmount="1"
>                  fstype="gfs2" mountpoint="/shared" name="sharedfs" />
>       <script file="/etc/init.d/libvirtd" name="libvirtd"/>
>     </resources>
>     <failoverdomains>
>       <failoverdomain name="only_n01" nofailback="1" ordered="0" restricted="1">
>         <failoverdomainnode name="node1.ccrs.bcn"/>
>       </failoverdomain>
>       <failoverdomain name="only_n02" nofailback="1" ordered="0" restricted="1">
>         <failoverdomainnode name="node2.ccrs.bcn"/>
>       </failoverdomain>
>       <failoverdomain name="primary_n01" nofailback="1" ordered="1" restricted="1">
>         <failoverdomainnode name="node1.ccrs.bcn" priority="1"/>
>         <failoverdomainnode name="node2.ccrs.bcn" priority="2"/>
>       </failoverdomain>
>       <failoverdomain name="primary_n02" nofailback="1" ordered="1" restricted="1">
>         <failoverdomainnode name="node1.ccrs.bcn" priority="2"/>
>         <failoverdomainnode name="node2.ccrs.bcn" priority="1"/>
>       </failoverdomain>
>     </failoverdomains>
>     <service name="storage_n01" autostart="1" domain="only_n01" exclusive="0" recovery="restart">
>       <script ref="drbd">
>         <script ref="wait-for-drbd">
>           <script ref="clvmd">
>             <clusterfs ref="sharedfs"/>
>           </script>
>         </script>
>       </script>
>     </service>
>     <service name="storage_n02" autostart="1" domain="only_n02" exclusive="0" recovery="restart">
>       <script ref="drbd">
>         <script ref="wait-for-drbd">
>           <script ref="clvmd">
>             <clusterfs ref="sharedfs"/>
>           </script>
>         </script>
>       </script>
>     </service>
>     <service name="libvirtd_n01" autostart="1" domain="only_n01" exclusive="0" recovery="restart">
>       <script ref="libvirtd"/>
>     </service>
>     <service name="libvirtd_n02" autostart="1" domain="only_n02" exclusive="0" recovery="restart">
>       <script ref="libvirtd"/>
>     </service>
>   </rm>
> </cluster>
> ====
>
> Nothing special there at all.
>
> While writing this email though, I saw this on the other node:
>
> ====
> Sep 24 23:03:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 14e
> Sep 24 23:03:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 14e
> Sep 24 23:03:49 node1 corosync[4770]: [TOTEM ] Retransmit List: 158
> Sep 24 23:03:49 node1 corosync[4770]: [TOTEM ] Retransmit List: 15a
> Sep 24 23:03:49 node1 corosync[4770]: [TOTEM ] Retransmit List: 15a
> Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 161
> Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 161
> Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 161
> Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 161 163
> Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 163
> Sep 24 23:04:19 node1 corosync[4770]: [TOTEM ] Retransmit List: 177
> Sep 24 23:04:19 node1 corosync[4770]: [TOTEM ] Retransmit List: 177
> Sep 24 23:04:19 node1 corosync[4770]: [TOTEM ] Retransmit List: 179
> Sep 24 23:04:19 node1 corosync[4770]: [TOTEM ] Retransmit List: 179
> Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 181
> Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 181
> Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 181
> Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 183
> Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 183
> Sep 24 23:04:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 18c
> Sep 24 23:04:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 18c
> Sep 24 23:04:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 18c 18e
> Sep 24 23:04:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 18e
> Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23c
> Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23c
> Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23c
> Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23e
> Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23e
> Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 247
> Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 247
> Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 249
> Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 24b
> Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 24b
> Sep 24 23:07:40 node1 corosync[4770]: [TOTEM ] Retransmit List: 252
> Sep 24 23:07:40 node1 corosync[4770]: [TOTEM ] Retransmit List: 252
> Sep 24 23:07:40 node1 corosync[4770]: [TOTEM ] Retransmit List: 254
> Sep 24 23:07:40 node1 corosync[4770]: [TOTEM ] Retransmit List: 254
> ====
>
> Certainly *looks* like a network problem, but I can't see what's
> wrong... Any ideas?
>
> Thanks!
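
PS: for the port check, this is roughly what I'd run (assuming net-tools
and iproute are installed; the SCTP /proc files only appear once the sctp
module is loaded):

====
# TCP/UDP sockets touching the DLM port, with the owning process
netstat -anp | grep 21064

# Same check via ss; -a also lists sockets still winding down in TIME-WAIT
ss -anp | grep 21064

# SCTP endpoints and associations, since DLM is using SCTP here
cat /proc/net/sctp/eps /proc/net/sctp/assocs
====

If nothing there mentions 21064, it may simply be a socket that hasn't
finished shutting down yet, so it's worth retrying clvmd after a minute
or two.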