> Everything works fine until I try to add a new node. As soon as I do
> that, the whole GUI breaks (KVM stays up, luckily) and "all hell
> breaks loose," as the saying goes.
>
> So, we have ruled out network card issues, since the problem occurs
> with different network cards. We have ruled out the switches, because
> all switches were working prior to this situation AND we have also
> tried a 10G->1G GBIC module to connect the new node to the 10G switch.
> We have now ruled out the Fujitsu hardware entirely, because an HP
> machine also breaks the cluster.
>
> IGMP snooping is disabled, and multicast works in both directions,
> tested with ssmping.
>
> *clustat* shows that all nodes are online.
>
> *pvecm nodes* shows that everything is OK. All nodes have a "join"
> time and "M" in the Sts column. "Inc" differs, though.
>
> *tcpdump* shows:
>
> 12:15:57.535798 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
> 12:15:57.535831 IP6 101:80a:30b:6e28:cd3:1d7f:2f00:0 > ff02::1: HBH ICMP6, multicast listener query max resp delay: 1000 addr: ::, length 24
> 12:15:57.540356 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
> 12:15:57.540384 IP6 101:80a:21ee:154d:100:: > ff02::1: HBH ICMP6, multicast listener query max resp delay: 1000 addr: ::, length 24
> 12:15:57.580874 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
> 12:15:57.580903 IP6 10::40:918f:a47f:0 > ff02::1: HBH ICMP6, multicast listener query max resp delay: 1000 addr: ::, length 24
> 12:15:58.349706 IP valitseja.5404 > harija1.5405: UDP, length 107
> 12:15:58.349783 IP harija1.5404 > ve-1.5405: UDP, length 617
> 12:16:10.980002 ARP, Reply ve-1 is-at 90:e2:ba:3a:6e:d0 (oui Unknown), length 42
>
> Output from the log files:
>
> Apr 09 11:25:26 corosync [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 13 14 15
> Apr 9 11:30:27 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2960
> Apr 9 11:30:28 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2970
> Apr 9 11:30:29 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2980
> Apr 9 11:30:30 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2990
> Apr 9 11:30:31 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3000
> Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3010
> Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> Apr 9 11:30:33 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3020
> Apr 9 11:30:34 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3030
> Apr 9 11:30:35 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3040
> Apr 9 11:30:36 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3050
>
> I have read that Proxmox tests with 16 working nodes, but there is
> information that some people run it with more than 16. So I should
> still have plenty of headroom? Of course we have had nodes which are
> no longer in the cluster (deleted), but I assume they don't count. :)
>
> Any ideas where to look next?
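One note on the multicast test: ssmping only exercises source-specific
multicast (SSM), whereas corosync's totem protocol uses plain any-source
multicast. It may be worth re-testing with asmping, or better, with
omping run on all nodes at the same time. A rough sketch (the host names
below are placeholders for your existing nodes and the new one):

# run this on every node simultaneously; ~10 minutes at 1s interval
# omping -c 600 -i 1 -q existing-node1 existing-node2 new-node

If multicast packets start getting lost after roughly five minutes, the
usual suspect is IGMP snooping without an active querier somewhere along
the path - worth double-checking every switch between the nodes, not
just the one the new node hangs off.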
Also, does it help if you restart the pve-cluster service on those nodes?

# service pve-cluster restart
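If the restart helps, you can watch the node rejoin with something like:

# pvecm status
# grep -E 'cpg_join|cpg_send_message' /var/log/syslog

The cpg_join retry counter should stop climbing once pmxcfs has joined
its CPG group. If I read corosync's cs_error_t right, the
"cpg_send_message failed: 9" lines are CS_ERR_BAD_HANDLE, i.e. pmxcfs
has lost its connection to corosync, which a pve-cluster restart usually
clears. For clusters heading toward 16 nodes and beyond it may also help
to raise the totem token timeout in /etc/pve/cluster.conf, e.g.
<totem token="54000"/> - that value is only a guess, tune it to your
node count and network (and remember to bump config_version when you
edit cluster.conf).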
