Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE
On 28/11/2018 08:34, Jan Friesse wrote:
> Anyway, the problem is solved, and if it appears again, please try to
> check that corosync.conf is equal on all nodes.

I'd propose (if devel wizards read here) that some checks should be implemented in pcs to account for ruby variants/versions compatibility when pcs does its magic.

many thanks, L.
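Until pcs grows such a check, a minimal manual sketch of the verification Honza suggests (assuming root SSH access between the nodes; the hostnames are the ones shown in `pcs status` earlier in the thread):

    # Compare corosync.conf checksums and ruby versions across all nodes.
    # Any mismatch in the checksum column points at a distribution problem.
    for h in rider.private rental.private whale.private; do
        echo "== $h =="
        ssh "$h" 'md5sum /etc/corosync/corosync.conf; ruby --version'
    done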
Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE
lejeczek napsal(a):
> On 23/11/2018 16:36, Jan Friesse wrote:
>> [...]
>> Probably not, because corosync itself does not have any dependency on
>> ruby.
>
> It might have been the root cause. I do not want to jinx it, but I
> removed and re-added two nodes which, it seems, were in some kind of
> conflict (now with the same ruby on all), and it seems the problem is
> gone, so far.

Ok, so maybe pcsd was unable to distribute corosync.conf to all nodes correctly, resulting in weird behavior of corosync.

> But 'pcsd' does use ruby, no?

Yes, pcsd is using ruby. But corosync itself is not using pcsd.

Anyway, the problem is solved, and if it appears again, please try to check that corosync.conf is equal on all nodes.

Regards,
  Honza

> [...]
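If pcsd really did leave the nodes with diverging configs, pcs can push the local file out again; a sketch, assuming pcs 0.9.x as shipped with CentOS 7 (verify the subcommands against your pcs version):

    # Run on a node whose corosync.conf is known-good:
    pcs cluster sync             # distribute corosync.conf to all cluster nodes
    pcs cluster reload corosync  # ask corosync to re-read the config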
Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE
On 23/11/2018 16:36, Jan Friesse wrote:
> lejeczek,
>> [...]
>> One thing I remember - could it be because, at the time of cluster
>> formation (and for some time after), one of the nodes had a different
>> ruby version from what the other nodes had?
>
> Probably not, because corosync itself does not have any dependency on
> ruby.

It might have been the root cause. I do not want to jinx it, but I removed and re-added two nodes which, it seems, were in some kind of conflict (now with the same ruby on all), and it seems the problem is gone, so far.

But 'pcsd' does use ruby, no?

> [...]
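For the record, the remove-and-re-add cycle described above goes roughly like this with pcs 0.9.x (the node name is illustrative; check `pcs cluster node --help` on your version first):

    # On a surviving node:
    pcs cluster node remove rider.private   # drop the suspect node
    # ... fix ruby/pcsd on rider.private, then:
    pcs cluster auth rider.private          # re-authenticate pcsd
    pcs cluster node add rider.private      # re-add it to corosync.conf everywhere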
Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE
lejeczek,

> On 15/10/2018 07:24, Jan Friesse wrote:
>> [...]
>
> One thing I remember - could it be because, at the time of cluster
> formation (and for some time after), one of the nodes had a different
> ruby version from what the other nodes had?

Probably not, because corosync itself does not have any dependency on ruby.

> I cannot remember when this problem started to appear - was it from the
> beginning or later, I cannot say. I'm on CentOS 7.6. I do not think I
> use UDP (other than the creation of some resources and constraints it's
> a "vanilla" cluster). I use a

That's why I've asked for config files ;)

> "non-default" MTU on the ifaces the cluster uses, and also, those
> interfaces are net-team devices. But still.. why would it always be
> that one node (all

So it's probably really the MTU; please try changing the netmtu option in corosync.conf.

> are virtually identical)

Evil is usually hidden in detail, so "virtually identical" may mean it's not identical enough.

> many thanks, L.

Np, but I'm not sure if the hints were useful for you or not.

Regards,
  Honza
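For reference, netmtu lives in the totem section of corosync.conf; a sketch of the change Honza suggests (values illustrative - 1500 is corosync's default and a safe first test even on jumbo-frame interfaces):

    totem {
        version: 2
        cluster_name: CC
        # Force corosync's packets below the path MTU; if the flapping
        # stops at 1500, the jumbo-frame path is the culprit.
        netmtu: 1500
        ...
    }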
Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE
On 15/10/2018 07:24, Jan Friesse wrote:
> lejeczek,
>> [...]
>
> A little more information would be helpful (corosync version, used
> protocol - udpu/udp, corosync.conf, ...), but a few possible problems:
> - If UDP (multicast) is used, try UDPU
> - Check the firewall
> - Try reducing the MTU used by corosync (option netmtu in corosync.conf)
>
> Regards,
>   Honza

One thing I remember - could it be because, at the time of cluster formation (and for some time after), one of the nodes had a different ruby version from what the other nodes had?

I cannot remember when this problem started to appear - was it from the beginning or later, I cannot say. I'm on CentOS 7.6. I do not think I use UDP (other than the creation of some resources and constraints it's a "vanilla" cluster). I use a "non-default" MTU on the ifaces the cluster uses, and also, those interfaces are net-team devices. But still.. why would it always be that one node (all are virtually identical)?

many thanks, L.
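A quick way to prove whether the non-default MTU actually survives the path between the nodes: `-M do` forbids fragmentation, so oversized pings fail loudly. Sizes are illustrative for a 9000-byte interface MTU (9000 minus 28 bytes of IP+ICMP headers):

    # From the flapping node towards a healthy one:
    ping -c 3 -M do -s 8972 rental.private   # must succeed if MTU 9000 holds end-to-end
    ping -c 3 -M do -s 1472 rental.private   # baseline for a standard 1500 MTU
    ip link show | grep -i mtu               # confirm the team device's own MTU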
Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE
lejeczek,

> hi guys,
> I have a 3-node cluster (CentOS 7.5); 2 nodes seem fine but the third
> (or probably something else in between) is not right. I see this:
>
> $ pcs status --all
> [...]
> Online: [ rental.private whale.private ]
> OFFLINE: [ rider.private ]
>
> and that third node logs:
>
> [TOTEM ] FAILED TO RECEIVE
> [...]
>
> and it just keeps going like that. Sometimes a reboot (or stop of
> services + wait + start) of that third node would help. But I get this
> situation almost every time a node gets (orderly) shut down or
> rebooted. Network-wise, connectivity seems okay. Where to start?

A little more information would be helpful (corosync version, used protocol - udpu/udp, corosync.conf, ...), but a few possible problems:
- If UDP (multicast) is used, try UDPU
- Check the firewall
- Try reducing the MTU used by corosync (option netmtu in corosync.conf)

Regards,
  Honza

> many thanks, L.
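On the first two points: the transport is also a totem option (a pcs-built CentOS 7 cluster already carries a nodelist, so switching is just this one key - but verify against your own corosync.conf):

    totem {
        version: 2
        cluster_name: CC
        transport: udpu   # unicast UDP; sidesteps multicast/IGMP switch issues
        ...
    }

For the firewall check, on RHEL/CentOS derivatives `firewall-cmd --permanent --add-service=high-availability && firewall-cmd --reload` opens the corosync and pcsd ports via firewalld's bundled service definition.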
Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE
On Fri, 2018-10-12 at 15:51 +0100, lejeczek wrote:
> hi guys,
> I have a 3-node cluster (CentOS 7.5); 2 nodes seem fine but the third
> (or probably something else in between) is not right. I see this:
>
> $ pcs status --all
> Cluster name: CC
> Stack: corosync
> Current DC: whale.private (version 1.1.18-11.el7_5.3-2b07d5c5a9) -
> partition with quorum
> Last updated: Fri Oct 12 15:40:39 2018
> Last change: Fri Oct 12 15:14:57 2018 by root via crm_resource on
> whale.private
>
> 3 nodes configured
> 8 resources configured (1 DISABLED)
>
> Online: [ rental.private whale.private ]
> OFFLINE: [ rider.private ]
>
> and that third node logs:
>
> [TOTEM ] FAILED TO RECEIVE
> [TOTEM ] A new membership (10.5.6.100:2504344) was formed. Members
> left: 2 4
> [TOTEM ] Failed to receive the leave message. failed: 2 4
> [QUORUM] Members[1]: 1
> [MAIN  ] Completed service synchronization, ready to provide service.
> [TOTEM ] A new membership (10.5.6.49:2504348) was formed. Members
> joined: 2 4
> [TOTEM ] FAILED TO RECEIVE
>
> and it just keeps going like that.
> Sometimes a reboot (or stop of services + wait + start) of that third
> node would help.
> But I get this situation almost every time a node gets (orderly) shut
> down or rebooted.
> Network-wise, connectivity seems okay. Where to start?
>
> many thanks, L

Odd. I'd recommend turning on debug logging in corosync.conf, and posting the log here. Hopefully one of the corosync developers can chime in at that point.
-- 
Ken Gaillot
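A sketch of what that looks like - these are standard corosync 2.x logging options, though double-check the logfile path against your own layout:

    logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        to_syslog: yes
        timestamp: on
        debug: on    # turn back off once captured; debug output is noisy
    }

After editing, restart corosync on the affected node and reproduce the membership flap before collecting the log.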
[ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE
hi guys,

I have a 3-node cluster (CentOS 7.5); 2 nodes seem fine but the third (or probably something else in between) is not right. I see this:

$ pcs status --all
Cluster name: CC
Stack: corosync
Current DC: whale.private (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Fri Oct 12 15:40:39 2018
Last change: Fri Oct 12 15:14:57 2018 by root via crm_resource on whale.private

3 nodes configured
8 resources configured (1 DISABLED)

Online: [ rental.private whale.private ]
OFFLINE: [ rider.private ]

and that third node logs:

[TOTEM ] FAILED TO RECEIVE
[TOTEM ] A new membership (10.5.6.100:2504344) was formed. Members left: 2 4
[TOTEM ] Failed to receive the leave message. failed: 2 4
[QUORUM] Members[1]: 1
[MAIN  ] Completed service synchronization, ready to provide service.
[TOTEM ] A new membership (10.5.6.49:2504348) was formed. Members joined: 2 4
[TOTEM ] FAILED TO RECEIVE

and it just keeps going like that. Sometimes a reboot (or stop of services + wait + start) of that third node would help. But I get this situation almost every time a node gets (orderly) shut down or rebooted. Network-wise, connectivity seems okay. Where to start?

many thanks, L
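As a first step, before the suggestions earlier in the thread, the ring and membership state can be inspected directly on the flapping node; both tools ship with corosync itself:

    corosync-cfgtool -s          # ring status: should report "no faults" per ring
    corosync-quorumtool -s       # quorum state and the member list this node sees
    journalctl -u corosync -f    # watch the membership churn live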