Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE

2018-12-03 Thread lejeczek

On 28/11/2018 08:34, Jan Friesse wrote:
Anyway, the problem is solved, and if it appears again, please check
that corosync.conf is identical on all nodes.


I'd propose (if devel wizards read here) that some checks should be
implemented in pcs to account for ruby variants/versions compatibility
when pcs does its magic.


many thanks, L.

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE

2018-11-28 Thread Jan Friesse

lejeczek wrote:

On 23/11/2018 16:36, Jan Friesse wrote:

lejeczek,


On 15/10/2018 07:24, Jan Friesse wrote:

lejeczek,


hi guys,
I have a 3-node cluster (CentOS 7.5); 2 nodes seem fine but the third (or
probably something else in between) is not right.

I see this:

  $ pcs status --all
Cluster name: CC
Stack: corosync
Current DC: whale.private (version 1.1.18-11.el7_5.3-2b07d5c5a9) - 
partition with quorum

Last updated: Fri Oct 12 15:40:39 2018
Last change: Fri Oct 12 15:14:57 2018 by root via crm_resource on 
whale.private


3 nodes configured
8 resources configured (1 DISABLED)

Online: [ rental.private whale.private ]
OFFLINE: [ rider.private ]

and that third node logs:

[TOTEM ] FAILED TO RECEIVE
  [TOTEM ] A new membership (10.5.6.100:2504344) was formed. 
Members left: 2 4

  [TOTEM ] Failed to receive the leave message. failed: 2 4
  [QUORUM] Members[1]: 1
  [MAIN  ] Completed service synchronization, ready to provide 
service.
  [TOTEM ] A new membership (10.5.6.49:2504348) was formed. Members 
joined: 2 4

  [TOTEM ] FAILED TO RECEIVE

and it just keeps going like that.
Sometimes a reboot (or stop of services + wait + start) of that third
node would help.
But I get this situation almost every time a node gets (orderly)
shut down or rebooted.

Network-wise, connectivity seems okay. Where to start?



A little more information would be helpful (corosync version, used
protocol - udpu/udp, corosync.conf, ...), but here are a few possible
problems:

- If UDP (multicast) is used, try UDPU
- Check the firewall
- Try reducing the MTU used by corosync (option netmtu in corosync.conf)

Regards,
  Honza

One thing I remember - could it be because, at the time of cluster
formation (and for some time after), one of the nodes had a different
ruby version from what the other nodes had?


Probably not, because corosync itself does not have any dependency on 
ruby.


It might have been the root cause. I do not want to jinx it, but I
removed and re-added two nodes which seemed to be in some kind of
conflict (now with the same ruby on all), and it seems the problem is
gone, so far.


OK, so maybe pcsd was unable to distribute corosync.conf to all nodes
correctly, resulting in weird behavior of corosync.



But 'pcsd' does use ruby, no?


Yes, pcsd is using ruby. But corosync itself is not using pcsd.

Anyway, the problem is solved, and if it appears again, please check
that corosync.conf is identical on all nodes.
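
For example, one quick way to compare could be to checksum the file on
every node (just a sketch - adjust the node names to your cluster):

  # compare corosync.conf across the three nodes from this thread
  for h in rental.private whale.private rider.private; do
      ssh "$h" md5sum /etc/corosync/corosync.conf
  done

If the checksums differ, pushing the file out again (pcs has a
'pcs cluster sync' command for that, if I remember correctly) and
restarting corosync on the out-of-sync node should bring it back in line.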


Regards,
  Honza








___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE

2018-11-27 Thread lejeczek

On 23/11/2018 16:36, Jan Friesse wrote:

lejeczek,


On 15/10/2018 07:24, Jan Friesse wrote:

lejeczek,


hi guys,
I have a 3-node cluster (CentOS 7.5); 2 nodes seem fine but the third (or
probably something else in between) is not right.

I see this:

  $ pcs status --all
Cluster name: CC
Stack: corosync
Current DC: whale.private (version 1.1.18-11.el7_5.3-2b07d5c5a9) - 
partition with quorum

Last updated: Fri Oct 12 15:40:39 2018
Last change: Fri Oct 12 15:14:57 2018 by root via crm_resource on 
whale.private


3 nodes configured
8 resources configured (1 DISABLED)

Online: [ rental.private whale.private ]
OFFLINE: [ rider.private ]

and that third node logs:

[TOTEM ] FAILED TO RECEIVE
  [TOTEM ] A new membership (10.5.6.100:2504344) was formed. 
Members left: 2 4

  [TOTEM ] Failed to receive the leave message. failed: 2 4
  [QUORUM] Members[1]: 1
  [MAIN  ] Completed service synchronization, ready to provide 
service.
  [TOTEM ] A new membership (10.5.6.49:2504348) was formed. Members 
joined: 2 4

  [TOTEM ] FAILED TO RECEIVE

and it just keeps going like that.
Sometimes a reboot (or stop of services + wait + start) of that third
node would help.
But I get this situation almost every time a node gets (orderly)
shut down or rebooted.

Network-wise, connectivity seems okay. Where to start?



A little more information would be helpful (corosync version, used
protocol - udpu/udp, corosync.conf, ...), but here are a few possible
problems:

- If UDP (multicast) is used, try UDPU
- Check the firewall
- Try reducing the MTU used by corosync (option netmtu in corosync.conf)

Regards,
  Honza

One thing I remember - could it be because, at the time of cluster
formation (and for some time after), one of the nodes had a different
ruby version from what the other nodes had?


Probably not, because corosync itself does not have any dependency on 
ruby.


It might have been the root cause. I do not want to jinx it, but I
removed and re-added two nodes which seemed to be in some kind of
conflict (now with the same ruby on all), and it seems the problem is
gone, so far.
But 'pcsd' does use ruby, no?















___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE

2018-11-23 Thread Jan Friesse

lejeczek,


On 15/10/2018 07:24, Jan Friesse wrote:

lejeczek,


hi guys,
I have a 3-node cluster (CentOS 7.5); 2 nodes seem fine but the third (or
probably something else in between) is not right.

I see this:

  $ pcs status --all
Cluster name: CC
Stack: corosync
Current DC: whale.private (version 1.1.18-11.el7_5.3-2b07d5c5a9) - 
partition with quorum

Last updated: Fri Oct 12 15:40:39 2018
Last change: Fri Oct 12 15:14:57 2018 by root via crm_resource on 
whale.private


3 nodes configured
8 resources configured (1 DISABLED)

Online: [ rental.private whale.private ]
OFFLINE: [ rider.private ]

and that third node logs:

[TOTEM ] FAILED TO RECEIVE
  [TOTEM ] A new membership (10.5.6.100:2504344) was formed. Members 
left: 2 4

  [TOTEM ] Failed to receive the leave message. failed: 2 4
  [QUORUM] Members[1]: 1
  [MAIN  ] Completed service synchronization, ready to provide service.
  [TOTEM ] A new membership (10.5.6.49:2504348) was formed. Members 
joined: 2 4

  [TOTEM ] FAILED TO RECEIVE

and it just keeps going like that.
Sometimes a reboot (or stop of services + wait + start) of that third
node would help.
But I get this situation almost every time a node gets (orderly)
shut down or rebooted.

Network-wise, connectivity seems okay. Where to start?



A little more information would be helpful (corosync version, used
protocol - udpu/udp, corosync.conf, ...), but here are a few possible
problems:

- If UDP (multicast) is used, try UDPU
- Check the firewall
- Try reducing the MTU used by corosync (option netmtu in corosync.conf)

Regards,
  Honza

One thing I remember - could it be because, at the time of cluster
formation (and for some time after), one of the nodes had a different
ruby version from what the other nodes had?


Probably not, because corosync itself does not have any dependency on ruby.



I cannot remember when this problem started to appear - whether it was
from the beginning or later, I cannot say.


I'm on CentOS 7.6. I do not think I use UDP (other than the creation of
some resources and constraints, it's a "vanilla" cluster). I use a


That's why I've asked for config files ;)

"non-default" MTU on the ifaces cluster uses, and also, those interfaces 
are net-team devices. But still.. why it always be that one node (all 


So it's probably really the MTU; please try changing the netmtu option
in corosync.conf.
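
If you want to see what you are working against (just a sketch -
"team0" is only an example interface name), check the MTU the team
device actually has:

  ip -o link show team0 | grep -o 'mtu [0-9]*'

and then set netmtu in the totem section of corosync.conf to something
comfortably below that value, since corosync adds its own headers on top.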



are virtually identical)


Evil is usually hidden in the details, so "virtually identical" may
mean it's not identical enough.





many thanks, L.


No problem, but I'm not sure whether the hints were useful for you or not.

Regards,
  Honza







many thanks, L






___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE

2018-11-23 Thread lejeczek

On 15/10/2018 07:24, Jan Friesse wrote:

lejeczek,


hi guys,
I have a 3-node cluster (CentOS 7.5); 2 nodes seem fine but the third (or
probably something else in between) is not right.

I see this:

  $ pcs status --all
Cluster name: CC
Stack: corosync
Current DC: whale.private (version 1.1.18-11.el7_5.3-2b07d5c5a9) - 
partition with quorum

Last updated: Fri Oct 12 15:40:39 2018
Last change: Fri Oct 12 15:14:57 2018 by root via crm_resource on 
whale.private


3 nodes configured
8 resources configured (1 DISABLED)

Online: [ rental.private whale.private ]
OFFLINE: [ rider.private ]

and that third node logs:

[TOTEM ] FAILED TO RECEIVE
  [TOTEM ] A new membership (10.5.6.100:2504344) was formed. Members 
left: 2 4

  [TOTEM ] Failed to receive the leave message. failed: 2 4
  [QUORUM] Members[1]: 1
  [MAIN  ] Completed service synchronization, ready to provide service.
  [TOTEM ] A new membership (10.5.6.49:2504348) was formed. Members 
joined: 2 4

  [TOTEM ] FAILED TO RECEIVE

and it just keeps going like that.
Sometimes a reboot (or stop of services + wait + start) of that third
node would help.
But I get this situation almost every time a node gets (orderly)
shut down or rebooted.

Network-wise, connectivity seems okay. Where to start?



A little more information would be helpful (corosync version, used
protocol - udpu/udp, corosync.conf, ...), but here are a few possible
problems:

- If UDP (multicast) is used, try UDPU
- Check the firewall
- Try reducing the MTU used by corosync (option netmtu in corosync.conf)

Regards,
  Honza

One thing I remember - could it be because, at the time of cluster
formation (and for some time after), one of the nodes had a different
ruby version from what the other nodes had?


I cannot remember when this problem started to appear - whether it was
from the beginning or later, I cannot say.


I'm on CentOS 7.6. I do not think I use UDP (other than the creation of
some resources and constraints, it's a "vanilla" cluster). I use a
"non-default" MTU on the ifaces the cluster uses, and also, those
interfaces are net-team devices. But still... why would it always be
that one node (all are virtually identical)?


many thanks, L.





many thanks, L




___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE

2018-10-15 Thread Jan Friesse

lejeczek,


hi guys,
I have a 3-node cluster (CentOS 7.5); 2 nodes seem fine but the third (or
probably something else in between) is not right.

I see this:

  $ pcs status --all
Cluster name: CC
Stack: corosync
Current DC: whale.private (version 1.1.18-11.el7_5.3-2b07d5c5a9) - 
partition with quorum

Last updated: Fri Oct 12 15:40:39 2018
Last change: Fri Oct 12 15:14:57 2018 by root via crm_resource on 
whale.private


3 nodes configured
8 resources configured (1 DISABLED)

Online: [ rental.private whale.private ]
OFFLINE: [ rider.private ]

and that third node logs:

[TOTEM ] FAILED TO RECEIVE
  [TOTEM ] A new membership (10.5.6.100:2504344) was formed. Members 
left: 2 4

  [TOTEM ] Failed to receive the leave message. failed: 2 4
  [QUORUM] Members[1]: 1
  [MAIN  ] Completed service synchronization, ready to provide service.
  [TOTEM ] A new membership (10.5.6.49:2504348) was formed. Members 
joined: 2 4

  [TOTEM ] FAILED TO RECEIVE

and it just keeps going like that.
Sometimes a reboot (or stop of services + wait + start) of that third
node would help.
But I get this situation almost every time a node gets (orderly)
shut down or rebooted.

Network-wise, connectivity seems okay. Where to start?



A little more information would be helpful (corosync version, used
protocol - udpu/udp, corosync.conf, ...), but here are a few possible
problems:

- If UDP (multicast) is used, try UDPU
- Check the firewall
- Try reducing the MTU used by corosync (option netmtu in corosync.conf)
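
For reference, the relevant part of the totem section could look roughly
like this (a minimal sketch only - the cluster name is taken from your
output, the values are just illustrative, and interface/nodelist settings
are omitted):

  totem {
      version: 2
      cluster_name: CC
      transport: udpu      # unicast UDP instead of multicast
      netmtu: 1400         # try a value below the interface MTU
  }

Corosync needs to be restarted on all nodes for transport/netmtu changes
to take effect.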

Regards,
  Honza



many thanks, L


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE

2018-10-12 Thread Ken Gaillot
On Fri, 2018-10-12 at 15:51 +0100, lejeczek wrote:
> hi guys,
> I have a 3-node cluster (CentOS 7.5); 2 nodes seem fine but the
> third (or probably something else in between) is not right.
> I see this:
> 
>   $ pcs status --all
> Cluster name: CC
> Stack: corosync
> Current DC: whale.private (version 
> 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
> Last updated: Fri Oct 12 15:40:39 2018
> Last change: Fri Oct 12 15:14:57 2018 by root via 
> crm_resource on whale.private
> 
> 3 nodes configured
> 8 resources configured (1 DISABLED)
> 
> Online: [ rental.private whale.private ]
> OFFLINE: [ rider.private ]
> 
> and that third node logs:
> 
> [TOTEM ] FAILED TO RECEIVE
>   [TOTEM ] A new membership (10.5.6.100:2504344) was formed. 
> Members left: 2 4
>   [TOTEM ] Failed to receive the leave message. failed: 2 4
>   [QUORUM] Members[1]: 1
>   [MAIN  ] Completed service synchronization, ready to 
> provide service.
>   [TOTEM ] A new membership (10.5.6.49:2504348) was formed. 
> Members joined: 2 4
>   [TOTEM ] FAILED TO RECEIVE
> 
> and it just keeps going like that.
> Sometimes a reboot (or stop of services + wait + start) of that
> third node would help.
> But I get this situation almost every time a node gets
> (orderly) shut down or rebooted.
> Network-wise, connectivity seems okay. Where to start?
> 
> many thanks, L

Odd. I'd recommend turning on debug logging in corosync.conf, and
posting the log here. Hopefully one of the corosync developers can
chime in at that point.
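
Something along these lines in the logging section usually does it (a
sketch only - the logfile path may differ on your install):

  logging {
      to_logfile: yes
      logfile: /var/log/cluster/corosync.log
      to_syslog: yes
      debug: on
  }

Just remember to turn debug off again afterwards, it gets chatty.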
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE

2018-10-12 Thread lejeczek

hi guys,
I have a 3-node cluster (CentOS 7.5); 2 nodes seem fine but the
third (or probably something else in between) is not right.

I see this:

 $ pcs status --all
Cluster name: CC
Stack: corosync
Current DC: whale.private (version 
1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum

Last updated: Fri Oct 12 15:40:39 2018
Last change: Fri Oct 12 15:14:57 2018 by root via 
crm_resource on whale.private


3 nodes configured
8 resources configured (1 DISABLED)

Online: [ rental.private whale.private ]
OFFLINE: [ rider.private ]

and that third node logs:

[TOTEM ] FAILED TO RECEIVE
 [TOTEM ] A new membership (10.5.6.100:2504344) was formed. 
Members left: 2 4

 [TOTEM ] Failed to receive the leave message. failed: 2 4
 [QUORUM] Members[1]: 1
 [MAIN  ] Completed service synchronization, ready to 
provide service.
 [TOTEM ] A new membership (10.5.6.49:2504348) was formed. 
Members joined: 2 4

 [TOTEM ] FAILED TO RECEIVE

and it just keeps going like that.
Sometimes a reboot (or stop of services + wait + start) of that
third node would help.
But I get this situation almost every time a node gets
(orderly) shut down or rebooted.

Network-wise, connectivity seems okay. Where to start?

many thanks, L
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org