Re: [ClusterLabs] Redudant Ring Network failure

2020-06-11 Thread Valentin Vidić
On Thu, Jun 11, 2020 at 09:46:14AM +0200, Jan Friesse wrote:
> > Jan,
> > 
> > actually, we are using this.
> > 
> > [root@lvm-nfscpdata-05ct::~ 100 ]# apt show corosync
> > Package: corosync
> > Version: 3.0.1-2+deb10u1
> > 
> > [root@lvm-nfscpdata-05ct::~]# apt show libknet1
> > Package: libknet1
> > Version: 1.8-2
> > 
> > These are the newest versions provided on the mirror.
> 
> yup, but these are pretty old anyway and there are quite a few bugs (and many
> of them may explain the behavior you see).
> 
> I would suggest either compiling the upstream code or (if you need to stick
> with packages) trying the sid packages or the Proxmox repositories (where
> knet is at version 1.15 and corosync at 3.0.3).

I can't seem to reproduce this with the Debian versions mentioned above on
3 nodes with 3 links. Can you check, for each of the interfaces, that
traffic flows in both directions and that no DROP rule in the firewall
discards it after arrival?

For example on node2:

root@node2:~# tcpdump -pni ens7 src 192.168.44.11
root@node2:~# tcpdump -pni ens7 dst 192.168.44.11
root@node2:~# tcpdump -pni ens7 src 192.168.44.13
root@node2:~# tcpdump -pni ens7 dst 192.168.44.13

root@node2:~# tcpdump -pni ens8 src 10.0.44.11
root@node2:~# tcpdump -pni ens8 dst 10.0.44.11
root@node2:~# tcpdump -pni ens8 src 10.0.44.13
root@node2:~# tcpdump -pni ens8 dst 10.0.44.13
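
The per-interface checks above can also be generated with a small loop instead of being typed by hand. This is only a sketch: the function name `print_capture_cmds` is hypothetical, the interface names and addresses are the ones from the node2 example, and `-c 5` (stop after five packets) is an addition so that each capture ends on its own.

```shell
#!/bin/sh
# Print one tcpdump command per interface, peer and direction.
# Usage: print_capture_cmds IFACE PEER_ADDR...
# The commands are only printed, not run, so they can be reviewed first.
print_capture_cmds() {
    iface=$1
    shift
    for peer in "$@"; do
        for dir in src dst; do
            # -p: no promiscuous mode, -n: no name resolution,
            # -i: interface, -c 5: stop after five matching packets
            echo "tcpdump -pni $iface -c 5 $dir $peer"
        done
    done
}

# Example values taken from the node2 commands above.
print_capture_cmds ens7 192.168.44.11 192.168.44.13
print_capture_cmds ens8 10.0.44.11 10.0.44.13
```

If the `src` capture for a peer stays empty while the matching `dst` capture shows packets, traffic is leaving on that link but nothing is coming back from that peer.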

Corosync logs from the time this happens would also be useful.

-- 
Valentin
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Redudant Ring Network failure

2020-06-10 Thread ROHWEDER-NEUBECK, MICHAEL (EXTERN)
Hi,
yesterday we restarted all clusters and all rings were OK.
Today one of them has a broken ring again.

Ring 0 is broken: 033

This is my config:

[root@lvm-nfscpdata-05ct::~]# less /etc/corosync/corosync.conf
totem {
  version: 2
  transport: knet
  cluster_name: nfscpdata
  token: 2000
  token_retransmits_before_loss_const: 10
  max_messages: 150
  window_size: 300
  crypto_cipher: aes256
  crypto_hash: sha256
  interface {
    ringnumber: 0
  }
  interface {
    ringnumber: 1
  }
}

logging {
  fileline: off
  to_stderr: yes
  to_logfile: no
  to_syslog: yes
  syslog_facility: daemon
  syslog_priority: info
  debug: off
  timestamp: on
  logger_subsys {
    subsys: QUORUM
    debug: off
  }
}

quorum {
  # Enable and configure quorum subsystem (default: off)
  # see also corosync.conf.5 and votequorum.5
  provider: corosync_votequorum
}

nodelist {
  node {
    ring0_addr: 10.28.63.138
    ring1_addr: 10.28.98.138
    name: lvm-nfscpdata-04ct
    nodeid: 1688
  }
  node {
    ring0_addr: 10.28.63.139
    ring1_addr: 10.28.98.139
    name: lvm-nfscpdata-05ct
    nodeid: 1689
  }
  node {
    ring0_addr: 10.28.63.140
    ring1_addr: 10.28.98.140
    name: lvm-nfscpdata-06ct
    nodeid: 1690
  }
}

Ring 1 is managed by the host firewall, but the ports are opened.
Ring 0 has no firewall settings.
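
To test each link separately, it helps to have the peer addresses of both rings in one list. A rough sketch with awk, assuming the `nodelist` layout posted above (one `ringN_addr` entry per line); the function name `list_ring_addrs` is made up for illustration.

```shell
#!/bin/sh
# Print "ring<N> <address>" for every ringN_addr entry read from stdin.
list_ring_addrs() {
    awk '
        # Match lines such as "ring0_addr: 10.28.63.138"
        $1 ~ /^ring[0-9]+_addr:$/ {
            ring = $1
            sub(/_addr:$/, "", ring)   # "ring0_addr:" -> "ring0"
            print ring, $2
        }
    '
}

# Example (run as a user that may read the file):
#   list_ring_addrs < /etc/corosync/corosync.conf
```

The output can then be fed into ping or tcpdump checks, one ring at a time.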




Sitz der Gesellschaft / Corporate Headquarters: Deutsche Lufthansa 
Aktiengesellschaft, Koeln, Registereintragung / Registration: Amtsgericht Koeln 
HR B 2168
Vorsitzender des Aufsichtsrats / Chairman of the Supervisory Board: Dr. 
Karl-Ludwig Kley
Vorstand / Executive Board: Carsten Spohr (Vorsitzender / Chairman), Thorsten 
Dirks, Christina Foerster, Harry Hohmeister, Dr. Detlef Kayser, Dr. Michael 
Niggemann



Re: [ClusterLabs] Redudant Ring Network failure

2020-06-10 Thread ROHWEDER-NEUBECK, MICHAEL (EXTERN)
Jan,

actually, we are using this.

[root@lvm-nfscpdata-05ct::~ 100 ]# apt show corosync
Package: corosync
Version: 3.0.1-2+deb10u1

[root@lvm-nfscpdata-05ct::~]# apt show libknet1
Package: libknet1
Version: 1.8-2

These are the newest versions provided on the mirror.








Re: [ClusterLabs] Redudant Ring Network failure

2020-06-10 Thread Jan Friesse

Michael,
what version of knet are you using? We had quite a few problems with
older versions of knet, so the current stable release (1.16) is recommended.
The same applies to corosync, because 3.0.4 has a vastly improved display
of link status.



> Hello,
> We have massive problems with the redundant ring operation of our Corosync / 
> pacemaker 3 Node NFS clusters.
> 
> Most of the nodes either have an entire ring offline or only 1 node in a ring.
> Example: (Node1 Ring0 333 Ring1 n33 | Node2 Ring0 033 Ring1 3n3 | Node3 Ring0 
> 333 Ring 1 33n)


That doesn't seem completely wrong. You can ignore the 'n' for ring 1, because
that is localhost, which is connected only on Ring 0 (3.0.4 makes this
output more consistent), so all nodes are connected at least via Ring 1.
Ring 0 on node 2 seems to have some trouble with the connection to node 1,
but node 1 (and 3) seem to be connected to node 2 just fine, so I think
it is either some bug in knet (probably already fixed) or some kind of
firewall blocking just the connection from node 2 to node 1 on ring 0.
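
The status strings discussed here can be decoded position by position. The sketch below assumes the encoding described in the corosync-cfgtool(8) man page for 3.0.x: one character per node, in node order, where a digit is a bit mask (1 = link enabled, 2 = link connected) and `n` marks the local node. That mapping should be checked against the man page of the installed version, and the function name `decode_link_status` is hypothetical.

```shell
#!/bin/sh
# Decode a pre-3.0.4 knet link status string such as "033" or "3n3".
# Assumed encoding: one character per node, digit = bit mask
# (1 = enabled, 2 = connected), "n" = local node.
decode_link_status() {
    status=$1
    pos=1
    while [ -n "$status" ]; do
        rest=${status#?}            # string without its first character
        char=${status%"$rest"}      # the first character itself
        case $char in
            n) meaning="local node" ;;
            0) meaning="not enabled, not connected" ;;
            1) meaning="enabled, not connected" ;;
            2) meaning="connected, not enabled" ;;
            3) meaning="enabled and connected" ;;
            *) meaning="unknown or error" ;;
        esac
        echo "position $pos: $char ($meaning)"
        status=$rest
        pos=$((pos + 1))
    done
}

decode_link_status 033   # Ring 0 as reported by node 2
```

Read this way, `033` on node 2 says that only the first position (node 1) has no working link on ring 0, which matches the diagnosis above.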





> corosync-cfgtool -R doesn't help
> All nodes are VMs that build the ring together using 2 VLANs.
> Which logs do you need to hopefully help me?


syslog/journal should contain everything needed especially when debug is 
enabled (corosync.conf - logging.debug: on)
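
In terms of the corosync.conf posted in this thread, the hint amounts to the following change in the `logging` section. Only `debug` differs from the posted values; the fragment is a sketch and leaves out the other keys, which would stay as they are.

```
logging {
  to_syslog: yes
  debug: on
}
```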


Regards,
  Honza



> Corosync Cluster Engine, version '3.0.1'
> Copyright (c) 2006-2018 Red Hat, Inc.
> Debian Buster







Re: [ClusterLabs] Redudant Ring Network failure

2020-06-09 Thread ROHWEDER-NEUBECK, MICHAEL (EXTERN)
Hi,

we are using unicast ("knet")

Greetings

Michael








Re: [ClusterLabs] Redudant Ring Network failure

2020-06-09 Thread Strahil Nikolov
It will be hard to guess whether you are using sctp or udp/udpu.
If possible, share the corosync.conf (you can remove sensitive data, but
keep it meaningful).

Are you using a firewall? If yes, check the following:
1. The node firewall is not blocking the communication on the specific interfaces.
2. Verify with tcpdump that the heartbeats are received from the remote side.
3. Check for retransmissions or packet loss.

Usually you can find more details in the log specified in corosync.conf or in
/var/log/messages (and also the journal).
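
For the first check, one way to spot blocking rules is to filter the saved ruleset. This is a minimal sketch that assumes an iptables-compatible firewall whose rules can be dumped with `iptables-save`; a native nftables ruleset would need `nft list ruleset` and a different pattern. The function name `find_blocking_rules` is hypothetical.

```shell
#!/bin/sh
# Print lines from iptables-save output (read on stdin) that drop or
# reject packets, including chains whose default policy is DROP.
find_blocking_rules() {
    grep -E -e '-j (DROP|REJECT)( |$)' -e '^:[^ ]+ DROP '
}

# Example (as root):
#   iptables-save | find_blocking_rules
```

Any rule in the output that names one of the cluster interfaces or ring addresses is a candidate for the one-directional failure described in this thread.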

Best Regards,
Strahil Nikolov



Re: [ClusterLabs] Redudant Ring Network failure

2020-06-09 Thread Strahil Nikolov
Are you using multicast ?

Best Regards,
Strahil Nikolov

On 9 June 2020 10:28:25 GMT+03:00, "ROHWEDER-NEUBECK, MICHAEL (EXTERN)" wrote:
>Hello,
>We have massive problems with the redundant ring operation of our
>Corosync / pacemaker 3 Node NFS clusters.
>
>Most of the nodes either have an entire ring offline or only 1 node in
>a ring.
>Example: (Node1 Ring0 333 Ring1 n33 | Node2 Ring0 033 Ring1 3n3 | Node3
>Ring0 333 Ring 1 33n)
>
>corosync-cfgtool -R doesn't help
>All nodes are VMs that build the ring together using 2 VLANs.
>Which logs do you need to hopefully help me?
>
>Corosync Cluster Engine, version '3.0.1'
>Copyright (c) 2006-2018 Red Hat, Inc.
>Debian Buster
>
>
>--
>Kind regards
>  Michael Rohweder-Neubeck
>
>NSB GmbH – Nguyen Softwareentwicklung & Beratung GmbH Röntgenstraße 27
>D-64291 Darmstadt
>E-Mail:
>m...@nsb-software.de
>Manager: Van-Hien Nguyen, Jörg Jaspert
>USt-ID: DE 195 703 354; HRB 7131 Amtsgericht Darmstadt
>
>
>
>