I'm currently running an Ubuntu 18.04 server configured with netplan. Here is my current YAML configuration:
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses: [192.168.0.1/24]
    enp4s0f0:
      addresses: [192.168.1.1/24]
    enp5s0f0: {}
  vlans:
    vlan.XXX:
      id: XXX
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]
      gateway4: 10.1.128.1
      nameservers:
        addresses: [ 8.8.8.8, 8.8.4.4 ]
        search: [ foo.com, bar.com ]
    vlan.YYY:
      id: YYY
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]

So eno1 and enp4s0f0 are the two Ethernet ports connected to node2 with crossover cables, and enp5s0f0 is the port used to reach the outside world/services through the VLANs defined in the same file. In short, I'm managing the network with systemd-networkd, the default service on Ubuntu 18.04 server. I can't find any NetworkManager-config-server package in my repositories either. So the only option I have left, I suppose, is to test corosync 3.x and see whether it handles redundant rings better than RRP. Thank you for your quick response!
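By the way, before jumping to corosync 3.x I want to rule out the networkd side of your "ifdown" theory: as far as I understand, systemd-networkd can also de-configure a link's static address when the crossover cable loses carrier (i.e. while node2 is powered off), which would look like an ifdown to corosync. This is a minimal sketch of what I plan to try on the two heartbeat NICs - untested, and it assumes the systemd 237 shipped with 18.04 already honors ConfigureWithoutCarrier=:

[Network]
# Keep the 192.168.0.1/24 address configured even while the crossover
# link to node2 has no carrier, so corosync never sees the address go away.
ConfigureWithoutCarrier=true

The idea would be to get that into the .network unit netplan generates for eno1 (I think it ends up under /run/systemd/network/10-netplan-eno1.network here), and the same for enp4s0f0 / 192.168.1.1.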
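And if I do end up testing corosync 3.x, my understanding is that the two rings move out of the interface{}/bindnetaddr blocks and are taken from the nodelist instead, with knet as the transport. A rough sketch of the totem/nodelist part I would try, keeping the same addressing as today (option names from my reading of the corosync 3 man pages, so worth double-checking):

totem {
        version: 2
        cluster_name: node
        transport: knet
        # knet's replacement for rrp_mode (passive / active / rr), if I read the docs right
        link_mode: passive
        token: 5000
}

nodelist {
        node {
                nodeid: 1
                name: node1
                ring0_addr: 192.168.0.1
                ring1_addr: 192.168.1.1
        }
        node {
                nodeid: 2
                name: node2
                ring0_addr: 192.168.0.2
                ring1_addr: 192.168.1.2
        }
}

quorum {
        provider: corosync_votequorum
        expected_votes: 2
}

If that looks roughly right to you, I'll report back how knet behaves with the same shutdown test.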
2018-08-23 8:40 GMT+02:00 Jan Friesse <jfrie...@redhat.com>:

> David,
>
> Hello,
>
>> I'm going crazy over this problem, which I hope to resolve here with
>> your help, guys:
>>
>> I have 2 nodes with the Corosync redundant ring feature.
>>
>> Each node has 2 similarly connected/configured NICs. The two nodes are
>> connected to each other by two crossover cables.
>
> I believe this is the root of the problem. Are you using NetworkManager?
> If so, have you installed NetworkManager-config-server? If not, please
> install it and test again.
>
>> I configured both nodes with rrp_mode passive. Everything works well at
>> this point, but when I shut down one node to test failover, and this
>> node comes back online, corosync marks the interface as FAULTY and RRP
>> fails to recover the initial state:
>
> I believe it's because, with a crossover-cable configuration, when the
> other side is shut down NetworkManager detects it and does an ifdown of
> the interface. And corosync is unable to handle ifdown properly. Ifdown
> is bad with a single ring, but it's just a killer with RRP (127.0.0.1
> poisons every node in the cluster).
>
>> 1. Initial scenario:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>>         id      = 192.168.0.1
>>         status  = ring 0 active with no faults
>> RING ID 1
>>         id      = 192.168.1.1
>>         status  = ring 1 active with no faults
>>
>> 2. When I shut down node 2, everything continues with no faults.
>> Sometimes the ring IDs bind to 127.0.0.1 and then bind back to their
>> respective heartbeat IPs.
>
> Again, a result of ifdown.
>
>> 3. When node 2 is back online:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>>         id      = 192.168.0.1
>>         status  = ring 0 active with no faults
>> RING ID 1
>>         id      = 192.168.1.1
>>         status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>
>> # service corosync status
>> ● corosync.service - Corosync Cluster Engine
>>    Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
>>    Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
>>      Docs: man:corosync
>>            man:corosync.conf
>>            man:corosync_overview
>>  Main PID: 1439 (corosync)
>>     Tasks: 2 (limit: 4915)
>>    CGroup: /system.slice/corosync.service
>>            └─1439 /usr/sbin/corosync -f
>>
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice [TOTEM ] The network interface [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]:  [TOTEM ] The network interface [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice [TOTEM ] The network interface [192.168.1.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]:  [TOTEM ] The network interface [192.168.1.1] is now up.
>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:26 node1 corosync[1439]:  [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:32 node1 corosync[1439]:  [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>> Aug 22 14:44:34 node1 corosync[1439]:  [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>>
>> If I execute corosync-cfgtool, it clears the faulty state, but after a
>> few seconds the ring goes back to FAULTY.
>> The only thing that resolves the problem is restarting the service with
>> service corosync restart.
>>
>> Here are some of my configuration settings on node 1 (I already tried
>> changing rrp_mode):
>>
>> *- corosync.conf*
>>
>> totem {
>>         version: 2
>>         cluster_name: node
>>         token: 5000
>>         token_retransmits_before_loss_const: 10
>>         secauth: off
>>         threads: 0
>>         rrp_mode: passive
>>         nodeid: 1
>>         interface {
>>                 ringnumber: 0
>>                 bindnetaddr: 192.168.0.0
>>                 #mcastaddr: 226.94.1.1
>>                 mcastport: 5405
>>                 broadcast: yes
>>         }
>>         interface {
>>                 ringnumber: 1
>>                 bindnetaddr: 192.168.1.0
>>                 #mcastaddr: 226.94.1.2
>>                 mcastport: 5407
>>                 broadcast: yes
>>         }
>> }
>>
>> logging {
>>         fileline: off
>>         to_stderr: yes
>>         to_syslog: yes
>>         to_logfile: yes
>>         logfile: /var/log/corosync/corosync.log
>>         debug: off
>>         timestamp: on
>>         logger_subsys {
>>                 subsys: AMF
>>                 debug: off
>>         }
>> }
>>
>> amf {
>>         mode: disabled
>> }
>>
>> quorum {
>>         provider: corosync_votequorum
>>         expected_votes: 2
>> }
>>
>> nodelist {
>>         node {
>>                 nodeid: 1
>>                 ring0_addr: 192.168.0.1
>>                 ring1_addr: 192.168.1.1
>>         }
>>
>>         node {
>>                 nodeid: 2
>>                 ring0_addr: 192.168.0.2
>>                 ring1_addr: 192.168.1.2
>>         }
>> }
>>
>> aisexec {
>>         user: root
>>         group: root
>> }
>>
>> service {
>>         name: pacemaker
>>         ver: 1
>> }
>>
>> *- /etc/hosts*
>>
>> 127.0.0.1       localhost
>> 10.4.172.5      node1.upc.edu   node1
>> 10.4.172.6      node2.upc.edu   node2
>
> So the machines have 3 NICs? 2 for corosync/cluster traffic and one for
> regular traffic/services/outside world?
>
>> Thank you for your help in advance!
>
> To conclude:
> - If you are using NetworkManager, try installing
>   NetworkManager-config-server; it will probably help.
> - If you are brave enough, try corosync 3.x (the current Alpha4 is pretty
>   stable - actually, some other projects gain this stability with SP1 :) ),
>   which has no RRP but uses knet to support redundant links (up to 8 links
>   can be configured) and doesn't have problems with ifdown.
>
> Honza

--
*David Tolosa Martínez*
Customer Support & Infrastructure
UPCnet - Edifici Vèrtex
Plaça d'Eusebi Güell, 6, 08034 Barcelona
Tel: 934054555
<https://www.upcnet.es>
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org