David,

Hello,
This problem is driving me crazy, and I hope to resolve it here with your
help:

I have 2 nodes using the Corosync redundant ring (RRP) feature.

Each node has 2 similarly connected/configured NICs. The two nodes are
connected to each other by two crossover cables.

I believe this is the root of the problem. Are you using NetworkManager? If so, have you installed NetworkManager-config-server? If not, please install it and test again.
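(For reference, a minimal sketch of what that package effectively does: it ships a small NetworkManager drop-in telling NM not to touch configured interfaces when the carrier goes away. The exact path and file name below are assumptions and may differ between distributions.)

# e.g. /etc/NetworkManager/conf.d/00-server.conf  (path is an assumption)
[main]
# don't auto-create default connections for unconfigured NICs
no-auto-default=*
# keep interfaces configured even when the link/carrier is lost
ignore-carrier=*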


I configured both nodes with rrp_mode: passive. Everything works well
at this point, but when I shut down one node to test failover, and that node
comes back online, corosync marks the interface as FAULTY and RRP

I believe it's because, with a crossover-cable configuration, when the other side is shut down NetworkManager detects the loss of carrier and does an ifdown of the interface. Corosync is unable to handle ifdown properly. Ifdown is bad with a single ring, but it's a killer with RRP (127.0.0.1 poisons every node in the cluster).
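(A quick way to check whether NetworkManager is the one doing the ifdown; the interface name below is just a placeholder for your second heartbeat NIC:)

# see whether the heartbeat NICs show up as managed by NetworkManager
nmcli device status

# as a test, take the NIC out of NetworkManager's hands completely
nmcli device set eth1 managed no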

fails to recover the initial state:

1. Initial scenario:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
         id      = 192.168.0.1
         status  = ring 0 active with no faults
RING ID 1
         id      = 192.168.1.1
         status  = ring 1 active with no faults


2. When I shut down node 2, everything continues with no faults. Sometimes the
ring IDs bind to 127.0.0.1 and then bind back to their respective
heartbeat IPs.

Again, result of ifdown.


3. When node 2 is back online:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
         id      = 192.168.0.1
         status  = ring 0 active with no faults
RING ID 1
         id      = 192.168.1.1
         status  = Marking ringid 1 interface 192.168.1.1 FAULTY


# service corosync status
● corosync.service - Corosync Cluster Engine
    Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
preset: enabled)
    Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
      Docs: man:corosync
            man:corosync.conf
            man:corosync_overview
  Main PID: 1439 (corosync)
     Tasks: 2 (limit: 4915)
    CGroup: /system.slice/corosync.service
            └─1439 /usr/sbin/corosync -f


Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
network interface [192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
network interface [192.168.1.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.1.1] is now up.
Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
new membership (192.168.0.1:601760) was formed. Members
Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601760) was formed. Members
Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
new membership (192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
Marking ringid 1 interface 192.168.1.1 FAULTY
Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface
192.168.1.1 FAULTY


If I execute corosync-cfgtool, it clears the faulty state, but after a few
seconds the ring is marked FAULTY again.
The only thing that resolves the problem is restarting the service with
service corosync restart.
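(For reference, assuming the fault is being cleared with corosync-cfgtool's ring re-enable option, the manual recovery looks roughly like this:)

# re-enable redundant ring operation cluster-wide after a fault
corosync-cfgtool -r

# check the ring status again
corosync-cfgtool -s

# last resort: restart corosync itself
service corosync restart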

Here are some of my configuration settings from node 1 (I have already tried
changing rrp_mode):

*- corosync.conf*

totem {
         version: 2
         cluster_name: node
         token: 5000
         token_retransmits_before_loss_const: 10
         secauth: off
         threads: 0
         rrp_mode: passive
         nodeid: 1
         interface {
                 ringnumber: 0
                 bindnetaddr: 192.168.0.0
                 #mcastaddr: 226.94.1.1
                 mcastport: 5405
                 broadcast: yes
         }
         interface {
                 ringnumber: 1
                 bindnetaddr: 192.168.1.0
                 #mcastaddr: 226.94.1.2
                 mcastport: 5407
                 broadcast: yes
         }
}

logging {
         fileline: off
         to_stderr: yes
         to_syslog: yes
         to_logfile: yes
         logfile: /var/log/corosync/corosync.log
         debug: off
         timestamp: on
         logger_subsys {
                 subsys: AMF
                 debug: off
         }
}

amf {
         mode: disabled
}

quorum {
         provider: corosync_votequorum
         expected_votes: 2
}

nodelist {
         node {
                 nodeid: 1
                 ring0_addr: 192.168.0.1
                 ring1_addr: 192.168.1.1
         }

         node {
                 nodeid: 2
                 ring0_addr: 192.168.0.2
                 ring1_addr: 192.168.1.2
         }
}

aisexec {
         user: root
         group: root
}

service {
         name: pacemaker
         ver: 1
}



*- /etc/hosts*


127.0.0.1       localhost
10.4.172.5      node1.upc.edu node1
10.4.172.6      node2.upc.edu node2


So the machines have 3 NICs? Two for corosync/cluster traffic and one for regular traffic/services/the outside world?


Thank you in advance for your help!

To conclude:
- If you are using NetworkManager, try installing NetworkManager-config-server; it will probably help.
- If you are brave enough, try corosync 3.x (the current Alpha4 is pretty stable - actually, some other projects only reach this level of stability with SP1 :) ). It has no RRP but uses knet to support redundant links (up to 8 links can be configured) and has no problems with ifdown. A rough sketch of an equivalent knet config is included below.
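A rough, untested sketch of what an equivalent corosync 3.x configuration with knet could look like (same addresses as your current setup; the logging and quorum sections stay as they are, and option names should be double-checked against the corosync 3 man pages):

totem {
         version: 2
         cluster_name: node
         transport: knet
         crypto_cipher: none
         crypto_hash: none
}

nodelist {
         node {
                 nodeid: 1
                 name: node1
                 ring0_addr: 192.168.0.1
                 ring1_addr: 192.168.1.1
         }
         node {
                 nodeid: 2
                 name: node2
                 ring0_addr: 192.168.0.2
                 ring1_addr: 192.168.1.2
         }
}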

Honza




_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

