David,

Hello,
This problem is driving me crazy, and I hope to resolve it here with your
help:

I have 2 nodes using the Corosync redundant ring (RRP) feature.

Each node has 2 similarly connected/configured NICs. The two nodes are
connected to each other by two crossover cables.

I believe this is the root of the problem. Are you using NetworkManager? If so, have you installed NetworkManager-config-server? If not, please install it and test again.
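(For reference, a minimal sketch of what that package effectively does: it ships a small NetworkManager drop-in telling NM not to touch configured interfaces when the carrier goes away. The exact path and file name below are assumptions and may differ between distributions.)

# e.g. /etc/NetworkManager/conf.d/00-server.conf  (path is an assumption)
[main]
# don't auto-create default connections for unconfigured NICs
no-auto-default=*
# keep interfaces configured even when the link/carrier is lost
ignore-carrier=*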


I configured both nodes with rrp_mode: passive. Everything works well
at this point, but when I shut down one node to test failover, and that node
comes back online, corosync marks the interface as FAULTY and RRP

I believe it's because, with a crossover-cable configuration, when the other side is shut down NetworkManager detects the loss of carrier and does an ifdown of the interface. Corosync is unable to handle ifdown properly. Ifdown is bad with a single ring, but it's a killer with RRP (127.0.0.1 poisons every node in the cluster).
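(A quick way to check whether NetworkManager is the one doing the ifdown; the interface name below is just a placeholder for your second heartbeat NIC:)

# see whether the heartbeat NICs show up as managed by NetworkManager
nmcli device status

# as a test, take the NIC out of NetworkManager's hands completely
nmcli device set eth1 managed no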

fails to recover the initial state:

1. Initial scenario:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
         id      = 192.168.0.1
         status  = ring 0 active with no faults
RING ID 1
         id      = 192.168.1.1
         status  = ring 1 active with no faults


2. When I shut down node 2, everything continues with no faults. Sometimes the
ring IDs bind to 127.0.0.1 and then bind back to their respective
heartbeat IPs.

Again, result of ifdown.


3. When node 2 is back online:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
         id      = 192.168.0.1
         status  = ring 0 active with no faults
RING ID 1
         id      = 192.168.1.1
         status  = Marking ringid 1 interface 192.168.1.1 FAULTY


# service corosync status
● corosync.service - Corosync Cluster Engine
    Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
preset: enabled)
    Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
      Docs: man:corosync
            man:corosync.conf
            man:corosync_overview
  Main PID: 1439 (corosync)
     Tasks: 2 (limit: 4915)
    CGroup: /system.slice/corosync.service
            └─1439 /usr/sbin/corosync -f


Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
network interface [192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
network interface [192.168.1.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.1.1] is now up.
Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
new membership (192.168.0.1:601760) was formed. Members
Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601760) was formed. Members
Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
new membership (192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
Marking ringid 1 interface 192.168.1.1 FAULTY
Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface
192.168.1.1 FAULTY


If I execute corosync-cfgtool, it clears the faulty state, but after a few
seconds the ring is marked FAULTY again.
The only thing that resolves the problem is restarting the service with
service corosync restart.
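(For reference, assuming the fault is being cleared with corosync-cfgtool's ring re-enable option, the manual recovery looks roughly like this:)

# re-enable redundant ring operation cluster-wide after a fault
corosync-cfgtool -r

# check the ring status again
corosync-cfgtool -s

# last resort: restart corosync itself
service corosync restart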

Here are some of my configuration settings from node 1 (I have already tried
changing rrp_mode):

*- corosync.conf*

totem {
         version: 2
         cluster_name: node
         token: 5000
         token_retransmits_before_loss_const: 10
         secauth: off
         threads: 0
         rrp_mode: passive
         nodeid: 1
         interface {
                 ringnumber: 0
                 bindnetaddr: 192.168.0.0
                 #mcastaddr: 226.94.1.1
                 mcastport: 5405
                 broadcast: yes
         }
         interface {
                 ringnumber: 1
                 bindnetaddr: 192.168.1.0
                 #mcastaddr: 226.94.1.2
                 mcastport: 5407
                 broadcast: yes
         }
}

logging {
         fileline: off
         to_stderr: yes
         to_syslog: yes
         to_logfile: yes
         logfile: /var/log/corosync/corosync.log
         debug: off
         timestamp: on
         logger_subsys {
                 subsys: AMF
                 debug: off
         }
}

amf {
         mode: disabled
}

quorum {
         provider: corosync_votequorum
         expected_votes: 2
}

nodelist {
         node {
                 nodeid: 1
                 ring0_addr: 192.168.0.1
                 ring1_addr: 192.168.1.1
         }

         node {
                 nodeid: 2
                 ring0_addr: 192.168.0.2
                 ring1_addr: 192.168.1.2
         }
}

aisexec {
         user: root
         group: root
}

service {
         name: pacemaker
         ver: 1
}



*- /etc/hosts*


127.0.0.1       localhost
10.4.172.5      node1.upc.edu node1
10.4.172.6      node2.upc.edu node2


So the machines have 3 NICs? Two for corosync/cluster traffic and one for regular traffic/services/the outside world?


Thank you in advance for your help!

To conclude:
- If you are using NetworkManager, try installing NetworkManager-config-server; it will probably help.
- If you are brave enough, try corosync 3.x (the current Alpha4 is pretty stable - actually, some other projects only reach this level of stability with SP1 :) ). It has no RRP but uses knet to support redundant links (up to 8 links can be configured) and has no problems with ifdown. A rough sketch of an equivalent knet config is included below.
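A rough, untested sketch of what an equivalent corosync 3.x configuration with knet could look like (same addresses as your current setup; the logging and quorum sections stay as they are, and option names should be double-checked against the corosync 3 man pages):

totem {
         version: 2
         cluster_name: node
         transport: knet
         crypto_cipher: none
         crypto_hash: none
}

nodelist {
         node {
                 nodeid: 1
                 name: node1
                 ring0_addr: 192.168.0.1
                 ring1_addr: 192.168.1.1
         }
         node {
                 nodeid: 2
                 name: node2
                 ring0_addr: 192.168.0.2
                 ring1_addr: 192.168.1.2
         }
}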

Honza




_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

