On 22.08.2018 15:53, David Tolosa wrote:
> Hello,
> I'm going crazy over this problem, which I hope to resolve here with
> your help, guys:
>
> I have 2 nodes with the Corosync redundant ring feature.
>
> Each node has 2 similarly connected/configured NICs. Both nodes are
> connected to each other by two crossover cables.
>
> I configured both nodes with rrp_mode passive. Everything works well
> at this point, but when I shut down one node to test failover and that
> node comes back online, corosync marks the interface FAULTY and rrp
> fails to recover the initial state:
>
> 1. Initial scenario:
>
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
>         id      = 192.168.0.1
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 192.168.1.1
>         status  = ring 1 active with no faults
>
> 2. When I shut down node 2, everything continues with no faults.
> Sometimes the ring IDs bind to 127.0.0.1 and then bind back to their
> respective heartbeat IPs.
>
> 3. When node 2 is back online:
>
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
>         id      = 192.168.0.1
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 192.168.1.1
>         status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>
> # service corosync status
> ● corosync.service - Corosync Cluster Engine
>    Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
>    Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
>      Docs: man:corosync
>            man:corosync.conf
>            man:corosync_overview
>  Main PID: 1439 (corosync)
>     Tasks: 2 (limit: 4915)
>    CGroup: /system.slice/corosync.service
>            └─1439 /usr/sbin/corosync -f
>
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice [TOTEM ] The network interface [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]: [TOTEM ] The network interface [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice [TOTEM ] The network interface [192.168.1.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]: [TOTEM ] The network interface [192.168.1.1] is now up.
> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
> Aug 22 14:44:26 node1 corosync[1439]: [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:32 node1 corosync[1439]: [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
> Aug 22 14:44:34 node1 corosync[1439]: [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>
> If I execute corosync-cfgtool, it clears the faulty state, but after a
> few seconds the ring returns to FAULTY.
> The only thing that resolves the problem is to restart the service with
> service corosync restart.
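A side note on "If I execute corosync-cfgtool, it clears the faulty state":
I assume you mean corosync-cfgtool -r, which is the switch that resets the
redundant ring state after a fault. Assuming corosync 2.x (which your
votequorum/nodelist config suggests), the check cycle would be something like:

        # corosync-cfgtool -r    # reset redundant ring state cluster-wide
        # corosync-cfgtool -s    # verify both rings report "no faults"

If ring 1 goes FAULTY again a few seconds after the reset, the RRP heartbeat
on that interface is still failing, so the reset by itself cannot fix the
underlying problem - it only re-enables the ring.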
> Here are some of my configuration settings on node 1 (I already tried
> changing rrp_mode):
>
> - corosync.conf
>
> totem {
>         version: 2
>         cluster_name: node
>         token: 5000
>         token_retransmits_before_loss_const: 10
>         secauth: off
>         threads: 0
>         rrp_mode: passive
>         nodeid: 1
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 192.168.0.0
>                 #mcastaddr: 226.94.1.1
>                 mcastport: 5405
>                 broadcast: yes
>         }
>         interface {
>                 ringnumber: 1
>                 bindnetaddr: 192.168.1.0
>                 #mcastaddr: 226.94.1.2
>                 mcastport: 5407
>                 broadcast: yes
>         }
> }
>
> logging {
>         fileline: off
>         to_stderr: yes
>         to_syslog: yes
>         to_logfile: yes
>         logfile: /var/log/corosync/corosync.log
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>         }
> }
>
> amf {
>         mode: disabled
> }
>
> quorum {
>         provider: corosync_votequorum
>         expected_votes: 2
> }
>
> nodelist {
>         node {
>                 nodeid: 1
>                 ring0_addr: 192.168.0.1
>                 ring1_addr: 192.168.1.1
>         }
>
>         node {
>                 nodeid: 2
>                 ring0_addr: 192.168.0.2
>                 ring1_addr: 192.168.1.2
>         }
> }

My understanding so far is that the nodelist is used with the udpu
transport only. You may try it without the nodelist, or with
transport: udpu, to see if it makes a difference.
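Untested, but roughly what I mean, reusing your addresses (a sketch, not a
drop-in file - keep your logging and quorum sections as they are; mcastaddr
and broadcast are not used with udpu, and the nodeid in the totem section
becomes redundant once the nodelist carries it):

        totem {
                version: 2
                cluster_name: node
                transport: udpu
                rrp_mode: passive
                token: 5000
                token_retransmits_before_loss_const: 10
                secauth: off
                interface {
                        ringnumber: 0
                        bindnetaddr: 192.168.0.0
                        mcastport: 5405
                }
                interface {
                        ringnumber: 1
                        bindnetaddr: 192.168.1.0
                        mcastport: 5407
                }
        }

        nodelist {
                node {
                        nodeid: 1
                        ring0_addr: 192.168.0.1
                        ring1_addr: 192.168.1.1
                }
                node {
                        nodeid: 2
                        ring0_addr: 192.168.0.2
                        ring1_addr: 192.168.1.2
                }
        }

With udpu the member addresses come from the nodelist, so the per-interface
multicast/broadcast settings go away; mcastport is still honored as the UDP
port for each ring.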
> aisexec {
>         user: root
>         group: root
> }
>
> service {
>         name: pacemaker
>         ver: 1
> }
>
> - /etc/hosts
>
> 127.0.0.1       localhost
> 10.4.172.5      node1.upc.edu node1
> 10.4.172.6      node2.upc.edu node2
>
> Thank you for your help in advance!

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org