Martin Schlegel wrote:
Hello everyone,

we have been running a 3-node Pacemaker (1.1.14) / Corosync (2.3.5) cluster
successfully for a couple of months, and we have started seeing a faulty ring
with an unexpected 127.0.0.1 binding that we cannot reset via "corosync-cfgtool -r".

This is the problem. A bind to 127.0.0.1 means an ifdown happened on that interface, and with RRP that means a BIG problem.
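A quick way to confirm (a hypothetical check; D.E.F2.x stands for whatever pg2's ring 1 address should be):

root@pg2:~# ip -4 addr show
# if the ring 1 address (D.E.F2.x) is no longer present on its interface,
# an ifdown happened and corosync has rebound to 127.0.0.1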


We have had this happen once before, and only restarting Corosync (and everything
else) on the node showing the unexpected 127.0.0.1 binding made the problem go away.
However, in production we obviously would like to avoid this if possible.

Just don't do ifdown. Never. If you are using NetworkManager (which does an ifdown by default when a cable is disconnected), use something like the NetworkManager-config-server package. It is just a configuration change, so you can adapt it to whatever distribution you are using.
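For reference, a minimal sketch of the kind of drop-in that package provides (the file path and keys are from memory and may vary by distribution; ignore-carrier needs a reasonably recent NetworkManager):

/etc/NetworkManager/conf.d/00-server.conf:
[main]
# do not auto-create "default" connections for unconfigured interfaces
no-auto-default=*
# keep interfaces and their addresses configured when carrier is lost
ignore-carrier=*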

Regards,
  Honza


So, given the following description: how can I troubleshoot this issue, and/or
does anybody have a good idea what might be happening here?

We run two passive RRP rings across different IP subnets via udpu, and we get the
following output (all IPs obfuscated). Please note the unexpected interface
binding 127.0.0.1 for host pg2.

If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1 briefly
shows "no faults" but goes back to "FAULTY" seconds later.

Regards,
Martin Schlegel
_____________________________________

root@pg1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
         id      = A.B.C1.5
         status  = ring 0 active with no faults
RING ID 1
         id      = D.E.F1.170
         status  = Marking ringid 1 interface D.E.F1.170 FAULTY

root@pg2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
         id      = A.B.C2.88
         status  = ring 0 active with no faults
RING ID 1
         id      = 127.0.0.1
         status  = Marking ringid 1 interface 127.0.0.1 FAULTY

root@pg3:~# corosync-cfgtool -s
Printing ring status.
Local node ID 3
RING ID 0
         id      = A.B.C3.236
         status  = ring 0 active with no faults
RING ID 1
         id      = D.E.F3.112
         status  = Marking ringid 1 interface D.E.F3.112 FAULTY


_____________________________________


/etc/corosync/corosync.conf from pg1 (other nodes use different subnets and
IPs, but are otherwise identical):
===========================================
quorum {
     provider: corosync_votequorum
     expected_votes: 3
}

totem {
         version: 2

         crypto_cipher: none
         crypto_hash: none

         rrp_mode: passive
         interface {
                 ringnumber: 0
                 bindnetaddr: A.B.C1.0
                 mcastport: 5405
                 ttl: 1
         }
         interface {
                 ringnumber: 1
                 bindnetaddr: D.E.F1.64
                 mcastport: 5405
                 ttl: 1
         }
         transport: udpu
}

nodelist {
         node {
                 ring0_addr: pg1
                 ring1_addr: pg1p
                 nodeid: 1
         }
         node {
                 ring0_addr: pg2
                 ring1_addr: pg2p
                 nodeid: 2
         }
         node {
                 ring0_addr: pg3
                 ring1_addr: pg3p
                 nodeid: 3
         }
}

logging {
     to_syslog: yes
}

===========================================

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


