Marcelo,

Hello.

I have configured corosync with 2 nodes and added a qdevice to help with
the quorum.

On node1 I added firewall rules to block connections from node2 and the
qdevice, trying to simulate a network issue.

Just please make sure to block both incoming and outgoing packets. Qdevice handles blocking of just one direction well (because of TCP), and so does corosync 3.x with knet, but corosync 2.x has a big problem with "asymmetric" blocking. Also, the config suggests that multicast is used - please make sure to block the multicast traffic as well in that case.
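Something along these lines on node1 should give a clean split (the addresses, multicast group and ports are taken from the config below; <qdevice-ip> is a placeholder for whatever qdevice.domain.com resolves to):

  # block corosync traffic to/from node2 in both directions
  iptables -A INPUT  -s X.X.X.3 -j DROP
  iptables -A OUTPUT -d X.X.X.3 -j DROP
  # block the qnetd connection (TCP port 5403) in both directions
  iptables -A INPUT  -s <qdevice-ip> -j DROP
  iptables -A OUTPUT -d <qdevice-ip> -j DROP
  # block the multicast totem traffic (mcastaddr from totem {})
  iptables -A INPUT  -d 239.255.43.2 -j DROP
  iptables -A OUTPUT -d 239.255.43.2 -j DROP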


The problem I'm having is that on node1 I can see it dropping the
service (the IP), but node2 never takes over the IP; it is as if the qdevice
is not voting.

This is my corosync.conf:

totem {
        version: 2
        cluster_name: cluster1
        token: 3000
        token_retransmits_before_loss_const: 10
        clear_node_high_bit: yes
        crypto_cipher: none
        crypto_hash: none

        interface {
                ringnumber: 0
                bindnetaddr: X.X.X.X
                mcastaddr: 239.255.43.2
                mcastport: 5405
                ttl: 1
        }
}

nodelist {
        node {
                ring0_addr: X.X.X.2
                name: node1.domain.com
                nodeid: 2
        }

        node {
                ring0_addr: X.X.X.3
                name: node2.domain.com
                nodeid: 3
        }
}

logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        to_syslog: yes
}

quorum {
   provider: corosync_votequorum
   device {
     votes: 1
     model: net
     net {
       tls: off
       host: qdevice.domain.com
       algorithm: lms
     }
     heuristics {
       mode: on
       exec_ping: /usr/bin/ping -q -c 1 "qdevice.domain.com"
     }
   }
}


I'm getting this on the qdevice host (before adding the firewall rules), so
it looks like the cluster is properly configured:

pcs qdevice status net --full


Correct. What is the status after blocking is enabled?

QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 2
Connected clusters: 1
Maximum send/receive size: 32768/32768 bytes
Cluster "cluster1":
     Algorithm: LMS
     Tie-breaker: Node with lowest node ID
     Node ID 3:
         Client address: ::ffff:X.X.X.3:59746
         HB interval: 8000ms
         Configured node list: 2, 3
         Ring ID: 2.95d
         Membership node list: 2, 3
         Heuristics: Pass (membership: Pass, regular: Undefined)
         TLS active: No
         Vote: ACK (ACK)
     Node ID 2:
         Client address: ::ffff:X.X.X.2:33944
         HB interval: 8000ms
         Configured node list: 2, 3
         Ring ID: 2.95d
         Membership node list: 2, 3
         Heuristics: Pass (membership: Pass, regular: Undefined)
         TLS active: No
         Vote: ACK (ACK)
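
For completeness, this is roughly what I'd run to capture that state once the rules are active (corosync-quorumtool is just a more direct view of votequorum than pcs):

  # on the qnetd host, with the firewall rules active on node1
  pcs qdevice status net --full

  # on node2
  pcs quorum status
  corosync-quorumtool -s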

These are partial logs on node2 after activating the firewall rules on
node1. These logs repeat all the time until I remove the firewall rules:

Mar 18 12:48:56 [7202] node2.domain.com stonith-ng:     info: crm_cs_flush:
Sent 0 CPG messages  (1 remaining, last=16): Try again (6)
Mar 18 12:48:56 [7201] node2.domain.com        cib:     info: crm_cs_flush:
Sent 0 CPG messages  (2 remaining, last=87): Try again (6)
Mar 18 12:48:56 [7202] node2.domain.com stonith-ng:     info: crm_cs_flush:
Sent 0 CPG messages  (1 remaining, last=16): Try again (6)
Mar 18 12:48:56 [7185] node2.domain.com pacemakerd:     info: crm_cs_flush:
Sent 0 CPG messages  (1 remaining, last=13): Try again (6)
[7177] node2.domain.com corosync info    [VOTEQ ] waiting for quorum device
Qdevice poll (but maximum for 30000 ms)
[7177] node2.domain.com corosync notice  [TOTEM ] A new membership
(X.X.X.3:2469) was formed. Members

^^ This is weird. I'm pretty sure something is broken in the way the packets are blocked (or the log is incomplete).

[7177] node2.domain.com corosync warning [CPG   ] downlist left_list: 0
received
[7177] node2.domain.com corosync warning [TOTEM ] Discarding JOIN message
during flush, nodeid=3
Mar 18 12:48:56 [7201] node2.domain.com        cib:     info: crm_cs_flush:
Sent 0 CPG messages  (2 remaining, last=87): Try again (6)
Mar 18 12:48:56 [7202] node2.domain.com stonith-ng:     info: crm_cs_flush:
Sent 0 CPG messages  (1 remaining, last=16): Try again (6)
Mar 18 12:48:56 [7185] node2.domain.com pacemakerd:     info: crm_cs_flush:
Sent 0 CPG messages  (1 remaining, last=13): Try again (6)
Mar 18 12:48:56 [7201] node2.domain.com        cib:     info: crm_cs_flush:
Sent 0 CPG messages  (2 remaining, last=87): Try again (6)
Mar 18 12:48:56 [7185] node2.domain.com pacemakerd:     info: crm_cs_flush:
Sent 0 CPG messages  (1 remaining, last=13): Try again (6)

If it repeats over and over again then it's 99.9% because of the way the packets are blocked.
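A quick way to double-check is to look at the packet counters on the DROP rules and sniff the interface while the block is active, roughly like this (eth0 is an assumption - use whatever interface carries the cluster traffic):

  # on node1: counters on the DROP rules should keep increasing
  iptables -L INPUT -n -v
  iptables -L OUTPUT -n -v
  # nothing should still be seen for node2, the qnetd port or the multicast group
  tcpdump -ni eth0 'host X.X.X.3 or host 239.255.43.2 or port 5403'

If tcpdump still shows traffic in one direction only, the blocking is asymmetric, and that is exactly the case corosync 2.x cannot handle.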


Also on node2:

pcs quorum status
Error: Unable to get quorum status: Unable to get node address for nodeid
2: CS_ERR_NOT_EXIST

And these are the logs on the qdevice host:

Mar 18 12:48:50 debug   algo-lms: membership list from node 3 partition
(3.99d)
Mar 18 12:48:50 debug   algo-util: all_ring_ids_match: seen nodeid 2
(client 0x55a99ce070d0) ring_id (2.995)
Mar 18 12:48:50 debug   algo-util: nodeid 2 in our partition has different
ring_id (2.995) to us (3.99d)
Mar 18 12:48:50 debug   algo-lms: nodeid 3: ring ID (3.99d) not unique in
this membership, waiting
Mar 18 12:48:50 debug   Algorithm result vote is Wait for reply
Mar 18 12:48:52 debug   algo-lms: Client 0x55a99cdfe590 (cluster cluster1,
node_id 3) Timer callback
Mar 18 12:48:52 debug   algo-util: all_ring_ids_match: seen nodeid 2
(client 0x55a99ce070d0) ring_id (2.995)
Mar 18 12:48:52 debug   algo-util: nodeid 2 in our partition has different
ring_id (2.995) to us (3.99d)
Mar 18 12:48:52 debug   algo-lms: nodeid 3: ring ID (3.99d) not unique in
this membership, waiting
Mar 18 12:48:52 debug   Algorithm for client ::ffff:X.X.X.3:59762 decided
to reschedule timer and not send vote with value Wait for reply
Mar 18 12:48:53 debug   Client closed connection
Mar 18 12:48:53 debug   Client ::ffff:X.X.X.2:33960 (init_received 1,
cluster cluster1, node_id 2) disconnect
Mar 18 12:48:53 debug   algo-lms: Client 0x55a99ce070d0 (cluster cluster1,
node_id 2) disconnect
Mar 18 12:48:53 info    algo-lms:   server going down 0
Mar 18 12:48:54 debug   algo-lms: Client 0x55a99cdfe590 (cluster cluster1,
node_id 3) Timer callback
Mar 18 12:48:54 debug   algo-util: partition (3.99d) (0x55a99ce07780) has 1
nodes
Mar 18 12:48:54 debug   algo-lms: Only 1 partition. This is votequorum's
problem, not ours
Mar 18 12:48:54 debug   Algorithm for client ::ffff:X.X.X.3:59762 decided
to not reschedule timer and send vote with value ACK
Mar 18 12:48:54 debug   Sending vote info to client ::ffff:X.X.X.3:59762
(cluster cluster1, node_id 3)
Mar 18 12:48:54 debug     msg seq num = 1
Mar 18 12:48:54 debug     vote = ACK
Mar 18 12:48:54 debug   Client ::ffff:X.X.X.3:59762 (cluster cluster1,
node_id 3) replied back to vote info message
Mar 18 12:48:54 debug     msg seq num = 1
Mar 18 12:48:54 debug   algo-lms: Client 0x55a99cdfe590 (cluster cluster1,
node_id 3) replied back to vote info message
Mar 18 12:48:54 debug   Client ::ffff:X.X.X.3:59762 (cluster cluster1,
node_id 3) sent membership node list.
Mar 18 12:48:54 debug     msg seq num = 8
Mar 18 12:48:54 debug     ring id = (3.9a1)
Mar 18 12:48:54 debug     heuristics = Pass
Mar 18 12:48:54 debug     node list:
Mar 18 12:48:54 debug       node_id = 3, data_center_id = 0, node_state =
not set
Mar 18 12:48:54 debug
Mar 18 12:48:54 debug   algo-lms: membership list from node 3 partition
(3.9a1)
Mar 18 12:48:54 debug   algo-util: partition (3.99d) (0x55a99ce073f0) has 1
nodes
Mar 18 12:48:54 debug   algo-lms: Only 1 partition. This is votequorum's
problem, not ours
Mar 18 12:48:54 debug   Algorithm result vote is ACK
Mar 18 12:48:58 debug   Client ::ffff:X.X.X.3:59762 (cluster cluster1,
node_id 3) sent membership node list.
Mar 18 12:48:58 debug     msg seq num = 9
Mar 18 12:48:58 debug     ring id = (3.9a5)
Mar 18 12:48:58 debug     heuristics = Pass
Mar 18 12:48:58 debug     node list:
Mar 18 12:48:58 debug       node_id = 3, data_center_id = 0, node_state =
not set


I'm running it on CentOS 7 servers and tried to follow the RH7 official
docs, but I found a few issues there, and a bug that they won't correct,
What issues have you found? Could you please report them so the doc team can fix them?

since there is a workaround. In the end, it looks like it is working fine,
except for this voting issue.

After lots of time looking for answers on Google, I decided to send a
message here, and hopefully you can help me to fix it (it is probably a
silly mistake).

I would bet it's really the way the traffic is blocked.

Regards,
  Honza


Any help will be appreciated.

Thank you.

Marcelo H. Terres <mhter...@gmail.com>
https://www.mundoopensource.com.br
https://twitter.com/mhterres
https://linkedin.com/in/marceloterres



_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

