Hello. I have configured corosync with two nodes and added a qdevice to help maintain quorum.
On node1 I added firewall rules to block connections from node2 and the qdevice, to simulate a network failure. The problem I'm having: on node1 I can see it dropping the service (the IP), but node2 never takes over the IP; it is as if the qdevice is not voting.

This is my corosync.conf:

    totem {
        version: 2
        cluster_name: cluster1
        token: 3000
        token_retransmits_before_loss_const: 10
        clear_node_high_bit: yes
        crypto_cipher: none
        crypto_hash: none
    }

    interface {
        ringnumber: 0
        bindnetaddr: X.X.X.X
        mcastaddr: 239.255.43.2
        mcastport: 5405
        ttl: 1
    }

    nodelist {
        node {
            ring0_addr: X.X.X.2
            name: node1.domain.com
            nodeid: 2
        }
        node {
            ring0_addr: X.X.X.3
            name: node2.domain.com
            nodeid: 3
        }
    }

    logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        to_syslog: yes
    }
    #}

    quorum {
        provider: corosync_votequorum
        device {
            votes: 1
            model: net
            net {
                tls: off
                host: qdevice.domain.com
                algorithm: lms
            }
            heuristics {
                mode: on
                exec_ping: /usr/bin/ping -q -c 1 "qdevice.domain.com"
            }
        }
    }

This is what I get on the qdevice host (before adding the firewall rules), so the cluster looks properly configured:

    pcs qdevice status net --full

    QNetd address:                  *:5403
    TLS:                            Supported (client certificate required)
    Connected clients:              2
    Connected clusters:             1
    Maximum send/receive size:      32768/32768 bytes
    Cluster "cluster1":
        Algorithm:          LMS
        Tie-breaker:        Node with lowest node ID
        Node ID 3:
            Client address:         ::ffff:X.X.X.3:59746
            HB interval:            8000ms
            Configured node list:   2, 3
            Ring ID:                2.95d
            Membership node list:   2, 3
            Heuristics:             Pass (membership: Pass, regular: Undefined)
            TLS active:             No
            Vote:                   ACK (ACK)
        Node ID 2:
            Client address:         ::ffff:X.X.X.2:33944
            HB interval:            8000ms
            Configured node list:   2, 3
            Ring ID:                2.95d
            Membership node list:   2, 3
            Heuristics:             Pass (membership: Pass, regular: Undefined)
            TLS active:             No
            Vote:                   ACK (ACK)

These are partial logs on node2 after activating the firewall rules on node1.
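For context, this is the votequorum arithmetic I expect for this setup (two 1-vote nodes plus the qdevice's 1 vote, as configured above). Just a back-of-the-envelope sketch of the standard majority rule, not actual cluster output:

```shell
#!/bin/sh
# Vote arithmetic for this cluster: two 1-vote nodes plus a 1-vote qdevice.
node_votes=2
qdevice_votes=1
expected_votes=$((node_votes + qdevice_votes))   # 3
quorum=$((expected_votes / 2 + 1))               # majority of 3 = 2

echo "expected_votes=$expected_votes quorum=$quorum"

# If the qdevice ACKs the surviving node, that node holds 1 + 1 = 2 votes,
# which meets quorum, so node2 should be able to take over the IP.
surviving=$((1 + qdevice_votes))
echo "surviving_node_votes=$surviving"
```

So with the qdevice voting, the surviving node should stay quorate; that is exactly what is not happening here.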
These log entries repeat continuously until I remove the firewall rules:

    Mar 18 12:48:56 [7202] node2.domain.com stonith-ng: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=16): Try again (6)
    Mar 18 12:48:56 [7201] node2.domain.com cib:        info: crm_cs_flush: Sent 0 CPG messages (2 remaining, last=87): Try again (6)
    Mar 18 12:48:56 [7202] node2.domain.com stonith-ng: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=16): Try again (6)
    Mar 18 12:48:56 [7185] node2.domain.com pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=13): Try again (6)
    [7177] node2.domain.com corosync info    [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
    [7177] node2.domain.com corosync notice  [TOTEM ] A new membership (X.X.X.3:2469) was formed. Members
    [7177] node2.domain.com corosync warning [CPG   ] downlist left_list: 0 received
    [7177] node2.domain.com corosync warning [TOTEM ] Discarding JOIN message during flush, nodeid=3
    Mar 18 12:48:56 [7201] node2.domain.com cib:        info: crm_cs_flush: Sent 0 CPG messages (2 remaining, last=87): Try again (6)
    Mar 18 12:48:56 [7202] node2.domain.com stonith-ng: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=16): Try again (6)
    Mar 18 12:48:56 [7185] node2.domain.com pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=13): Try again (6)
    Mar 18 12:48:56 [7201] node2.domain.com cib:        info: crm_cs_flush: Sent 0 CPG messages (2 remaining, last=87): Try again (6)
    Mar 18 12:48:56 [7185] node2.domain.com pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=13): Try again (6)

Also on node2:

    pcs quorum status
    Error: Unable to get quorum status: Unable to get node address for nodeid 2: CS_ERR_NOT_EXIST

And these are the logs on the qdevice host:

    Mar 18 12:48:50 debug   algo-lms: membership list from node 3 partition (3.99d)
    Mar 18 12:48:50 debug   algo-util: all_ring_ids_match: seen nodeid 2 (client 0x55a99ce070d0) ring_id (2.995)
    Mar 18 12:48:50 debug   algo-util: nodeid 2 in our partition has different ring_id (2.995) to us (3.99d)
    Mar 18 12:48:50 debug   algo-lms: nodeid 3: ring ID (3.99d) not unique in this membership, waiting
    Mar 18 12:48:50 debug   Algorithm result vote is Wait for reply
    Mar 18 12:48:52 debug   algo-lms: Client 0x55a99cdfe590 (cluster cluster1, node_id 3) Timer callback
    Mar 18 12:48:52 debug   algo-util: all_ring_ids_match: seen nodeid 2 (client 0x55a99ce070d0) ring_id (2.995)
    Mar 18 12:48:52 debug   algo-util: nodeid 2 in our partition has different ring_id (2.995) to us (3.99d)
    Mar 18 12:48:52 debug   algo-lms: nodeid 3: ring ID (3.99d) not unique in this membership, waiting
    Mar 18 12:48:52 debug   Algorithm for client ::ffff:X.X.X.3:59762 decided to reschedule timer and not send vote with value Wait for reply
    Mar 18 12:48:53 debug   Client closed connection
    Mar 18 12:48:53 debug   Client ::ffff:X.X.X.2:33960 (init_received 1, cluster cluster1, node_id 2) disconnect
    Mar 18 12:48:53 debug   algo-lms: Client 0x55a99ce070d0 (cluster cluster1, node_id 2) disconnect
    Mar 18 12:48:53 info    algo-lms: server going down 0
    Mar 18 12:48:54 debug   algo-lms: Client 0x55a99cdfe590 (cluster cluster1, node_id 3) Timer callback
    Mar 18 12:48:54 debug   algo-util: partition (3.99d) (0x55a99ce07780) has 1 nodes
    Mar 18 12:48:54 debug   algo-lms: Only 1 partition. This is votequorum's problem, not ours
    Mar 18 12:48:54 debug   Algorithm for client ::ffff:X.X.X.3:59762 decided to not reschedule timer and send vote with value ACK
    Mar 18 12:48:54 debug   Sending vote info to client ::ffff:X.X.X.3:59762 (cluster cluster1, node_id 3)
    Mar 18 12:48:54 debug     msg seq num = 1
    Mar 18 12:48:54 debug     vote = ACK
    Mar 18 12:48:54 debug   Client ::ffff:X.X.X.3:59762 (cluster cluster1, node_id 3) replied back to vote info message
    Mar 18 12:48:54 debug     msg seq num = 1
    Mar 18 12:48:54 debug   algo-lms: Client 0x55a99cdfe590 (cluster cluster1, node_id 3) replied back to vote info message
    Mar 18 12:48:54 debug   Client ::ffff:X.X.X.3:59762 (cluster cluster1, node_id 3) sent membership node list.
    Mar 18 12:48:54 debug     msg seq num = 8
    Mar 18 12:48:54 debug     ring id = (3.9a1)
    Mar 18 12:48:54 debug     heuristics = Pass
    Mar 18 12:48:54 debug     node list:
    Mar 18 12:48:54 debug       node_id = 3, data_center_id = 0, node_state = not set
    Mar 18 12:48:54 debug
    Mar 18 12:48:54 debug   algo-lms: membership list from node 3 partition (3.9a1)
    Mar 18 12:48:54 debug   algo-util: partition (3.99d) (0x55a99ce073f0) has 1 nodes
    Mar 18 12:48:54 debug   algo-lms: Only 1 partition. This is votequorum's problem, not ours
    Mar 18 12:48:54 debug   Algorithm result vote is ACK
    Mar 18 12:48:58 debug   Client ::ffff:X.X.X.3:59762 (cluster cluster1, node_id 3) sent membership node list.
    Mar 18 12:48:58 debug     msg seq num = 9
    Mar 18 12:48:58 debug     ring id = (3.9a5)
    Mar 18 12:48:58 debug     heuristics = Pass
    Mar 18 12:48:58 debug     node list:
    Mar 18 12:48:58 debug       node_id = 3, data_center_id = 0, node_state = not set

I'm running this on CentOS 7 servers and tried to follow the official RHEL 7 docs, but I found a few issues there, plus a bug that they won't fix since there is a workaround. In the end everything seems to work fine except for this voting issue. After a lot of time searching for answers, I decided to send a message here in the hope that you can help me fix it (it is probably a silly mistake).
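For what it's worth, my reading of the qnetd log is that LMS keeps answering "Wait for reply" because it refuses to vote while connected clients in the same reported membership disagree on the ring ID (node_id 2 still shows ring 2.995 while node_id 3 reports 3.99d). A toy restatement of that check, with the values copied from the log; this is not qnetd's actual code:

```shell
#!/bin/sh
# Ring IDs last reported to qnetd by each client, per the log above.
# Note: node_id 2 is node1.domain.com and node_id 3 is node2.domain.com.
ring_nodeid2="2.995"   # node1's connection lingers until its disconnect at 12:48:53
ring_nodeid3="3.99d"   # node2's view after the new single-node membership formed

# LMS-style check: only cast a vote once every client claiming to be in the
# same partition agrees on a single ring ID; otherwise keep waiting.
if [ "$ring_nodeid2" = "$ring_nodeid3" ]; then
    vote="ACK"
else
    vote="Wait for reply"
fi
echo "vote=$vote"
```

That would also explain why the ACK only arrives after node1's stale connection is dropped at 12:48:53: with a single remaining client the check passes trivially.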
Any help will be appreciated. Thank you.

Marcelo H. Terres <mhter...@gmail.com>
https://www.mundoopensource.com.br
https://twitter.com/mhterres
https://linkedin.com/in/marceloterres
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/