Marcelo,
Hello.
I have configured corosync with 2 nodes and added a qdevice to help with
the quorum.
On node1 I added firewall rules to block connections from node2 and the
qdevice, trying to simulate a network issue.
Just please make sure to block both incoming and outgoing packets.
Qdevice handles blocking of just one direction well (because of TCP),
and so does corosync 3.x with knet. But corosync 2.x has a big problem
with "asymmetric" blocking. Also, the config suggests that multicast is
used - please make sure to also block multicast in that case.
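To illustrate the point about symmetric blocking, a sketch of what the rules on node1 might look like. This is only an assumption about the setup: the addresses are the placeholders used elsewhere in this thread, 5403 is qnetd's default TCP port, and the multicast group is the one from the posted corosync.conf - adjust all of them to the real environment.

```shell
# Sketch only -- drop traffic to AND from node2, so the failure is symmetric:
iptables -A INPUT  -s X.X.X.3 -j DROP
iptables -A OUTPUT -d X.X.X.3 -j DROP
# Same for the qdevice host (qnetd listens on TCP 5403 by default):
iptables -A INPUT  -s qdevice.domain.com -j DROP
iptables -A OUTPUT -d qdevice.domain.com -j DROP
# And the multicast group used by totem in corosync.conf:
iptables -A INPUT  -d 239.255.43.2 -j DROP
iptables -A OUTPUT -d 239.255.43.2 -j DROP
```

Blocking only INPUT (or only OUTPUT) leaves TCP acks and multicast flowing one way, which is exactly the asymmetric case corosync 2.x handles badly.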
The problem I'm having is that on node1 I can see it dropping the
service (the IP), but node2 never takes over the IP; it is as if the
qdevice is not voting.
This is my corosync.conf:
totem {
    version: 2
    cluster_name: cluster1
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    interface {
        ringnumber: 0
        bindnetaddr: X.X.X.X
        mcastaddr: 239.255.43.2
        mcastport: 5405
        ttl: 1
    }
}
nodelist {
    node {
        ring0_addr: X.X.X.2
        name: node1.domain.com
        nodeid: 2
    }
    node {
        ring0_addr: X.X.X.3
        name: node2.domain.com
        nodeid: 3
    }
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
quorum {
    provider: corosync_votequorum
    device {
        votes: 1
        model: net
        net {
            tls: off
            host: qdevice.domain.com
            algorithm: lms
        }
        heuristics {
            mode: on
            exec_ping: /usr/bin/ping -q -c 1 "qdevice.domain.com"
        }
    }
}
I'm getting this on the qdevice host (before adding the firewall rules), so
it looks like the cluster is properly configured:
pcs qdevice status net --full
Correct. What is the status after blocking is enabled?
QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 2
Connected clusters: 1
Maximum send/receive size: 32768/32768 bytes
Cluster "cluster1":
Algorithm: LMS
Tie-breaker: Node with lowest node ID
Node ID 3:
Client address: ::ffff:X.X.X.3:59746
HB interval: 8000ms
Configured node list: 2, 3
Ring ID: 2.95d
Membership node list: 2, 3
Heuristics: Pass (membership: Pass, regular: Undefined)
TLS active: No
Vote: ACK (ACK)
Node ID 2:
Client address: ::ffff:X.X.X.2:33944
HB interval: 8000ms
Configured node list: 2, 3
Ring ID: 2.95d
Membership node list: 2, 3
Heuristics: Pass (membership: Pass, regular: Undefined)
TLS active: No
Vote: ACK (ACK)
These are partial logs on node2 after activating the firewall rules on
node1. These logs repeat all the time until I remove the firewall rules:
Mar 18 12:48:56 [7202] node2.domain.com stonith-ng: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=16): Try again (6)
Mar 18 12:48:56 [7201] node2.domain.com cib: info: crm_cs_flush:
Sent 0 CPG messages (2 remaining, last=87): Try again (6)
Mar 18 12:48:56 [7202] node2.domain.com stonith-ng: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=16): Try again (6)
Mar 18 12:48:56 [7185] node2.domain.com pacemakerd: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=13): Try again (6)
[7177] node2.domain.com corosync info [VOTEQ ] waiting for quorum device
Qdevice poll (but maximum for 30000 ms)
[7177] node2.domain.com corosync notice [TOTEM ] A new membership
(X.X.X.3:2469) was formed. Members
^^ This is weird. I'm pretty sure something is broken with how the
packets are blocked (or the log is incomplete).
[7177] node2.domain.com corosync warning [CPG ] downlist left_list: 0
received
[7177] node2.domain.com corosync warning [TOTEM ] Discarding JOIN message
during flush, nodeid=3
Mar 18 12:48:56 [7201] node2.domain.com cib: info: crm_cs_flush:
Sent 0 CPG messages (2 remaining, last=87): Try again (6)
Mar 18 12:48:56 [7202] node2.domain.com stonith-ng: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=16): Try again (6)
Mar 18 12:48:56 [7185] node2.domain.com pacemakerd: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=13): Try again (6)
Mar 18 12:48:56 [7201] node2.domain.com cib: info: crm_cs_flush:
Sent 0 CPG messages (2 remaining, last=87): Try again (6)
Mar 18 12:48:56 [7185] node2.domain.com pacemakerd: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=13): Try again (6)
If it repeats over and over again then it's 99.9% certain to be because
of the way the packets are blocked.
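One quick way to check whether the block really is symmetric is to watch the relevant traffic on node1 while the rules are active. The ports below are assumptions taken from the posted corosync.conf (5405 for totem) and qnetd's default (TCP 5403); adjust them if the real setup differs.

```shell
# Run on node1 while the firewall rules are active; any packets still
# seen in either direction mean the block is asymmetric or incomplete.
tcpdump -ni any 'udp port 5405 or tcp port 5403 or host 239.255.43.2'
```

If node1 still sees outgoing totem or qdevice packets, only one direction is being dropped, which matches the looping membership formation in the logs above.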
Also on node2:
pcs quorum status
Error: Unable to get quorum status: Unable to get node address for nodeid
2: CS_ERR_NOT_EXIST
And these are the logs on the qdevice host:
Mar 18 12:48:50 debug algo-lms: membership list from node 3 partition
(3.99d)
Mar 18 12:48:50 debug algo-util: all_ring_ids_match: seen nodeid 2
(client 0x55a99ce070d0) ring_id (2.995)
Mar 18 12:48:50 debug algo-util: nodeid 2 in our partition has different
ring_id (2.995) to us (3.99d)
Mar 18 12:48:50 debug algo-lms: nodeid 3: ring ID (3.99d) not unique in
this membership, waiting
Mar 18 12:48:50 debug Algorithm result vote is Wait for reply
Mar 18 12:48:52 debug algo-lms: Client 0x55a99cdfe590 (cluster cluster1,
node_id 3) Timer callback
Mar 18 12:48:52 debug algo-util: all_ring_ids_match: seen nodeid 2
(client 0x55a99ce070d0) ring_id (2.995)
Mar 18 12:48:52 debug algo-util: nodeid 2 in our partition has different
ring_id (2.995) to us (3.99d)
Mar 18 12:48:52 debug algo-lms: nodeid 3: ring ID (3.99d) not unique in
this membership, waiting
Mar 18 12:48:52 debug Algorithm for client ::ffff:X.X.X.3:59762 decided
to reschedule timer and not send vote with value Wait for reply
Mar 18 12:48:53 debug Client closed connection
Mar 18 12:48:53 debug Client ::ffff:X.X.X.2:33960 (init_received 1,
cluster cluster1, node_id 2) disconnect
Mar 18 12:48:53 debug algo-lms: Client 0x55a99ce070d0 (cluster cluster1,
node_id 2) disconnect
Mar 18 12:48:53 info algo-lms: server going down 0
Mar 18 12:48:54 debug algo-lms: Client 0x55a99cdfe590 (cluster cluster1,
node_id 3) Timer callback
Mar 18 12:48:54 debug algo-util: partition (3.99d) (0x55a99ce07780) has 1
nodes
Mar 18 12:48:54 debug algo-lms: Only 1 partition. This is votequorum's
problem, not ours
Mar 18 12:48:54 debug Algorithm for client ::ffff:X.X.X.3:59762 decided
to not reschedule timer and send vote with value ACK
Mar 18 12:48:54 debug Sending vote info to client ::ffff:X.X.X.3:59762
(cluster cluster1, node_id 3)
Mar 18 12:48:54 debug msg seq num = 1
Mar 18 12:48:54 debug vote = ACK
Mar 18 12:48:54 debug Client ::ffff:X.X.X.3:59762 (cluster cluster1,
node_id 3) replied back to vote info message
Mar 18 12:48:54 debug msg seq num = 1
Mar 18 12:48:54 debug algo-lms: Client 0x55a99cdfe590 (cluster cluster1,
node_id 3) replied back to vote info message
Mar 18 12:48:54 debug Client ::ffff:X.X.X.3:59762 (cluster cluster1,
node_id 3) sent membership node list.
Mar 18 12:48:54 debug msg seq num = 8
Mar 18 12:48:54 debug ring id = (3.9a1)
Mar 18 12:48:54 debug heuristics = Pass
Mar 18 12:48:54 debug node list:
Mar 18 12:48:54 debug node_id = 3, data_center_id = 0, node_state =
not set
Mar 18 12:48:54 debug
Mar 18 12:48:54 debug algo-lms: membership list from node 3 partition
(3.9a1)
Mar 18 12:48:54 debug algo-util: partition (3.99d) (0x55a99ce073f0) has 1
nodes
Mar 18 12:48:54 debug algo-lms: Only 1 partition. This is votequorum's
problem, not ours
Mar 18 12:48:54 debug Algorithm result vote is ACK
Mar 18 12:48:58 debug Client ::ffff:X.X.X.3:59762 (cluster cluster1,
node_id 3) sent membership node list.
Mar 18 12:48:58 debug msg seq num = 9
Mar 18 12:48:58 debug ring id = (3.9a5)
Mar 18 12:48:58 debug heuristics = Pass
Mar 18 12:48:58 debug node list:
Mar 18 12:48:58 debug node_id = 3, data_center_id = 0, node_state =
not set
I'm running it on CentOS7 servers and tried to follow the RH7 official
docs, but I found a few issues there, and a bug that they won't correct,
What issues have you found? Could you please report them so the doc team
can fix them?
since there is a workaround. In the end, it looks like it is working
fine, except for this voting issue.
After lots of time looking for answers on Google, I decided to send a
message here, and hopefully you can help me to fix it (it is probably a
silly mistake).
I would bet it really is the way the traffic is blocked.
Regards,
Honza
Any help will be appreciated.
Thank you.
Marcelo H. Terres <mhter...@gmail.com>
https://www.mundoopensource.com.br
https://twitter.com/mhterres
https://linkedin.com/in/marceloterres
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/