Hi!

On " First of all, there no fencing at all, it is off." Maybe the default 
configuration should involve a fencing agent that sends an SMS like this to all 
admins:

"Hey, get out of the bed and drive to work: nodeX has to be reset to continue 
working. You get this message, because you didn't configure fencing!"

šŸ˜‰ Sorry, I couldn't resist, but IMHO clusters are exactly useful to let admins 
sleep at night.

Ulrich

-----Original Message-----
From: Users <users-boun...@clusterlabs.org> On Behalf Of ale...@pavlyuts.ru
Sent: Thursday, May 2, 2024 9:57 PM
To: 'Cluster Labs - All topics related to open-source clustering welcomed' 
<users@clusterlabs.org>
Subject: [EXT] Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice 
connection disrupted.

Dear Ken, 

First of all, there is no fencing at all; it is off.

Thanks a lot for your suggestion; I probably need to think about this approach 
too. However, the project environment is not a good one to rely on fencing 
and, moreover, we can't control the bottom layer in a trusted way.

As I understand it, fence_xvm just kills the VM that is not in the quorate 
partition, or, in the two-host case, only the node that shoots first survives. 
But my goal is to keep the app from moving (i.e. restarting) for as long as 
possible. This means only two kinds of moves are acceptable: the current host 
fails (move to the other node with a restart) or an admin moves it (a managed 
move at a chosen time, with a restart). No other trouble should trigger an app 
stop/restart, except total connectivity loss where there is no second node and 
no arbiter => stop the service.

AFAIK, fencing in a two-node cluster creates a non-deterministic fence race, 
and even though it guarantees that only one node survives, it pays no 
attention to whether the app is already running on that node or not. So 
consider this situation: one node is already running the app, while the other 
loses its connection to the first but not to the fence device, wins the race 
=> kills the currently active node => the app restarts. That's exactly what I 
am trying to avoid.

Therefore, quorum-based management seems the better way for my exact case.
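
Just to illustrate what I mean (a rough sketch with pcs, not my exact 
commands; the resource name "app" is a placeholder and the syntax may differ 
by pcs version):

    # Stop resources when quorum is lost (qdevice provides the tie-break)
    pcs property set no-quorum-policy=stop
    # Keep the app where it is after a failover, until it fails or an
    # admin moves it back
    pcs resource meta app resource-stickiness=INFINITY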

Also, VM fencing relies on the idea that all VMs are inside a well-managed 
first-layer cluster with its own quorum/fencing in place, or on separate 
nodes, and that VMs are never moved between hosts without a careful fencing 
reconfiguration. In my case, I can't be sure about either point; I do not 
manage the bottom layer. The most I can do is request that each of my VMs 
(nodes, arbiter) be located on a different physical host, which may protect 
the app from a node failure and give more freedom to take nodes down for 
maintenance. Also, I have to limit the overall VM count, since multiple app 
instances (VM pairs) need to run at once with one extra VM as arbiter for all 
of them (2*N+1), rather than a 3-node cluster for each instance (3*N), which 
would be more reasonable in my opinion, but not for the people who allocate 
the resources.

Please keep in mind that all of the above comes from my common sense and 
rather poor fundamental knowledge of clustering. Please be so kind as to 
correct me if I am wrong on any point.

Sincerely,

Alex
-----Original Message-----
From: Users <users-boun...@clusterlabs.org> On Behalf Of Ken Gaillot
Sent: Thursday, May 2, 2024 5:55 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>
Subject: Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice 
connection disrupted.

I don't see fencing times in here -- fencing is absolutely essential.

With the setup you describe, I would drop qdevice. With fencing, quorum is not 
strictly required in a two-node cluster (two_node should be set in 
corosync.conf). You can set priority-fencing-delay to reduce the chance of 
simultaneous fencing. For VMs, you can use fence_xvm, which is extremely quick.
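
Roughly, that could look like this (a sketch only; node/VM names are 
placeholders, it assumes fence_virtd is already set up on the hypervisors, 
and the exact pcs syntax depends on the version):

    # corosync.conf, quorum section
    quorum {
        provider: corosync_votequorum
        two_node: 1
    }

    # Pacemaker side
    pcs property set stonith-enabled=true
    pcs property set priority-fencing-delay=15s
    # fence_xvm device for node1; "node1-vm" is the hypervisor's VM name
    pcs stonith create fence-node1 fence_xvm port=node1-vm pcmk_host_list=node1

Note that priority-fencing-delay only helps if the protected resource has a 
priority set (a priority meta-attribute on the app), so that the node 
currently running it gets a head start in the fence race.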

On Thu, 2024-05-02 at 02:56 +0300, ale...@pavlyuts.ru wrote:
> Hi All,
>  
> I am trying to build application-specific 2-node failover cluster 
> using ubuntu 22, pacemaker 2.1.2 + corosync 3.1.6 and DRBD 9.2.9, knet 
> transport.
>  
> For some reason I can't use a 3-node setup, so I have to use 
> qnetd+qdevice 3.0.1.
>  
> The main goal is to protect a custom app which is not cluster-aware by 
> itself. It is quite stateful, can't store its state outside memory and 
> takes some time to converge with the other parts of the system, so the 
> best scenario is "failover is a restart with the same config", but each 
> unnecessary restart is painful. So, once failover has happened, the app 
> must stay on the backup node until it fails or an admin pushes it back; 
> this works well with the stickiness parameter.
>  
> So, the goal is to detect the serving node's failure ASAP and restart the 
> app ASAP on the other node, using DRBD-synced config/data. ASAP means 
> within 5-7 seconds, not 30 or more.
>  
> I tried different combinations of timings and finally got an acceptable 
> result, within 5 seconds in the best case. But! The behaviour is very 
> unstable.
>  
> My setup is simple: two nodes as VMs, plus one more VM as arbiter 
> (qnetd), all VMs under Proxmox and connected via an external Ethernet 
> switch, to get closer to the real deployment where the "node VMs" will 
> be located on different physical hosts in one rack.
>  
> Then it was tuned for faster detection and failover.
> In corosync, I left the token at the default 1000ms but added 
> "heartbeat_failures_allowed: 3"; this makes corosync catch a node 
> failure in about 200ms (4 x 50ms heartbeat).
> Both qnetd and qdevice were run with net_heartbeat_interval_min=200 to 
> allow playing with faster heartbeats and detection. Also, the quorum 
> device has timeout: 500, sync_timeout: 3000 and uses the LMS algorithm.
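> 
> For reference, a minimal corosync.conf sketch of the knobs described 
> above (only the values mentioned here; the arbiter host name is a 
> placeholder, and this is not the full, verified config):
> 
>     totem {
>         token: 1000
>         heartbeat_failures_allowed: 3
>     }
>     quorum {
>         provider: corosync_votequorum
>         device {
>             model: net
>             timeout: 500
>             sync_timeout: 3000
>             net {
>                 host: arbiter
>                 algorithm: lms
>             }
>         }
>     }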
>  
> The test is to issue "date +%M:%S.%N && qm stop 201" and then check the 
> logs for the timestamp at which the app started on the "backup" host. 
> And when the stopped host boots up again, the test is to check the logs 
> to confirm the app was not restarted.
>  
> Sometimes switchover works like a charm, but sometimes it is delayed for 
> dozens of seconds.
> Sometimes when the primary host boots up again, the secondary holds 
> quorum and keeps the app running; sometimes quorum is lost first (and 
> pacemaker stops the app) and then regained, and pacemaker brings the app 
> up again, so an unwanted restart happens.
>  
> My investigation shows the difference between the "good" and "bad" 
> cases:
> 
> Good case: all the logs are clear and reasonable.
>  
> Bad case: qnetd loses the connection to the second node just after the 
> failed node's disconnect is detected, and then it may take dozens of 
> seconds to restore it. All this time qdevice keeps trying to connect to 
> qnetd and fails:
>  
> Example: host 192.168.100.1 is the one sent to failure, 100.2 is the 
> node failed over to:
>  
> From qnetd:
> May 01 23:30:39 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.1:60686 doesn't sent any message during 600ms.
> Disconnecting
> May 01 23:30:39 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.1:60686 (init_received 1, cluster bsc-test- 
> cluster, node_id 1) disconnect May 01 23:30:39 arbiter 
> corosync-qnetd[6338]: algo-lms: Client
> 0x55a6fc6785b0 (cluster bsc-test-cluster, node_id 1) disconnect
> May 01 23:30:39 arbiter corosync-qnetd[6338]: algo-lms:   server
> going down 0
> >>> This disconnect is unexpected; in the normal scenario the connection persists
> May 01 23:30:40 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:32790 doesn't sent any message during 600ms.
> Disconnecting
> May 01 23:30:40 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:32790 (init_received 1, cluster bsc-test- 
> cluster, node_id 2) disconnect May 01 23:30:40 arbiter 
> corosync-qnetd[6338]: algo-lms: Client
> 0x55a6fc6363d0 (cluster bsc-test-cluster, node_id 2) disconnect
> May 01 23:30:40 arbiter corosync-qnetd[6338]: algo-lms:   server
> going down 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]: New client connected
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   cluster name = bsc-
> test-cluster
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   tls started = 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   tls peer certificate
> verified = 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   node_id = 2
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   pointer =
> 0x55a6fc6363d0
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   addr_str =
> ::ffff:192.168.100.2:57736
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   ring id = (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   cluster dump:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     client =
> ::ffff:192.168.100.2:57736, node_id = 2 May 01 23:30:56 arbiter 
> corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent 
> initial node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 99
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     0 node_id = 1,
> data_center_id = 0, node_state = not set
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     1 node_id = 2,
> data_center_id = 0, node_state = not set May 01 23:30:56 arbiter 
> corosync-qnetd[6338]: algo-lms: cluster bsc- test-cluster config_list 
> has 2 nodes May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm 
> result vote is No change May 01 23:30:56 arbiter corosync-qnetd[6338]: 
> Client
> ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent 
> membership node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 100
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   ring id = (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   heuristics =
> Undefined
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     0 node_id = 2,
> data_center_id = 0, node_state = not set May 01 23:30:56 arbiter 
> corosync-qnetd[6338]:
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: membership 
> list from node 2 partition (2.801) May 01 23:30:56 arbiter 
> corosync-qnetd[6338]: algo-util: partition
> (2.801) (0x55a6fc67e110) has 1 nodes
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: Only 1 
> partition. This is votequorum's problem, not ours May 01 23:30:56 
> arbiter corosync-qnetd[6338]: Algorithm result vote is ACK May 01 
> 23:30:56 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent 
> quorum node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 101
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   quorate = 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     0 node_id = 1,
> data_center_id = 0, node_state = dead
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     1 node_id = 2,
> data_center_id = 0, node_state = member May 01 23:30:56 arbiter 
> corosync-qnetd[6338]:
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: quorum node 
> list from node 2 partition (2.801) May 01 23:30:56 arbiter 
> corosync-qnetd[6338]: algo-util: partition
> (2.801) (0x55a6fc697e70) has 1 nodes
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: Only 1 
> partition. This is votequorum's problem, not ours May 01 23:30:56 
> arbiter corosync-qnetd[6338]: Algorithm result vote is ACK May 01 
> 23:30:56 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent 
> quorum node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 102
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   quorate = 1
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     0 node_id = 1,
> data_center_id = 0, node_state = dead
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     1 node_id = 2,
> data_center_id = 0, node_state = member May 01 23:30:56 arbiter 
> corosync-qnetd[6338]:
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: quorum node 
> list from node 2 partition (2.801) May 01 23:30:56 arbiter 
> corosync-qnetd[6338]: algo-util: partition
> (2.801) (0x55a6fc669dc0) has 1 nodes
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: Only 1 
> partition. This is votequorum's problem, not ours May 01 23:30:56 
> arbiter corosync-qnetd[6338]: Algorithm result vote is ACK May 01 
> 23:30:58 arbiter corosync-qnetd[6338]: Client closed connection May 01 
> 23:30:58 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:57674 (init_received 0, cluster bsc-test- 
> cluster, node_id 0) disconnect
> >> At this point the resource starts on the backup host
>  
> From qdevice:
>  
> May 01 23:30:40 node2 corosync-qdevice[781]: Server closed connection 
> May 01 23:30:40 node2 corosync-qdevice[781]: algo-lms: disconnected.
> quorate = 1, WFA = 0
> May 01 23:30:40 node2 corosync-qdevice[781]: algo-lms: disconnected.
> reason = 22, WFA = 0
> May 01 23:30:40 node2 corosync-qdevice[781]: Algorithm result vote is 
> NACK May 01 23:30:40 node2 corosync-qdevice[781]: Cast vote timer 
> remains scheduled every 250ms voting NACK.
> May 01 23:30:40 node2 corosync-qdevice[781]: Sleeping for 161 ms 
> before reconnect May 01 23:30:40 node2 corosync-qdevice[781]: Trying 
> connect to qnetd server arbiter:5403 (timeout = 400ms) May 01 23:30:41 
> node2 corosync-qdevice[781]: Connect timeout May 01 23:30:41 node2 
> corosync-qdevice[781]: algo-lms: disconnected.
> quorate = 1, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected.
> reason = 27, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm result vote is 
> NACK May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer 
> remains scheduled every 250ms voting NACK.
> May 01 23:30:41 node2 corosync-qdevice[781]: Trying connect to qnetd 
> server arbiter:5403 (timeout = 400ms) May 01 23:30:41 node2 
> corosync-qdevice[781]: Votequorum nodelist notify callback:
> May 01 23:30:41 node2 corosync-qdevice[781]:   Ring_id = (2.801)
> May 01 23:30:41 node2 corosync-qdevice[781]:   Node list (size = 1):
> May 01 23:30:41 node2 corosync-qdevice[781]:     0 nodeid = 2
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm decided to 
> pause cast vote timer and result vote is No change May 01 23:30:41 
> node2 corosync-qdevice[781]: Cast vote timer is now paused.
> May 01 23:30:41 node2 corosync-qdevice[781]: worker:
> qdevice_heuristics_worker_cmd_process_exec: Received exec command with 
> seq_no "24" and timeout "1500"
> May 01 23:30:41 node2 corosync-qdevice[781]: Received heuristics exec 
> result command with seq_no "24" and result "Disabled"
> May 01 23:30:41 node2 corosync-qdevice[781]: Votequorum heuristics 
> exec result callback:
> May 01 23:30:41 node2 corosync-qdevice[781]:   seq_number = 24,
> exec_result = Disabled
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm decided to not 
> send list, result vote is No change and heuristics is Undefined May 01 
> 23:30:41 node2 corosync-qdevice[781]: Cast vote timer is no longer 
> paused.
> May 01 23:30:41 node2 corosync-qdevice[781]: Not scheduling heuristics 
> timer because mode is not enabled May 01 23:30:41 node2 
> corosync-qdevice[781]: Votequorum quorum notify
> callback:
> May 01 23:30:41 node2 corosync-qdevice[781]:   Quorate = 0
> May 01 23:30:41 node2 corosync-qdevice[781]:   Node list (size = 3):
> May 01 23:30:41 node2 corosync-qdevice[781]:     0 nodeid = 1, state
> = 2
> May 01 23:30:41 node2 corosync-qdevice[781]:     1 nodeid = 2, state
> = 1
> May 01 23:30:41 node2 corosync-qdevice[781]:     2 nodeid = 0, state
> = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: quorum_notify.
> quorate = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm decided to not 
> send list and result vote is No change May 01 23:30:41 node2 
> corosync-qdevice[781]: Connect timeout May 01 23:30:41 node2 
> corosync-qdevice[781]: algo-lms: disconnected.
> quorate = 0, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected.
> reason = 27, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm result vote is 
> NACK May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer 
> remains scheduled every 250ms voting NACK.
> May 01 23:30:41 node2 corosync-qdevice[781]: Trying connect to qnetd 
> server arbiter:5403 (timeout = 400ms)
> >>> At this point quorum is reported lost
> May 01 23:30:41 node2 corosync-qdevice[781]: Connect timeout May 01 
> 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected.
> quorate = 0, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected.
> reason = 27, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm result vote is 
> NACK May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer 
> remains scheduled every 250ms voting NACK.
>  
> >>> This failure pattern repeats 31 times
> May 01 23:30:41 node2 corosync-qdevice[781]: Trying connect to qnetd 
> server arbiter:5403 (timeout = 400ms) May 01 23:30:42 node2 
> corosync-qdevice[781]: Connect timeout May 01 23:30:42 node2 
> corosync-qdevice[781]: algo-lms: disconnected.
> quorate = 0, WFA = 0
> May 01 23:30:42 node2 corosync-qdevice[781]: algo-lms: disconnected.
> reason = 27, WFA = 0
> May 01 23:30:42 node2 corosync-qdevice[781]: Algorithm result vote is 
> NACK May 01 23:30:42 node2 corosync-qdevice[781]: Cast vote timer 
> remains scheduled every 250ms voting NACK.
> >>> End of pattern repeat, continue
>  
> May 01 23:30:56 node2 corosync-qdevice[781]: Trying connect to qnetd 
> server arbiter:5403 (timeout = 400ms) May 01 23:30:56 node2 
> corosync-qdevice[781]: Sending preinit msg to qnetd May 01 23:30:56 
> node2 corosync-qdevice[781]: Received preinit reply msg May 01 
> 23:30:56 node2 corosync-qdevice[781]: Received init reply msg May 01 
> 23:30:56 node2 corosync-qdevice[781]: Scheduling send of heartbeat 
> every 400ms May 01 23:30:56 node2 corosync-qdevice[781]: Executing 
> after-connect heuristics.
> May 01 23:30:56 node2 corosync-qdevice[781]: worker:
> qdevice_heuristics_worker_cmd_process_exec: Received exec command with 
> seq_no "25" and timeout "250"
> May 01 23:30:56 node2 corosync-qdevice[781]: Received heuristics exec 
> result command with seq_no "25" and result "Disabled"
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm decided to send 
> config node list, send membership node list, send quorum node list, 
> heuristics is Undefined and result vote is Wait for reply May 01 
> 23:30:56 node2 corosync-qdevice[781]: Sending set option seq = 98, 
> HB(0) = 0ms, KAP Tie-breaker(1) = Enabled May 01 23:30:56 node2 
> corosync-qdevice[781]: Sending config node list seq = 99
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list:
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 node_id = 1,
> data_center_id = 0, node_state = not set
> May 01 23:30:56 node2 corosync-qdevice[781]:     1 node_id = 2,
> data_center_id = 0, node_state = not set May 01 23:30:56 node2 
> corosync-qdevice[781]: Sending membership node list seq = 100, ringid 
> = (2.801), heuristics = Undefined.
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list:
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 node_id = 2,
> data_center_id = 0, node_state = not set May 01 23:30:56 node2 
> corosync-qdevice[781]: Sending quorum node list seq = 101, quorate = 0
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list:
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 node_id = 1,
> data_center_id = 0, node_state = dead
> May 01 23:30:56 node2 corosync-qdevice[781]:     1 node_id = 2,
> data_center_id = 0, node_state = member May 01 23:30:56 node2 
> corosync-qdevice[781]: Cast vote timer is now stopped.
> May 01 23:30:56 node2 corosync-qdevice[781]: Received set option reply 
> seq(1) = 98, HB(0) = 0ms, KAP Tie-breaker(1) = Enabled May 01 23:30:56 
> node2 corosync-qdevice[781]: Received initial config node list reply
> May 01 23:30:56 node2 corosync-qdevice[781]:   seq = 99
> May 01 23:30:56 node2 corosync-qdevice[781]:   vote = No change
> May 01 23:30:56 node2 corosync-qdevice[781]:   ring id = (2.801)
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is 
> No change May 01 23:30:56 node2 corosync-qdevice[781]: Received 
> membership node list reply
> May 01 23:30:56 node2 corosync-qdevice[781]:   seq = 100
> May 01 23:30:56 node2 corosync-qdevice[781]:   vote = ACK
> May 01 23:30:56 node2 corosync-qdevice[781]:   ring id = (2.801)
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is 
> ACK May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer is 
> now scheduled every 250ms voting ACK.
> May 01 23:30:56 node2 corosync-qdevice[781]: Received quorum node list 
> reply
> May 01 23:30:56 node2 corosync-qdevice[781]:   seq = 101
> May 01 23:30:56 node2 corosync-qdevice[781]:   vote = ACK
> May 01 23:30:56 node2 corosync-qdevice[781]:   ring id = (2.801)
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is 
> ACK May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer 
> remains scheduled every 250ms voting ACK.
> May 01 23:30:56 node2 corosync-qdevice[781]: Votequorum quorum notify
> callback:
> May 01 23:30:56 node2 corosync-qdevice[781]:   Quorate = 1
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list (size = 3):
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 nodeid = 1, state
> = 2
> May 01 23:30:56 node2 corosync-qdevice[781]:     1 nodeid = 2, state
> = 1
> May 01 23:30:56 node2 corosync-qdevice[781]:     2 nodeid = 0, state
> = 0
> May 01 23:30:56 node2 corosync-qdevice[781]: algo-lms: quorum_notify.
> quorate = 1
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm decided to send 
> list and result vote is No change May 01 23:30:56 node2 
> corosync-qdevice[781]: Sending quorum node list seq = 102, quorate = 1
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list:
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 node_id = 1,
> data_center_id = 0, node_state = dead
> May 01 23:30:56 node2 corosync-qdevice[781]:     1 node_id = 2,
> data_center_id = 0, node_state = member May 01 23:30:56 node2 
> corosync-qdevice[781]: Received quorum node list reply
> May 01 23:30:56 node2 corosync-qdevice[781]:   seq = 102
> May 01 23:30:56 node2 corosync-qdevice[781]:   vote = ACK
> May 01 23:30:56 node2 corosync-qdevice[781]:   ring id = (2.801)
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is 
> ACK May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer 
> remains scheduled every 250ms voting ACK.
> >>> Here everything becomes OK and the resource is started on node2
>  
> Also, I've done a Wireshark capture and found a great mess in TCP; it 
> seems like the connection between qdevice and qnetd really stops for 
> some time and packets are not delivered.
>  
> My guess is that it coincides with corosync's syncing activity, and I 
> suspect that corosync blocks any other traffic on the interface it uses 
> for its rings.
>  
> When I switch qnetd and qdevice to use a different interface, it seems 
> to work fine.
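> 
> For reference, a rough sketch of the kind of change I mean (the 
> 10.20.0.3 address is just a placeholder, not my real config): point the 
> quorum device at the arbiter's address on the separate network, e.g.
> 
>     quorum {
>         device {
>             model: net
>             net {
>                 host: 10.20.0.3    # arbiter's address on the dedicated network
>                 algorithm: lms
>             }
>         }
>     }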
>  
> So, the question is: does corosync really temporarily block any other 
> traffic on the interface it uses? Or is it just a coincidence? If it 
> does block, is there a way to manage it?
>  
> Thank you for any suggestions on that!
>  
> Sincerely,
>  
> Alex
>  
>  
--
Ken Gaillot <kgail...@redhat.com>


_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
