Re: [PVE-User] critical HA problem on a PVE6 cluster

Eneko Lacunza Mon, 11 May 2020 01:41:09 -0700

Hi Hervé,

This looks like a network issue. What is the network setup in this cluster? What does syslog show for corosync and pve-cluster?


Don't enable HA until you have a stable cluster quorum.
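
As a rough illustration of what "stable quorum" means for a cluster of this size, here is a minimal sketch. It assumes the usual majority rule with one vote per node; the function names are hypothetical and are not part of any Proxmox or corosync API.

```python
def votes_needed_for_quorum(total_votes: int) -> int:
    """Strict majority: more than half of all expected votes."""
    return total_votes // 2 + 1


def is_quorate(votes_present: int, total_votes: int) -> bool:
    """True if the visible nodes hold a strict majority of the votes."""
    return votes_present >= votes_needed_for_quorum(total_votes)


if __name__ == "__main__":
    total = 5  # assumption: 5 nodes, one vote each
    print("votes needed:", votes_needed_for_quorum(total))
    for present in range(total + 1):
        print(present, "votes visible ->", is_quorate(present, total))
```

Under these assumptions, a 5-node cluster needs 3 nodes to see each other; if the network keeps splitting membership below that, HA cannot work reliably.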

Cheers
Eneko

On 11/5/20 at 10:35, Herve Ballans wrote:

Hi everybody,
I would like to take the opportunity at the beginning of this new week to raise my issue again.
Does anyone have any idea why such a problem occurred, or is this problem really something new?
Thanks again,
Hervé

On 07/05/2020 18:28, Herve Ballans wrote:
Hi all,

*Cluster info:*

 * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
 * Ceph rbd storage (Nautilus)
 * In production for many years with no major issues
 * No specific network problems at the time the problem occurred
 * Node clocks are in sync (all configured with the same NTP server)

*Symptoms:*
Suddenly, last night (around 7 PM), all nodes of our cluster seem to have rebooted at the same time for no apparent reason (we weren't doing anything on it)! During the reboot, the services "Corosync Cluster Engine" and "Proxmox VE replication runner" failed. After the nodes rebooted, we had to start those services manually.

Once rebooted with all pve services running, some nodes were in HA lrm status "old timestamp - dead?" while others were in "active" or "wait_for_agent_lock" status. Nodes switch states regularly, and it loops back and forth as long as we don't change the configuration.

At the same time, the pve-ha-crm service got unexpected errors, for example: "Configuration file 'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even though the file exists, but on another node! Such a message is probably a consequence of the fencing between nodes due to the change of status.

*What we have tried until now to stabilize the situation:*
After several investigations and several operations that failed to solve anything (in particular a complete upgrade to the latest PVE version, 6.1-11), we finally removed the HA configuration from all the VMs.
Since then, the state seems to have stabilized although, obviously, it is not nominal!
Now all the nodes are in HA lrm status "idle" and sometimes switch to the "old timestamp - dead?" state, then come back to "idle".
None of them are in the "active" state.
Obviously, quorum status is "no quorum".
Note that as soon as we try to re-enable HA on the VMs, the problem occurs again (nodes reboot!) :(

*Question:*
Have you ever experienced such a problem, or do you know a way to restore a correct HA configuration in this case?
Note that the nodes are currently on PVE version 6.1-11.

I can post specific logs if useful.

Thanks in advance for your help,
Hervé

_______________________________________________
pve-user mailing list
[email protected]
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user



--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
