GitHub user btzq added a comment to the discussion: HA not working in
cloudstack 4.19
Hey @Aashiqps, we are facing the same situation as well. Great to have found
another Linstor user running a disaggregated architecture.
At this point, we have managed to get Volume Snapshots working fine after the
latest fixes from the LINBIT side.
But we still have the failover issues, where we can't seem to get all the
Virtual Routers to start up successfully during a node failure (we simulated
one by pulling the power from the server). If a Virtual Router can't start up,
then the VMs in that network can't continue to start up either.
We also tested this using NFS, just to isolate the issue, and VM HA using NFS
works just fine.
When we go through the logs, we can't seem to identify what the problem is.
Our latest finding is that, according to the logs, the VR was successfully
migrated to the new host and its status transitioned to Running. However, the
ACS HighAvailabilityManager triggered a stop/reboot action on the router and
did not take any action to start it afterward.
```
2024-07-16 17:31:18,853 DEBUG [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-137:ctx-b555a5f3 job-385919/job-386245) (logid:1dc1e938) Run VM work job: com.cloud.vm.VmWorkStop for VM 54572, job origin: 385919
2024-07-16 17:31:18,854 DEBUG [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-137:ctx-b555a5f3 job-385919/job-386245 ctx-10b406cf) (logid:1dc1e938) Execute VM work job: com.cloud.vm.VmWorkStop{"cleanup":true,"userId":1,"accountId":1,"vmId":54572,"handlerName":"VirtualMachineManagerImpl"}
2024-07-16 17:31:18,867 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-137:ctx-b555a5f3 job-385919/job-386245 ctx-10b406cf) (logid:1dc1e938) VM instance {"id":54572,"instanceName":"r-54572-VM","type":"DomainRouter","uuid":"a0022aec-996e-490d-8a21-3eccd43c9e0b"} state transited from [Running] to [Stopping] with event [StopRequested]. VM's original host: Host {"id":129,"name":"n2ncloudmy1cp02","type":"Routing","uuid":"3bf16d9d-e561-4e59-b855-7256bee35c6f"}, new host: Host {"id":129,"name":"n2ncloudmy1cp02","type":"Routing","uuid":"3bf16d9d-e561-4e59-b855-7256bee35c6f"}, host before state transition: Host {"id":129,"name":"n2ncloudmy1cp02","type":"Routing","uuid":"3bf16d9d-e561-4e59-b855-7256bee35c6f"}
2024-07-16 17:31:37,947 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-42:ctx-cfc0e084 work-103139) (logid:fe9b0f9a) VM VM instance {"id":54572,"instanceName":"r-54572-VM","type":"DomainRouter","uuid":"a0022aec-996e-490d-8a21-3eccd43c9e0b"} is now no longer on host 129
2024-07-16 17:31:37,947 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-42:ctx-cfc0e084 work-103139) (logid:fe9b0f9a) Completed work HAWork[103139-HA-54572-Stopped-Investigating]. Took 1/10 attempts.
```
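One way we sanity-checked this was to pull out which `VmWork*` jobs the management server actually scheduled for the router, to confirm a `VmWorkStop` was dispatched with no matching `VmWorkStart` afterward. A minimal sketch (the helper name is mine, and the embedded log lines are abbreviated copies of the excerpt above):

```python
import re

# Two lines copied (abbreviated) from the management-server log excerpt above.
LOG = """\
2024-07-16 17:31:18,853 DEBUG [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-137:ctx-b555a5f3 job-385919/job-386245) (logid:1dc1e938) Run VM work job: com.cloud.vm.VmWorkStop for VM 54572, job origin: 385919
2024-07-16 17:31:37,947 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-42:ctx-cfc0e084 work-103139) (logid:fe9b0f9a) Completed work HAWork[103139-HA-54572-Stopped-Investigating]. Took 1/10 attempts.
"""

# Match the VM work-job class (VmWorkStop, VmWorkStart, ...) on each line, so
# we can list which lifecycle actions were actually scheduled for the VM.
WORK_RE = re.compile(r"com\.cloud\.vm\.(VmWork\w+)")

def scheduled_actions(log: str) -> list[str]:
    return [m.group(1) for line in log.splitlines() if (m := WORK_RE.search(line))]

actions = scheduled_actions(LOG)
print(actions)                   # → ['VmWorkStop']: only a stop was scheduled
print("VmWorkStart" in actions)  # → False: no start ever follows the stop
```

Running the same scan over the full log for `vmId 54572` shows the stop was the last work job the HA manager queued for the router.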
But there are a few things to take note of:
- There's no need to use IPMI OOB management. In fact, users are asked not to. This is
because only NFS/iSCSI storage is susceptible to split-brain, but with LINSTOR
the underlying technology is apparently different, which is why split-brain will not occur.
- The CloudStack Agent needs to be updated so it does not restart the server.
- The HA strategy in CloudStack + LINSTOR is to rely solely on VM HA (not Host HA).
More info here:
https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#ch-cloudstack:~:text=video%20here.-,14.9.%20High%20Availability%20and%20LINSTOR%20Volumes%20in%20CloudStack,-The%20CloudStack%20documentation
https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#ch-cloudstack:~:text=14.9.1.%20Explanation%20and%20Reasoning
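In practice, "VM HA only" comes down to giving the offerings `offerha=true` and leaving the Host HA framework alone. A hedged sketch using CloudMonkey (`cmk`); the offering name and sizes are placeholders, not from the LINSTOR guide:

```shell
# Compute offering with VM HA enabled (offerha=true); name/sizes are placeholders.
cmk create serviceoffering name=ha-small displaytext="HA small" \
    cpunumber=1 cpuspeed=1000 memory=1024 offerha=true

# Host HA (the IPMI/OOB fencing framework) is configured per host via the
# configureHAForHost / enableHAForHost APIs -- simply leave it unconfigured
# so that only VM HA reacts to a host failure.
```

With this, the HA workers restart guests on surviving hosts without ever trying to power-fence the failed node.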
I'm curious to know your progress and whether you managed to find a solution.
Happy to communicate and help each other out.
GitHub link:
https://github.com/apache/cloudstack/discussions/9362#discussioncomment-10070156