[
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588669#comment-16588669
]
ASF subversion and git services commented on CLOUDSTACK-10310:
--------------------------------------------------------------
Commit 023dcec5ef2e38091c0aacda1e0fae67fd6c4553 in cloudstack's branch
refs/heads/master from Slair1
[ https://gitbox.apache.org/repos/asf?p=cloudstack.git;h=023dcec ]
CLOUDSTACK-10310 Fix KVM reboot on storage issue (#2722)
> KVM hosts reboot if there is a short transient storage error
> ------------------------------------------------------------
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
> Issue Type: Improvement
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: KVM
> Affects Versions: 4.9.0, 4.10.0.0
> Reporter: Sean Lair
> Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, taking
> down all VMs running on it. The code does try 5 times before the reboot, but
> there is no delay between the retries, so they are 5 nearly simultaneous
> attempts, which doesn't help. Standard SAN storage HA operations or a quick
> network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be to have it sleep
> between tries so they aren't 5 almost simultaneous tries. Plus, instead of
> rebooting, the cloudstack-agent could just be stopped on the host. This will
> cause alerts to be issued and, if the host is disconnected long enough,
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic.
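
The approach proposed in the description (sleep between heartbeat write attempts, then stop the agent instead of rebooting) can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the commit: the function name, the callbacks, and the default delay of 10 seconds are all assumptions made for the example. Only the count of 5 tries comes from the description.

```python
import time
from typing import Callable


def write_heartbeat_with_retries(
    write_heartbeat: Callable[[], bool],   # hypothetical: returns True if the write succeeded
    on_failure: Callable[[], None],        # hypothetical: e.g. stop the agent rather than reboot
    max_tries: int = 5,                    # the description mentions 5 tries
    delay_seconds: float = 10.0,           # assumed value, not taken from the issue
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the heartbeat write up to max_tries times, pausing between attempts."""
    for attempt in range(1, max_tries + 1):
        if write_heartbeat():
            return True
        # Pause only between attempts, so the tries are spread out over time
        # and a short storage or network blip has a chance to clear.
        if attempt < max_tries:
            sleep(delay_seconds)
    # All attempts failed: take the less drastic action supplied by the caller.
    on_failure()
    return False
```

With these assumed defaults, a storage outage has to last roughly 40 seconds (four pauses of 10 seconds) before the failure action runs, whereas back-to-back retries would give up almost immediately. Passing the failure action as a callback keeps the choice between stopping the agent and rebooting in the caller's hands.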