[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

ASF GitHub Bot (JIRA) Tue, 30 Oct 2018 05:26:23 -0700


    [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668636#comment-16668636
 ]


ASF GitHub Bot commented on CLOUDSTACK-10310:
---------------------------------------------

somejfn commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-434280034
 
 
   One clarification about #2: with primary storage on NFS hard mounts, VMs don't
   go read-only (tested with OL5/6/7) and will resume writing to disk once the NFS
   server becomes available again, even after a 25-minute outage.
   
   On Tue, Oct 30, 2018 at 5:27 AM Paul Angus <[email protected]> wrote:
   
   > I'll try to summarise the scenario so that we're all trying to fix the
   > same thing...
   >
   >    1. A host cannot write to one of its primary storage pools.
   >    2. Some of the VMs on that host are on that pool, so their disks
   >    have gone read-only, but the VMs are still running.
   >    3. BUT there may be VMs on other primary storage pools that are
   >    absolutely fine
   



> KVM hosts reboot if there is a short transient storage error
> ------------------------------------------------------------
>
>                 Key: CLOUDSTACK-10310
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
>             Project: CloudStack
>          Issue Type: Improvement
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: KVM
>    Affects Versions: 4.9.0, 4.10.0.0
>            Reporter: Sean Lair
>            Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, taking 
> down all VMs running on it.  The code does try 5 times before the reboot, but 
> there is no delay between the retries, so they are effectively 5 simultaneous 
> attempts, which doesn't help.  Standard SAN storage HA operations or a quick 
> network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be to have it sleep 
> between tries so they aren't 5 almost-simultaneous tries.  Plus, instead of 
> rebooting, the cloudstack-agent could just be stopped on the host.  This will 
> cause alerts to be issued and, if the host is disconnected long enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic.
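
For illustration only, the retry-with-delay idea from the description could be sketched roughly as below. This is not the CloudStack code or the change in the PR; the function name, the callbacks, and the 10-second delay are all hypothetical. Only the count of 5 attempts comes from the description.

```python
import time
from typing import Callable


def write_heartbeat_with_retries(
    write_once: Callable[[], bool],
    on_give_up: Callable[[], None],
    attempts: int = 5,             # the description mentions 5 tries
    delay_seconds: float = 10.0,   # assumed value, not from the source
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the heartbeat write up to `attempts` times, pausing between tries.

    `write_once` is a hypothetical callable that returns True on success.
    `on_give_up` is a hypothetical fallback (for example, stopping the agent
    rather than rebooting the host); it runs only after every try has failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            if write_once():
                return True
        except OSError:
            # Treat a storage I/O error like a failed write and retry.
            pass
        if attempt < attempts:
            # Spread the tries out so a short blip has time to clear.
            sleep(delay_seconds)
    on_give_up()
    return False
```

With the defaults assumed here, a storage outage would have to last longer than about 40 seconds (four pauses of 10 seconds) before the fallback runs, and the fallback itself is whatever the caller passes in, so it can be less drastic than a reboot.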



