subject:"\"\\\\\\\[jira\\\\\\\] \\\\\\\[Commented\\\\\\\] \\\\\\\(CLOUDSTACK\\\\\\\-10310\\\\\\\) KVM hosts reboot if there is a short transient storage error\""

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-30 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668638#comment-16668638
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

somejfn commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-434280872
 
 
   On problem is while NFS is unavailable,  you wont be able to destroy the
   VM libvirt will just hang.  So if you attempt to destroy the and start
   a new VM,  when the NFS service comes back online you will get the
   duplicate VM.  That's why I would rather just wait for the NFS issue to go
   away rather than fire VM-HA in that case.
   
   On Tue, Oct 30, 2018 at 5:32 AM Paul Angus  wrote:
   
   > IMHO i'd say if a VM on that storage is marked as ha-enabled, it should be
   > powered-off and restarted somewhere else, and if it isn't HA enabled, we
   > shouldn't do anything with the running VM (as it's for the user of the VM
   > to deal with it),
   > in either case we should probably set the host to 'alert' so then an admin
   > can see it and do something about it.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > ,
   > or mute the thread
   > 

   > .
   >
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-30 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668636#comment-16668636
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

somejfn commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-434280034
 
 
   On precision about #2 With primary storage on NFS hard mounts VMs don't
   go read-only (tested with OL5/6/7) and will resume writing to disk once NFS
   server become available again even after a 25 minutes outage
   
   On Tue, Oct 30, 2018 at 5:27 AM Paul Angus  wrote:
   
   > I'll try to summarise the scenario so that we're all trying to fix the
   > same thing...
   >
   >1. A host cannot write to one of it's primary storage pools.
   >2. The some of the VMs on that host are on that pool, so their disks
   >have gone read-only, but the VM is still running.
   >3. BUT there may be VMs on other primary storage pools that are
   >absolutely fine
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > ,
   > or mute the thread
   > 

   > .
   >
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-30 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668418#comment-16668418
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

PaulAngus commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-434230839
 
 
   IMHO i'd say if a VM on that storage is marked as ha-enabled,  it should be 
powered-off and restarted somewhere else, and if it isn't HA enabled, we 
shouldn't do anything with the running VM (as it's for the user of the VM to 
deal with it),
   in either case we should probably set the host to 'alert' so then an admin 
can see it and do something about it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-30 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668413#comment-16668413
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

PaulAngus commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-434229093
 
 
   I'll try to summarise the scenario so that we're all trying to fix the same 
thing...
   
   1. A host cannot write to one of it's primary storage pools.
   2. The some of the VMs on that host are on that pool, so their disks have 
gone read-only, but the VM is still running.
   3. BUT there may be VMs on other primary storage pools that are absolutely 
fine
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-30 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668342#comment-16668342
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

rhtyd commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-434213795
 
 
   I should n't have taken word on this, I tested and could not reproduce the 
behaviour @csquire reported. I think we should revert the change and let the 
KVM host reset. VM disk corruption is worse than VM downtime.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-17 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653834#comment-16653834
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-430704620
 
 
   @somejfn maybe because I have PR #2474 also installed?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-17 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653828#comment-16653828
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

somejfn commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-430702548
 
 
   @Slair1   Testing with virtlockd,  if the host that crashed can't come back 
I had to manually clear the locks for VM HA to fire and restart the VM.  At 
that point it's the same as disabling VM HA and manually restart VM's after a 
KVM host crash.How is your setup different ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-08 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642050#comment-16642050
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

csquire commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-427887578
 
 
   @Slair1 thanks for the feedback, I will run more tests.  The behavior is 
still incorrect, so I went ahead and submitted issue #2890


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-08 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641989#comment-16641989
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-427871103
 
 
   @csquire The VMs should eventually crash themselves since they won't have 
access to their disks.  In our environment when a host gets marked as down we 
get alerted and an engineer can take a look at what is going on.  Yes, we rely 
on VM HA to decide what to do with the problem host also.  We make sure lockd 
is enabled so we don't get a split-brain situation.
   
   https://libvirt.org/locking-lockd.html
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-08 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641975#comment-16641975
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

csquire edited a comment on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-427867205
 
 
   Sorry, I misspoke in my last comment (edited to make it correct). The 
blocked host doesn't reboot, it just gets marked as `Down` and the VMs are 
actually still running on it when duplicate VMs get provisioned. Maybe it's a 
completely separate issue, but will still prevent us from using 4.11 in 
production. EDIT: Actually, looks like it may have been present before 4.11.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-08 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641952#comment-16641952
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

csquire edited a comment on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-427859929
 
 
   This PR doesn't seem to completely fix the problem (or maybe this is a 
completely new problem). We installed the RC release with this PR on a test 
system and are able to get the KVM host to be marked as `Down` by using 
iptables to drop outgoing requests to NFS. My investigation shows that the line 
[`storage = 
conn.storagePoolLookupByUUIDString(uuid);`](https://github.com/apache/cloudstack/blob/4.11/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHAMonitor.java#L95)
 blocks indefinitely.  So, `kvmheartbeat.sh` is never executed, a host 
investigation is started, the host with blocked NFS is marked as `Down` and 
finally all VMs on that host are rescheduled and result in duplicate VMs.
   
   I pulled a thread dump and found the KVMHAMonitor thread will hang here 
until NFS is unblocked, didn't dig any deeper yet though.
   
   ```"Thread-20" - Thread t@135
  java.lang.Thread.State: RUNNABLE
   at com.sun.jna.Native.invokePointer(Native Method)
   at com.sun.jna.Function.invokePointer(Function.java:470)
   at com.sun.jna.Function.invoke(Function.java:404)
   at com.sun.jna.Function.invoke(Function.java:315)
   at com.sun.jna.Library$Handler.invoke(Library.java:212)
   at com.sun.proxy.$Proxy3.virStoragePoolLookupByUUIDString(Unknown 
Source)
   at org.libvirt.Connect.storagePoolLookupByUUIDString(Unknown Source)
   at 
com.cloud.hypervisor.kvm.resource.KVMHAMonitor$Monitor.runInContext(KVMHAMonitor.java:95)
   - locked <1afb3370> (a java.util.concurrent.ConcurrentHashMap)
   at 
org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
   at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
   at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
   at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
   at 
org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
   at java.lang.Thread.run(Thread.java:748)
   
  Locked ownable synchronizers:
   - None```


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-08 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641948#comment-16641948
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

csquire commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-427867205
 
 
   Sorry, I misspoke in my last comment (edited to make it correct). The 
blocked host doesn't reboot, it just gets marked as `Down` and the VMs are 
actually still running on it when duplicate VMs get provisioned. Maybe it's a 
completely separate issue, but will still prevent us from using 4.11 in 
production.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-08 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641931#comment-16641931
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

rhtyd commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-427864552
 
 
   Good find @csquire, @Slair1 any comments? /cc @PaulAngus @borisstoyanov 
@DaanHoogland 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-10-08 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641913#comment-16641913
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

csquire commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-427859929
 
 
   This PR doesn't seem to completely fix the reboot problem. We installed the 
RC release with this PR on a test system and are still able to get the KVM host 
to reboot by using iptables to drop outgoing requests to NFS. My investigation 
shows that the line [`storage = 
conn.storagePoolLookupByUUIDString(uuid);`](https://github.com/apache/cloudstack/blob/4.11/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHAMonitor.java#L95)
 blocks indefinitely.  So, `kvmheartbeat.sh` is never executed, a host 
investigation is started, the host with blocked NFS is marked as `Down` and 
finally all VMs on that host are rescheduled and result in duplicate VMs.
   
   I pulled a thread dump and found the KVMHAMonitor thread will hang here 
until NFS is unblocked, didn't dig any deeper yet though.
   
   ```"Thread-20" - Thread t@135
  java.lang.Thread.State: RUNNABLE
   at com.sun.jna.Native.invokePointer(Native Method)
   at com.sun.jna.Function.invokePointer(Function.java:470)
   at com.sun.jna.Function.invoke(Function.java:404)
   at com.sun.jna.Function.invoke(Function.java:315)
   at com.sun.jna.Library$Handler.invoke(Library.java:212)
   at com.sun.proxy.$Proxy3.virStoragePoolLookupByUUIDString(Unknown 
Source)
   at org.libvirt.Connect.storagePoolLookupByUUIDString(Unknown Source)
   at 
com.cloud.hypervisor.kvm.resource.KVMHAMonitor$Monitor.runInContext(KVMHAMonitor.java:95)
   - locked <1afb3370> (a java.util.concurrent.ConcurrentHashMap)
   at 
org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
   at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
   at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
   at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
   at 
org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
   at java.lang.Thread.run(Thread.java:748)
   
  Locked ownable synchronizers:
   - None```


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-08-22 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588669#comment-16588669
 ] 

ASF subversion and git services commented on CLOUDSTACK-10310:
--

Commit 023dcec5ef2e38091c0aacda1e0fae67fd6c4553 in cloudstack's branch 
refs/heads/master from Slair1
[ https://gitbox.apache.org/repos/asf?p=cloudstack.git;h=023dcec ]

CLOUDSTACK-10310 Fix KVM reboot on storage issue (#2722)



> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-08-20 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585617#comment-16585617
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

DaanHoogland closed pull request #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java 
b/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java
index be5ab396d19..f180848a8d5 100644
--- 
a/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java
+++ 
b/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java
@@ -34,7 +34,8 @@
 protected static String s_heartBeatPath;
 protected long _heartBeatUpdateTimeout = 6;
 protected long _heartBeatUpdateFreq = 6;
-protected long _heartBeatUpdateMaxRetry = 3;
+protected long _heartBeatUpdateMaxTries = 5;
+protected long _heartBeatUpdateRetrySleep = 15000;
 
 public static enum PoolType {
 PrimaryStorage, SecondaryStorage
diff --git 
a/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHAMonitor.java
 
b/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHAMonitor.java
index 0cebb4c9b00..8a11b7fc962 100644
--- 
a/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHAMonitor.java
+++ 
b/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHAMonitor.java
@@ -119,7 +119,8 @@ protected void runInContext() {
 }
 
 String result = null;
-for (int i = 0; i < 5; i++) {
+// Try multiple times, but sleep in between tries to 
ensure it isn't a short lived transient error
+for (int i = 1; i <= _heartBeatUpdateMaxTries; i++) {
 Script cmd = new Script(s_heartBeatPath, 
_heartBeatUpdateTimeout, s_logger);
 cmd.add("-i", primaryStoragePool._poolIp);
 cmd.add("-p", primaryStoragePool._poolMountSourcePath);
@@ -127,14 +128,21 @@ protected void runInContext() {
 cmd.add("-h", _hostIP);
 result = cmd.execute();
 if (result != null) {
-s_logger.warn("write heartbeat failed: " + result 
+ ", retry: " + i);
+s_logger.warn("write heartbeat failed: " + result 
+ ", try: " + i + " of " + _heartBeatUpdateMaxTries);
+try {
+Thread.sleep(_heartBeatUpdateRetrySleep);
+} catch (InterruptedException e) {
+s_logger.debug("[ignored] interupted between 
heartbeat retries.");
+}
 } else {
 break;
 }
 }
 
 if (result != null) {
-s_logger.warn("write heartbeat failed: " + result + "; 
reboot the host");
+// Stop cloudstack-agent if can't write to heartbeat 
file.
+// This will raise an alert on the mgmt server
+s_logger.warn("write heartbeat failed: " + result + "; 
stopping cloudstack-agent");
 Script cmd = new Script(s_heartBeatPath, 
_heartBeatUpdateTimeout, s_logger);
 cmd.add("-i", primaryStoragePool._poolIp);
 cmd.add("-p", primaryStoragePool._poolMountSourcePath);
diff --git a/scripts/vm/hypervisor/kvm/kvmheartbeat.sh 
b/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
index 7c8ee67f30c..30ca72a2aa9 100755
--- a/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
+++ b/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
@@ -155,10 +155,10 @@ then
   exit 0
 elif [ "$cflag" == "1" ]
 then
-  /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was 
unable to write the heartbeat to the storage."
+  /usr/bin/logger -t heartbeat "kvmheartbeat.sh stopped cloudstack-agent 
because it was unable to write the heartbeat to the storage."
   sync &
   sleep 5
-  echo b > /proc/sysrq-trigger
+  service cloudstack-agent stop
   exit $?
 else
   write_hbLog 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infr

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-08-20 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585619#comment-16585619
 ] 

ASF subversion and git services commented on CLOUDSTACK-10310:
--

Commit 023dcec5ef2e38091c0aacda1e0fae67fd6c4553 in cloudstack's branch 
refs/heads/4.11 from Slair1
[ https://gitbox.apache.org/repos/asf?p=cloudstack.git;h=023dcec ]

CLOUDSTACK-10310 Fix KVM reboot on storage issue (#2722)



> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-27 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525484#comment-16525484
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

DaanHoogland commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-400797899
 
 
   @Slair1 this is a bug fix. It can go into 4.11 and if we merge we also merge 
forward to master


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-27 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525131#comment-16525131
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-400687544
 
 
   @borisstoyanov I didn't know if there was going to be another 4.11.x release 
or not.  If we prefer, i can rebase to master?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-27 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524689#comment-16524689
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

borisstoyanov commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-400567933
 
 
   @Slair1 why not master instead of 4.11? 
   4.11.1 just went out.  


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-26 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524246#comment-16524246
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

blueorangutan commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-400475036
 
 
   Trillian test result (tid-2827)
   Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
   Total time taken: 30394 seconds
   Marvin logs: 
https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2722-t2827-kvm-centos7.zip
   Intermitten failure detected: 
/marvin/tests/smoke/test_deploy_virtio_scsi_vm.py
   Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
   Intermitten failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
   Intermitten failure detected: /marvin/tests/smoke/test_vpc_redundant.py
   Intermitten failure detected: /marvin/tests/smoke/test_vpc_vpn.py
   Intermitten failure detected: /marvin/tests/smoke/test_hostha_kvm.py
   Smoke tests completed. 63 look OK, 4 have error(s)
   Only failed tests results shown below:
   
   
   Test | Result | Time (s) | Test File
   --- | --- | --- | ---
   ContextSuite context=TestDeployVirtioSCSIVM>:setup | `Error` | 0.00 | 
test_deploy_virtio_scsi_vm.py
   test_03_vpc_privategw_restart_vpc_cleanup | `Failure` | 1116.27 | 
test_privategw_acl.py
   test_05_rvpc_multi_tiers | `Failure` | 319.98 | test_vpc_redundant.py
   test_05_rvpc_multi_tiers | `Error` | 341.79 | test_vpc_redundant.py
   test_hostha_enable_ha_when_host_in_maintenance | `Error` | 2.44 | 
test_hostha_kvm.py
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-26 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523688#comment-16523688
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

blueorangutan commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-400298918
 
 
   @borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has 
been kicked to run smoke tests


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-26 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523686#comment-16523686
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

borisstoyanov commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-400298383
 
 
   @blueorangutan test


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-26 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523684#comment-16523684
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

blueorangutan commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-400297724
 
 
   Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2154


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-26 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523655#comment-16523655
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

borisstoyanov commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-400288716
 
 
   @blueorangutan package


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-26 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523656#comment-16523656
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

blueorangutan commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-400288946
 
 
   @borisstoyanov a Jenkins job has been kicked to build packages. I'll keep 
you posted as I make progress.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-25 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522435#comment-16522435
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 edited a comment on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-399988928
 
 
   Looks like there is an additional Issue open #2657 and related PR #2658


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-25 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522423#comment-16522423
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-399988928
 
 
   Looks like there is an additional Issue open #2657 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-25 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522416#comment-16522416
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 commented on issue #2472: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2472#issuecomment-399987572
 
 
   Closing and rebasing to 4.11 in PR #2722 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-25 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522414#comment-16522414
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 opened a new pull request #2722: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2722
 
 
   Rebase of #2472 onto 4.11
   
   If the KVM heartbeat file can't be written to, the host is rebooted, and 
thus taking down all VMs running on it. The code does try 5x times before the 
reboot, but the there is not a delay between the retires, so they are 5 
simultaneous retries, which doesn't help. Standard SAN storage HA operations or 
quick network blip could cause this reboot to occur.
   
   Some discussions on the dev mailing list revealed that some people are just 
commenting out the reboot line in their version of the CloudStack source.
   
   A better option would be have it sleep between tries so it isn't 5x almost 
simultaneous tries. Plus, instead of rebooting, the cloudstack-agent could just 
be stopped on the host instead. This will cause alerts to be issued and if the 
host is disconnected long-enough, depending on the HA code in use, VM HA could 
handle the host failure.
   
   The built-in reboot of the host seemed drastic


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-23 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16521228#comment-16521228
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

DaanHoogland commented on issue #2472: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2472#issuecomment-399699174
 
 
   @Slair1  validating it will be much easier. we can forward merge but 4.11 is 
the about to be the current LTS version.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-06-22 Thread ASF GitHub Bot (JIRA)



[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520510#comment-16520510
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 commented on issue #2472: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2472#issuecomment-399485087
 
 
   @wido @DaanHoogland @rhtyd 
   Thoughts on me rebasing this to 4.11?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-05-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483566#comment-16483566
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

wido commented on issue #2472: CLOUDSTACK-10310 Fix KVM reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#issuecomment-390883172
 
 
   @rhtyd Both cases are not ideal. Because if you stop the Agent due to a 
single Primary Storage failure you leave VMs running there which still might be 
doing I/O and to start them a second time somewhere else.
   
   The hard reboot isn't a very nice either, so you have to deal with two bad 
things.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-05-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483553#comment-16483553
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

DaanHoogland commented on issue #2472: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2472#issuecomment-390876820
 
 
   echo b > /proc/sysrq-trigger is not acceptable to more people, I don't see 
how killing (and maybe restarting) the agent can be worse. It might not solve 
issues related to nfs for instance.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-05-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482270#comment-16482270
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

rhtyd commented on issue #2472: CLOUDSTACK-10310 Fix KVM reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#issuecomment-390579296
 
 
   @DaanHoogland @rafaelweingartner @wido @bwsw - do you think stopping the 
agent can cause a bigger regression than the historic behaviour of a host 
reboot? I did not follow any discussion (if any) had on dev ML?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404409#comment-16404409
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

rhtyd commented on a change in pull request #2472: CLOUDSTACK-10310 Fix KVM 
reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r175337493
 
 

 ##
 File path: scripts/vm/hypervisor/kvm/kvmheartbeat.sh
 ##
 @@ -155,10 +155,10 @@ then
   exit 0
 elif [ "$cflag" == "1" ]
 then
-  /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was 
unable to write the heartbeat to the storage."
+  /usr/bin/logger -t heartbeat "kvmheartbeat.sh stopped cloudstack-agent 
because it was unable to write the heartbeat to the storage."
   sync &
   sleep 5
-  echo b > /proc/sysrq-trigger
+  service cloudstack-agent stop
 
 Review comment:
   Maybe you can re-send the PR in two parts, the first half that increases the 
heartbeat retry threshold to `5` LGTM. The second part could be the 
configuration fix, global setting or setting in agent.properties to decide 
whether to reboot kvm host or stop agent.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404408#comment-16404408
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

rhtyd commented on a change in pull request #2472: CLOUDSTACK-10310 Fix KVM 
reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r175337493
 
 

 ##
 File path: scripts/vm/hypervisor/kvm/kvmheartbeat.sh
 ##
 @@ -155,10 +155,10 @@ then
   exit 0
 elif [ "$cflag" == "1" ]
 then
-  /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was 
unable to write the heartbeat to the storage."
+  /usr/bin/logger -t heartbeat "kvmheartbeat.sh stopped cloudstack-agent 
because it was unable to write the heartbeat to the storage."
   sync &
   sleep 5
-  echo b > /proc/sysrq-trigger
+  service cloudstack-agent stop
 
 Review comment:
   Maybe you can sent the PR in two parts, the first half that increases the 
heartbeat retry threshold to `5` LGTM. The second part could be the 
configuration fix, global setting or setting in agent.properties to decide 
whether to reboot kvm host or stop agent.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397555#comment-16397555
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 commented on issue #2472: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2472#issuecomment-372797350
 
 
   @wido My opinion is that the ability to start a VM on two hosts in the 
cluster is a bigger issue.  We run KVM and just enabled libvirt's lockd to stop 
that from being able to happen.  Maybe this PR could be changed to be an 
option, just reboot host or stop agent.
   
   https://libvirt.org/locking-lockd.html


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397530#comment-16397530
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 commented on a change in pull request #2472: CLOUDSTACK-10310 Fix KVM 
reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r174260156
 
 

 ##
 File path: scripts/vm/hypervisor/kvm/kvmheartbeat.sh
 ##
 @@ -155,10 +155,10 @@ then
   exit 0
 elif [ "$cflag" == "1" ]
 then
-  /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was 
unable to write the heartbeat to the storage."
+  /usr/bin/logger -t heartbeat "kvmheartbeat.sh stopped cloudstack-agent 
because it was unable to write the heartbeat to the storage."
   sync &
   sleep 5
-  echo b > /proc/sysrq-trigger
+  service cloudstack-agent stop
 
 Review comment:
   Sorry for the delayed response, i have been out of town for the last week.  
Yea, i'm OK with a global parameter.  i'd just have to sit down and determine 
how to do that.  I could likely just make it a parameter passed to the script...


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392926#comment-16392926
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

wido commented on issue #2472: CLOUDSTACK-10310 Fix KVM reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#issuecomment-371826140
 
 
   Can VM HA handle such a failure? Let's say you have two NFS mounts on that 
host and once fails. You only stop the Agent, but other VMs keep running.
   
   Afaik that can still cause a corruption where a VM is spawned twice in a 
cluster because the other host can't know if that VM is still running or not.
   
   Our investigators aren't that smart. They rely on ICMP and such which might 
be blocked.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-04 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385115#comment-16385115
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

rafaelweingartner commented on a change in pull request #2472: CLOUDSTACK-10310 
Fix KVM reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r172048341
 
 

 ##
 File path: scripts/vm/hypervisor/kvm/kvmheartbeat.sh
 ##
 @@ -155,10 +155,10 @@ then
   exit 0
 elif [ "$cflag" == "1" ]
 then
-  /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was 
unable to write the heartbeat to the storage."
+  /usr/bin/logger -t heartbeat "kvmheartbeat.sh stopped cloudstack-agent 
because it was unable to write the heartbeat to the storage."
   sync &
   sleep 5
-  echo b > /proc/sysrq-trigger
+  service cloudstack-agent stop
 
 Review comment:
   Yes, the externalization for a global settings would need more work, but I 
think it is worth it.
   We could help/guide @Slair1 if he does not where to start then.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-04 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385113#comment-16385113
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

rhtyd commented on a change in pull request #2472: CLOUDSTACK-10310 Fix KVM 
reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r172048283
 
 

 ##
 File path: scripts/vm/hypervisor/kvm/kvmheartbeat.sh
 ##
 @@ -155,10 +155,10 @@ then
   exit 0
 elif [ "$cflag" == "1" ]
 then
-  /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was 
unable to write the heartbeat to the storage."
+  /usr/bin/logger -t heartbeat "kvmheartbeat.sh stopped cloudstack-agent 
because it was unable to write the heartbeat to the storage."
   sync &
   sleep 5
-  echo b > /proc/sysrq-trigger
+  service cloudstack-agent stop
 
 Review comment:
   Good idea @rafaelweingartner, perhaps a cluster scope global parameter. 
However, such a setting need to be propagated when agent/client connects to 
mgmt server (say via ReadyCommand) or overridable in agent.properties file.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-04 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385111#comment-16385111
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

rafaelweingartner commented on a change in pull request #2472: CLOUDSTACK-10310 
Fix KVM reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r172048154
 
 

 ##
 File path: scripts/vm/hypervisor/kvm/kvmheartbeat.sh
 ##
 @@ -155,10 +155,10 @@ then
   exit 0
 elif [ "$cflag" == "1" ]
 then
-  /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was 
unable to write the heartbeat to the storage."
+  /usr/bin/logger -t heartbeat "kvmheartbeat.sh stopped cloudstack-agent 
because it was unable to write the heartbeat to the storage."
   sync &
   sleep 5
-  echo b > /proc/sysrq-trigger
+  service cloudstack-agent stop
 
 Review comment:
   Instead, what about creating a global parameter to determine this behavior? 
Then, the administrator would be able to change this easily.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-04 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385110#comment-16385110
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

rhtyd commented on a change in pull request #2472: CLOUDSTACK-10310 Fix KVM 
reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r172048102
 
 

 ##
 File path: scripts/vm/hypervisor/kvm/kvmheartbeat.sh
 ##
 @@ -155,10 +155,10 @@ then
   exit 0
 elif [ "$cflag" == "1" ]
 then
-  /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was 
unable to write the heartbeat to the storage."
+  /usr/bin/logger -t heartbeat "kvmheartbeat.sh stopped cloudstack-agent 
because it was unable to write the heartbeat to the storage."
   sync &
   sleep 5
-  echo b > /proc/sysrq-trigger
+  service cloudstack-agent stop
 
 Review comment:
   Please start a discussion on dev@, this change to simply stop the agent may 
not be acceptable for all cases.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-04 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385098#comment-16385098
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

rafaelweingartner commented on a change in pull request #2472: CLOUDSTACK-10310 
Fix KVM reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r172047473
 
 

 ##
 File path: 
plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java
 ##
 @@ -34,7 +34,8 @@
 protected static String s_heartBeatPath;
 protected long _heartBeatUpdateTimeout = 6;
 protected long _heartBeatUpdateFreq = 6;
-protected long _heartBeatUpdateMaxRetry = 3;
+protected long _heartBeatUpdateMaxTries = 5;
 
 Review comment:
   What about externalizing these two information as parameters for operators 
to customize?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385002#comment-16385002
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 commented on issue #2472: CLOUDSTACK-10310 Fix KVM reboot on storage 
issue
URL: https://github.com/apache/cloudstack/pull/2472#issuecomment-370207112
 
 
   Comments should be addressed - i also repushed and squashed commits


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384454#comment-16384454
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

bwsw commented on a change in pull request #2472: CLOUDSTACK-10310 Fix KVM 
reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r172004532
 
 

 ##
 File path: 
plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHAMonitor.java
 ##
 @@ -115,22 +115,33 @@ protected void runInContext() {
 }
 
 String result = null;
-for (int i = 0; i < 5; i++) {
+// Try 3 times, but sleep in between tries to ensure it 
isn't a short lived transient error
+for (int i = 1; i <= 4; i++) {
 Script cmd = new Script(s_heartBeatPath, 
_heartBeatUpdateTimeout, s_logger);
 cmd.add("-i", primaryStoragePool._poolIp);
 cmd.add("-p", primaryStoragePool._poolMountSourcePath);
 cmd.add("-m", primaryStoragePool._mountDestPath);
 cmd.add("-h", _hostIP);
 result = cmd.execute();
 if (result != null) {
-s_logger.warn("write heartbeat failed: " + result 
+ ", retry: " + i);
+s_logger.warn("write heartbeat failed: " + result 
+ ", retry: " + i + " of 4");
 
 Review comment:
   I disabled that behaviour in our 4.3 completely.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383315#comment-16383315
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

DaanHoogland commented on a change in pull request #2472: CLOUDSTACK-10310 Fix 
KVM reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r171781559
 
 

 ##
 File path: 
plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHAMonitor.java
 ##
 @@ -115,22 +115,33 @@ protected void runInContext() {
 }
 
 String result = null;
-for (int i = 0; i < 5; i++) {
+// Try 3 times, but sleep in between tries to ensure it 
isn't a short lived transient error
+for (int i = 1; i <= 4; i++) {
 
 Review comment:
   the comment says try three times but i count 4 iterations. either the 
comment or the loop-control needs to change


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383314#comment-16383314
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

DaanHoogland commented on a change in pull request #2472: CLOUDSTACK-10310 Fix 
KVM reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r171781689
 
 

 ##
 File path: 
plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHAMonitor.java
 ##
 @@ -115,22 +115,33 @@ protected void runInContext() {
 }
 
 String result = null;
-for (int i = 0; i < 5; i++) {
+// Try 3 times, but sleep in between tries to ensure it 
isn't a short lived transient error
+for (int i = 1; i <= 4; i++) {
 Script cmd = new Script(s_heartBeatPath, 
_heartBeatUpdateTimeout, s_logger);
 cmd.add("-i", primaryStoragePool._poolIp);
 cmd.add("-p", primaryStoragePool._poolMountSourcePath);
 cmd.add("-m", primaryStoragePool._mountDestPath);
 cmd.add("-h", _hostIP);
 result = cmd.execute();
 if (result != null) {
-s_logger.warn("write heartbeat failed: " + result 
+ ", retry: " + i);
+s_logger.warn("write heartbeat failed: " + result 
+ ", retry: " + i + " of 4");
 
 Review comment:
   if this 4 in the log message is the same one as the one in the control 
statement you want to extract to a constant/field/variable


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383300#comment-16383300
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 commented on a change in pull request #2472: CLOUDSTACK-10310 Fix KVM 
reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r171780920
 
 

 ##
 File path: scripts/vm/hypervisor/kvm/kvmheartbeat.sh
 ##
 @@ -155,10 +155,10 @@ then
   exit 0
 elif [ "$cflag" == "1" ]
 then
-  /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was 
unable to write the heartbeat to the storage."
+  /usr/bin/logger -t heartbeat "kvmheartbeat.sh stopped cloudstack-agent 
because it was unable to write the heartbeat to the storage."
   sync &
   sleep 5
-  echo b > /proc/sysrq-trigger
+  systemctl stop cloudstack-agent
 
 Review comment:
   Good thought, I can change that to “service” for older version compatibility 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383277#comment-16383277
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

borisstoyanov commented on a change in pull request #2472: CLOUDSTACK-10310 Fix 
KVM reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2472#discussion_r171776284
 
 

 ##
 File path: scripts/vm/hypervisor/kvm/kvmheartbeat.sh
 ##
 @@ -155,10 +155,10 @@ then
   exit 0
 elif [ "$cflag" == "1" ]
 then
-  /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was 
unable to write the heartbeat to the storage."
+  /usr/bin/logger -t heartbeat "kvmheartbeat.sh stopped cloudstack-agent 
because it was unable to write the heartbeat to the storage."
   sync &
   sleep 5
-  echo b > /proc/sysrq-trigger
+  systemctl stop cloudstack-agent
 
 Review comment:
   @Slair1 are you sure this will work with CentOS6 and debian KVM hosts? Same 
goes with the change above


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error

2018-03-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382637#comment-16382637
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10310:
-

Slair1 opened a new pull request #2472: CLOUDSTACK-10310 Fix KVM reboot on 
storage issue
URL: https://github.com/apache/cloudstack/pull/2472
 
 
   If the KVM heartbeat file can't be written to, the host is rebooted, and 
thus taking down all VMs running on it.  The code does try 5x times before the 
reboot, but the there is not a delay between the retires, so they are 5 
simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
or quick network blip could cause this reboot to occur.
   
   Some discussions on the dev mailing list revealed that some people are just 
commenting out the reboot line in their version of the CloudStack source.
   
   A better option would be have it sleep between tries so it isn't 5x almost 
simultaneous tries.  Plus, instead of rebooting, the cloudstack-agent could 
just be stopped on the host instead.  This will cause alerts to be issued and 
if the host is disconnected long-enough, depending on the HA code in use, VM HA 
could handle the host failure.
   
   The built-in reboot of the host seemed drastic


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> 
>
> Key: CLOUDSTACK-10310
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
> Project: CloudStack
>  Issue Type: Improvement
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: KVM
>Affects Versions: 4.9.0, 4.10.0.0
>Reporter: Sean Lair
>Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, and thus 
> taking down all VMs running on it.  The code does try 5x times before the 
> reboot, but the there is not a delay between the retires, so they are 5 
> simultaneous retries, which doesn't help.  Standard SAN storage HA operations 
> or quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just 
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be have it sleep between 
> tries so it isn't 5x almost simultaneous tries.  Plus, instead of rebooting, 
> the cloudstack-agent could just be stopped on the host instead.  This will 
> cause alerts to be issued and if the host is disconnected long-enough, 
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

52 matches

Mail list logo