[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-12-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251498#comment-14251498
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit 8dfcd51bebe990449faff21e39b2b9315d11046d in cloudstack's branch 
refs/heads/4.4 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=8dfcd51 ]

CLOUDSTACK-7184 new config var is not available when not in the db


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-12-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251530#comment-14251530
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit 86c425ec839ccc14fe323b48b875753633675e3f in cloudstack's branch 
refs/heads/4.4 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=86c425e ]

CLOUDSTACK-7184 c&p feature fix in conf-value-description


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-12-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251532#comment-14251532
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit 8b6e251b5d5e47a45f392d999fde0f14765c3ca1 in cloudstack's branch 
refs/heads/4.5 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=8b6e251 ]

CLOUDSTACK-7184 config value for xen heartbeat timeout


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-12-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251566#comment-14251566
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit 8b6e251b5d5e47a45f392d999fde0f14765c3ca1 in cloudstack's branch 
refs/heads/master from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=8b6e251 ]

CLOUDSTACK-7184 config value for xen heartbeat timeout


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-12-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251651#comment-14251651
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit 67a7f74be0db6f2d505c9ba1ade7782238e1962a in cloudstack's branch 
refs/heads/4.5 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=67a7f74 ]

CLOUDSTACK-7184 fieldname typo


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-12-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251652#comment-14251652
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit 67a7f74be0db6f2d505c9ba1ade7782238e1962a in cloudstack's branch 
refs/heads/master from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=67a7f74 ]

CLOUDSTACK-7184 fieldname typo


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-12-19 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253402#comment-14253402
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit 79fc9d26875dc03465f26f9fe8a9dd188b64cb6a in cloudstack's branch 
refs/heads/4.4 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=79fc9d2 ]

CLOUDSTACK-7184 fieldname typo


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-07-28 Thread Daan Hoogland (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076195#comment-14076195
 ] 

Daan Hoogland commented on CLOUDSTACK-7184:
---

after investigation of /opt/cloud/bin/xenheartbeat.sh: 
xen(server).heartbeat.interval is being used as timout in the scripts. 
CloudStack passes two parameters and leaves out the second (instead of the 
third).

During a heartbeat event Cloudstack honors no timeout before respinning any vms 
on the absent host. This gives the host no chance to fence.

We need to add a timeout to pass to the heartbeat script and to let cloudstack 
postpone starting HA jobs that are the result of the absent host.

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-07-29 Thread Anthony Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078553#comment-14078553
 ] 

Anthony Xu commented on CLOUDSTACK-7184:


CS talks to other host and run check_heartbeat.sh in other host to make sure 
the host self-fence.
But it highly depends on all hosts in cluster having synced time,  NTP is 
recommended for XS hosts.

please check the time in hosts in the cluster, if time is not synced, that 
would be the cause for this issue.


Anthony



> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-08-07 Thread Daan Hoogland (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089003#comment-14089003
 ] 

Daan Hoogland commented on CLOUDSTACK-7184:
---

Anthony, this is not an NTP issue. It is cloudstack passing two parameters to 
the heartbeat script instead of three. The second one passed has the semantics 
of the third.
There is an interval and an timeout, both should be send, only one value is 
passed.

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-08-21 Thread John Kinsella (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105710#comment-14105710
 ] 

John Kinsella commented on CLOUDSTACK-7184:
---

I've seen similar with KVM - I'm not sure this is necessarily tied to Xen? I'd 
suggest that possibly CS be a little more thorough before deciding a VM is 
down...maybe via channels other than the agent/VR?

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-08-21 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105718#comment-14105718
 ] 

Joris van Lieshout commented on CLOUDSTACK-7184:


Hi, I am currently out of office and will be back Wednesday the 27th of August. 
During this time I will have limited access to e-mail and might not be able to 
take your call. For urgent matter regarding ASR please contact 
int-...@schubergphilis.com instead. For other urgent matter please contact one 
of my colleagues.

Kind regards,
Joris van Lieshout


Schuberg Philis
schubergphilis.com

+31207506672
+31651428188


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-08 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125514#comment-14125514
 ] 

Joris van Lieshout commented on CLOUDSTACK-7184:


Hi, I am currently out of office and will be back Tuesday the 23rd of 
September. During this time I will have limited access to e-mail and might not 
be able to take your call. For urgent matter regarding ASR please contact 
int-...@schubergphilis.com instead. For Cloud IaaS matters please contact 
int-cl...@schubergphilis.com.

Kind regards,
Joris van Lieshout


Schuberg Philis
schubergphilis.com

+31207506672
+31651428188


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-08 Thread Daan Hoogland (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125515#comment-14125515
 ] 

Daan Hoogland commented on CLOUDSTACK-7184:
---

John, is KVM using the same xenheartbeat.sh? it would seem not. Remy and I have 
found that it is actually a wrong number of parameters passed into this script 
which is the problem.

cloudstack calls xenheartbeat.sh  

xenheartbeat.sh parses   and then finds no 

I will go and write a thingy around this.

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127488#comment-14127488
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit 7f97bf7c58a348e684f7138c93c287a3c86ca55b in cloudstack's branch 
refs/heads/hotfix/4.4/CLOUDSTACK-7184 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=7f97bf7 ]

CLOUDSTACK-7184: xenheartbeat gets passed timeout and interval

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-09 Thread Daan Hoogland (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127523#comment-14127523
 ] 

Daan Hoogland commented on CLOUDSTACK-7184:
---

Could some of you please review hotfix/4.4/CLOUDSTACK-7184

I could do with some advice on how to thoroughly test this functionality as
well.

thanks,
Daan


On Tue, Sep 9, 2014 at 10:22 PM, ASF subversion and git services (JIRA) <




-- 
Daan


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-09 Thread John Kinsella (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127569#comment-14127569
 ] 

John Kinsella commented on CLOUDSTACK-7184:
---

No, KVM uses kvmheartbeat.sh, which unfortunately looks a lot different than 
xenheartbeat.sh.

Looks like the issue I've seen is different, will try to track it 
down/replicate and file a separate bug.

Patch looks clean, but I'm not in position to test it at the moment.


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-10 Thread Brenn Oosterbaan (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128206#comment-14128206
 ] 

Brenn Oosterbaan commented on CLOUDSTACK-7184:
--

"I've seen similar with KVM - I'm not sure this is necessarily tied to Xen? I'd 
suggest that possibly CS be a little more thorough before deciding a VM is 
down...maybe via channels other than the agent/VR?"

John is right on the money here. Although the patch comitted by Daan does give 
the possibility to specify a check interval for the Xen storage heartbeat 
script (instead of using the default of 5 seconds) it is not the root cause of 
this issue.

There are two mechanisms at work here. The xen heartbeat script which checks if 
the storage is reachable on a specific hypervisors, and Cloudstack which 
determines if a hypervisor is up or not.

When we set the Xen heartbeat interval to 180 seconds we basicly said: it's ok 
for vm's living on a hypervisor to 'hang' for 180 seconds in case of storage 
fail-overs or other issues.
Cloudstack has it's own checking mechanisms to determine if a hypervisor is 
down or not. Those checks are not in line with the xen heartbeat interval. 
Which means that even though we decided 180 seconds of unavailability is fine, 
Cloudstack tries to connect to the hypervisor 3 times (in ~30 seconds) and then 
decides it is down and starts the vm's on another hypervisor. 
That is the issue/bug Remi meant to identify when filing this ticket.

I personally feel there should be two additional global options: 
hypervisor.heartbeat.interval and hypervisor.heartbeat.max_retry.
This would allow us to decide to (for instance) set the interval to 15 seconds 
and the max_retry to 12. Which would then add up to 180 seconds as well. 
Since the default heartbeat timeout is 60 seconds I would set the defaults for 
these to a combination which allows for 60 seconds as well. Otherwise you will 
never be sure the hypervisor it self has actually rebooted and thus VM 
corruption could still take place.

regards,

Brenn

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-10 Thread Brenn Oosterbaan (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128217#comment-14128217
 ] 

Brenn Oosterbaan commented on CLOUDSTACK-7184:
--

Implementing the solution above looks like the fastest/easiest fix to me.

Ideally you would want a second heartbeat script on the hypervisor as well. One 
which checks if Cloudstack has been able to communicate with the hypervisor, 
and if not fences the hypervisor using the same interval and retry counts as 
Cloudstack.
This would allow us to set different timeouts for different scenario's without 
the possibility of corrupted VM's:
- If the storage is unreachable but the hypervisor is up and running - wait 
until storage timeout value is reached and reboot.
- If the hypervisor has crashed OR the networking stack for that hypervisor is 
stuck/unreachable - wait until the hypervisor max retry value is met, HA the 
vm's and reboot the hypervisor.

That way we could decide to set the hypervisor.heartbeat.interval to 5 and the 
hypervisor.heartbeat.max_retry to 6 and the storage timeout to 180. Which would 
effectively say: if storage is down wait 180 seconds, if something else 
happened only wait 30 seconds.

But this should probably be a feature request.

regards,

Brenn

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-10 Thread Daan Hoogland (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128240#comment-14128240
 ] 

Daan Hoogland commented on CLOUDSTACK-7184:
---

Brenn's idea seems sane to me, I will add the config items for the 
ping-check/fencing as well

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-10 Thread Remi Bergsma (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128253#comment-14128253
 ] 

Remi Bergsma commented on CLOUDSTACK-7184:
--

+1!


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-10 Thread Remi Bergsma (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128266#comment-14128266
 ] 

Remi Bergsma commented on CLOUDSTACK-7184:
--

[~dahn] Now that you're working on this script, please also look at 
CLOUDSTACK-7527 (https://issues.apache.org/jira/browse/CLOUDSTACK-7527). Thx!


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-15 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134023#comment-14134023
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit b0641a7d279734970577a3a87940abd030a6a8c2 in cloudstack's branch 
refs/heads/hotfix/4.4/CLOUDSTACK-7184 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=b0641a7 ]

CLOUDSTACK-7184 timeout configuration value for host check

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135276#comment-14135276
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit b82f27be4150e70c017ed2597137319daa79560b in cloudstack's branch 
refs/heads/hotfix/4.4/CLOUDSTACK-7184 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=b82f27b ]

CLOUDSTACK-7184 retry-wait loop config to deal with network glitches


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135274#comment-14135274
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit 4d065b9a3a336d59902c266202c1094509c007d2 in cloudstack's branch 
refs/heads/hotfix/4.4/CLOUDSTACK-7184 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=4d065b9 ]

CLOUDSTACK-7184: xenheartbeat gets passed timeout and interval

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135589#comment-14135589
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit 4d065b9a3a336d59902c266202c1094509c007d2 in cloudstack's branch 
refs/heads/4.4 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=4d065b9 ]

CLOUDSTACK-7184: xenheartbeat gets passed timeout and interval

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-17 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137190#comment-14137190
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit d0b39df1c38d33608628d650734410d8c72837e4 in cloudstack's branch 
refs/heads/hotfix/4.4/CLOUDSTACK-7184 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=d0b39df ]

CLOUDSTACK-7184 retry-wait loop config to deal with network glitches

(cherry picked from commit b82f27be4150e70c017ed2597137319daa79560b)


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-17 Thread Daan Hoogland (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138584#comment-14138584
 ] 

Daan Hoogland commented on CLOUDSTACK-7184:
---

after considderation and consult with Rohit and Hugo; the added timeout/retry 
configurables are useful for tuning the system to be resiliant to net glitches 
but do not solve the root cause described in the issue. I will implement them 
and open a new issue for improving the XenServerInvestigator.

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138714#comment-14138714
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit a29f954a269c992307f0410df88ca4ac7a0b82a0 in cloudstack's branch 
refs/heads/hotfix/4.4/CLOUDSTACK-7184 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=a29f954 ]

CLOUDSTACK-7184 retry-wait loop config to deal with network glitches


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138717#comment-14138717
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit a29f954a269c992307f0410df88ca4ac7a0b82a0 in cloudstack's branch 
refs/heads/4.4 from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=a29f954 ]

CLOUDSTACK-7184 retry-wait loop config to deal with network glitches


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138783#comment-14138783
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit dec9133dcd8e2625295075ad0afe54812bc9fe00 in cloudstack's branch 
refs/heads/master from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=dec9133 ]

CLOUDSTACK-7184: xenheartbeat gets passed timeout and interval
(cherry picked from commit 4d065b9a3a336d59902c266202c1094509c007d2)

Conflicts:

plugins/hypervisors/xenserver/src/com/cloud/hypervisor/xenserver/discoverer/XcpServerDiscoverer.java

plugins/hypervisors/xenserver/src/com/cloud/hypervisor/xenserver/resource/CitrixResourceBase.java
server/src/com/cloud/configuration/Config.java
server/src/com/cloud/configuration/ConfigurationManagerImpl.java
server/src/com/cloud/resource/DiscovererBase.java


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138786#comment-14138786
 ] 

ASF subversion and git services commented on CLOUDSTACK-7184:
-

Commit 7f440854f7bcd41a1bd6581c0239cde2e98261b7 in cloudstack's branch 
refs/heads/master from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=7f44085 ]

CLOUDSTACK-7184 retry-wait loop config to deal with network glitches

(cherry picked from commit a29f954a269c992307f0410df88ca4ac7a0b82a0)

Conflicts:
engine/orchestration/src/com/cloud/agent/manager/DirectAgentAttache.java


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Assignee: Daan Hoogland
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)