[jira] [Commented] (CLOUDSTACK-7857) CitrixResourceBase wrongly calculates total memory on hosts with a lot of memory and large Dom0

2014-11-18 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216355#comment-14216355
 ] 

Joris van Lieshout commented on CLOUDSTACK-7857:


I'm not too familiar with memory overhead on other hypervisors. You would think 
the formula would be somewhat the same. I understand that ACS has to be as 
flexible as possible, but what if the logic for calculating free memory is 
moved to the hypervisor plugin, so the calculation can be hypervisor-specific 
while the outcome used by the generic processes stays the same? I'm not a 
developer, so my apologies if my comment does not make sense. 
In the end any hypervisor should be able to provide some information about 
available memory, either by calculation or with a direct metric. Perhaps this 
will always be something hypervisor-specific...?

> CitrixResourceBase wrongly calculates total memory on hosts with a lot of 
> memory and large Dom0
> ---
>
> Key: CLOUDSTACK-7857
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7857
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
>Reporter: Joris van Lieshout
>Priority: Blocker
>
> We have hosts with 256GB memory and 4GB dom0. During startup ACS calculates 
> available memory using this formula:
> CitrixResourceBase.java
>   protected void fillHostInfo
>   ram = (long) ((ram - dom0Ram - _xs_memory_used) * 
> _xs_virtualization_factor);
> In our situation:
>   ram = 274841497600
>   dom0Ram = 4269801472
>   _xs_memory_used = 128 * 1024 * 1024L = 134217728
>   _xs_virtualization_factor = 63.0/64.0 = 0,984375
>   (274841497600 - 4269801472 - 134217728) * 0,984375 = 266211892800
> This is in fact not the actual amount of memory available for instances. The 
> difference in our situation is a little less than 1GB. On this particular 
> hypervisor Dom0+Xen uses about 9GB.
> As the comment above the definition of XsMemoryUsed already stated, it's time 
> to review this logic: 
> "//Hypervisor specific params with generic value, may need to be overridden 
> for specific versions"
> The effect of this bug is that when you put a hypervisor in maintenance it 
> might try to move instances (usually small instances (<1GB)) to a host that 
> in fact does not have enough free memory.
> This exception is thrown:
> ERROR [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-09aca6e9 
> work-8981) Terminating HAWork[8981-Migration-4482-Running-Migrating]
> com.cloud.utils.exception.CloudRuntimeException: Unable to migrate due to 
> Catch Exception com.cloud.utils.exception.CloudRuntimeException: Migration 
> failed due to com.cloud.utils.exception.CloudRuntim
> eException: Unable to migrate VM(r-4482-VM) from 
> host(6805d06c-4d5b-4438-a245-7915e93041d9) due to Task failed! Task record:   
>   uuid: 645b63c8-1426-b412-7b6a-13d61ee7ab2e
>nameLabel: Async.VM.pool_migrate
>  nameDescription: 
>allowedOperations: []
>currentOperations: {}
>  created: Thu Nov 06 13:44:14 CET 2014
> finished: Thu Nov 06 13:44:14 CET 2014
>   status: failure
>   residentOn: com.xensource.xenapi.Host@b42882c6
> progress: 1.0
> type: 
>   result: 
>errorInfo: [HOST_NOT_ENOUGH_FREE_MEMORY, 272629760, 263131136]
>  otherConfig: {}
>subtaskOf: com.xensource.xenapi.Task@aaf13f6f
> subtasks: []
> at 
> com.cloud.vm.VirtualMachineManagerImpl.migrate(VirtualMachineManagerImpl.java:1840)
> at 
> com.cloud.vm.VirtualMachineManagerImpl.migrateAway(VirtualMachineManagerImpl.java:2214)
> at 
> com.cloud.ha.HighAvailabilityManagerImpl.migrate(HighAvailabilityManagerImpl.java:610)
> at 
> com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.runWithContext(HighAvailabilityManagerImpl.java:865)
> at 
> com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.access$000(HighAvailabilityManagerImpl.java:822)
> at 
> com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread$1.run(HighAvailabilityManagerImpl.java:834)
> at 
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
> at 
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
> at 
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
> at 
> com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.run(HighAvailabilityManagerImpl.java:831)




[jira] [Commented] (CLOUDSTACK-7857) CitrixResourceBase wrongly calculates total memory on hosts with a lot of memory and large Dom0

2014-11-17 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214498#comment-14214498
 ] 

Joris van Lieshout commented on CLOUDSTACK-7857:


Hi Anthony,

I agree that there is no reliable way to do this beforehand, so isn't it 
better to do it whenever an instance is started on/migrated to a host, or to 
recalculate the free memory metric every couple of minutes (for instance as 
part of the stats collection cycle)? The formula that XenCenter uses for this 
seems pretty easy and spot on.

This would also reduce the number of times a retry mechanism has to kick in 
for other actions as well. On that note, the retry mechanism you are referring 
to does not seem to apply to HA-workers created by the process that puts a 
host in maintenance. It also feels to me that this is more of a workaround 
than a clean solution, mostly because host_free_mem can be recalculated 
quickly and easily when needed.

And concerning the allocation threshold: if I'm not mistaken, this does not 
apply to HA-workers, which are used whenever you put a host into maintenance. 
Additionally, the instance being migrated is already in the cluster, so this 
threshold is not hit during PrepareForMaintenance. 

> CitrixResourceBase wrongly calculates total memory on hosts with a lot of 
> memory and large Dom0
> ---
>
> Key: CLOUDSTACK-7857
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7857
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
>Reporter: Joris van Lieshout
>Priority: Blocker
>
> We have hosts with 256GB memory and 4GB dom0. During startup ACS calculates 
> available memory using this formula:
> CitrixResourceBase.java
>   protected void fillHostInfo
>   ram = (long) ((ram - dom0Ram - _xs_memory_used) * 
> _xs_virtualization_factor);
> In our situation:
>   ram = 274841497600
>   dom0Ram = 4269801472
>   _xs_memory_used = 128 * 1024 * 1024L = 134217728
>   _xs_virtualization_factor = 63.0/64.0 = 0,984375
>   (274841497600 - 4269801472 - 134217728) * 0,984375 = 266211892800
> This is in fact not the actual amount of memory available for instances. The 
> difference in our situation is a little less than 1GB. On this particular 
> hypervisor Dom0+Xen uses about 9GB.
> As the comment above the definition of XsMemoryUsed already stated, it's time 
> to review this logic: 
> "//Hypervisor specific params with generic value, may need to be overridden 
> for specific versions"
> The effect of this bug is that when you put a hypervisor in maintenance it 
> might try to move instances (usually small instances (<1GB)) to a host that 
> in fact does not have enough free memory.
> This exception is thrown:
> ERROR [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-09aca6e9 
> work-8981) Terminating HAWork[8981-Migration-4482-Running-Migrating]
> com.cloud.utils.exception.CloudRuntimeException: Unable to migrate due to 
> Catch Exception com.cloud.utils.exception.CloudRuntimeException: Migration 
> failed due to com.cloud.utils.exception.CloudRuntim
> eException: Unable to migrate VM(r-4482-VM) from 
> host(6805d06c-4d5b-4438-a245-7915e93041d9) due to Task failed! Task record:   
>   uuid: 645b63c8-1426-b412-7b6a-13d61ee7ab2e
>nameLabel: Async.VM.pool_migrate
>  nameDescription: 
>allowedOperations: []
>currentOperations: {}
>  created: Thu Nov 06 13:44:14 CET 2014
> finished: Thu Nov 06 13:44:14 CET 2014
>   status: failure
>   residentOn: com.xensource.xenapi.Host@b42882c6
> progress: 1.0
> type: 
>   result: 
>errorInfo: [HOST_NOT_ENOUGH_FREE_MEMORY, 272629760, 263131136]
>  otherConfig: {}
>subtaskOf: com.xensource.xenapi.Task@aaf13f6f
> subtasks: []
> at 
> com.cloud.vm.VirtualMachineManagerImpl.migrate(VirtualMachineManagerImpl.java:1840)
> at 
> com.cloud.vm.VirtualMachineManagerImpl.migrateAway(VirtualMachineManagerImpl.java:2214)
> at 
> com.cloud.ha.HighAvailabilityManagerImpl.migrate(HighAvailabilityManagerImpl.java:610)
> at 
> com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.runWithContext(HighAvailabilityManagerImpl.java:865)
> at 
> com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.access$000(HighAvailabilityManagerImpl.java:822)
> at 
> com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread$1.run(HighAvailabilityManagerImpl.java:834)
> at 
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1

[jira] [Commented] (CLOUDSTACK-7857) CitrixResourceBase wrongly calculates total memory on hosts with a lot of memory and large Dom0

2014-11-13 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209545#comment-14209545
 ] 

Joris van Lieshout commented on CLOUDSTACK-7857:


Hi Rohit, I did some digging around in the XenCenter code and found a possible 
solution there. But there is a challenge, I think: the overhead is dynamic, 
based on the instances running on the host, and, at the moment, ACS calculates 
this overhead only at host thread startup.

This is what I found in the XenCenter code:
https://github.com/xenserver/xenadmin/blob/a0d31920c5ac62eda9713228043a834ba7829986/XenModel/XenAPI-Extensions/Host.cs#L1071
==
public long xen_memory_calc
{
    get
    {
        if (!Helpers.MidnightRideOrGreater(Connection))
        {
            Host_metrics host_metrics = Connection.Resolve(this.metrics);
            if (host_metrics == null)
                return 0;
            long totalused = 0;
            foreach (VM vm in Connection.ResolveAll(resident_VMs))
            {
                VM_metrics vmMetrics = vm.Connection.Resolve(vm.metrics);
                if (vmMetrics != null)
                    totalused += vmMetrics.memory_actual;
            }
            return host_metrics.memory_total - totalused - host_metrics.memory_free;
        }
        long xen_mem = memory_overhead;
        foreach (VM vm in Connection.ResolveAll(resident_VMs))
        {
            xen_mem += vm.memory_overhead;
            if (vm.is_control_domain)
            {
                VM_metrics vmMetrics = vm.Connection.Resolve(vm.metrics);
                if (vmMetrics != null)
                    xen_mem += vmMetrics.memory_actual;
            }
        }
        return xen_mem;
    }
}
==
We can skip the first part because, if I'm not mistaken, ACS only supports 
XS5.6 and up (XS5.6 = MidnightRide).
In short the formula is something like this: xen_mem = host_memory_overhead + 
residentVMs_memory_overhead + dom0_memory_actual

Here is a list of xe commands that will get you the correct numbers to sum up:

host_memory_overhead:
xe host-list name-label=$HOSTNAME params=memory-overhead --minimal

residentVMs_memory_overhead:
xe vm-list resident-on=$(xe host-list name-label=$HOSTNAME --minimal) params=memory-overhead --minimal

dom0_memory_actual:
xe vm-list resident-on=$(xe host-list name-label=$HOSTNAME --minimal) is-control-domain=true params=memory-actual --minimal
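For illustration, here is a minimal sketch (not existing CloudStack code) of the 
post-MidnightRide branch of the XenCenter formula above, using the XenAPI Java 
bindings that CloudStack already uses. It assumes an authenticated Connection and 
the Host object are obtained elsewhere, and the method names are the generated 
getters for the fields used in the C# code:

==
import java.util.Set;

import com.xensource.xenapi.Connection;
import com.xensource.xenapi.Host;
import com.xensource.xenapi.VM;
import com.xensource.xenapi.VMMetrics;

public class XenMemoryOverheadSketch {

    // xen_mem = host.memory_overhead + sum(residentVM.memory_overhead)
    //           + memory_actual of the control domain (dom0)
    public static long calcXenMemory(Connection conn, Host host) throws Exception {
        long xenMem = host.getMemoryOverhead(conn);
        Set<VM> residentVMs = host.getResidentVMs(conn);
        for (VM vm : residentVMs) {
            xenMem += vm.getMemoryOverhead(conn);
            if (vm.getIsControlDomain(conn)) {
                VMMetrics vmMetrics = vm.getMetrics(conn);
                if (vmMetrics != null) {
                    xenMem += vmMetrics.getMemoryActual(conn);
                }
            }
        }
        return xenMem;
    }

    // Memory available for instances would then roughly be:
    //   host.getMetrics(conn).getMemoryTotal(conn) - calcXenMemory(conn, host)
}
==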

> CitrixResourceBase wrongly calculates total memory on hosts with a lot of 
> memory and large Dom0
> ---
>
> Key: CLOUDSTACK-7857
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7857
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
>Reporter: Joris van Lieshout
>Priority: Blocker
>
> We have hosts with 256GB memory and 4GB dom0. During startup ACS calculates 
> available memory using this formula:
> CitrixResourceBase.java
>   protected void fillHostInfo
>   ram = (long) ((ram - dom0Ram - _xs_memory_used) * 
> _xs_virtualization_factor);
> In our situation:
>   ram = 274841497600
>   dom0Ram = 4269801472
>   _xs_memory_used = 128 * 1024 * 1024L = 134217728
>   _xs_virtualization_factor = 63.0/64.0 = 0,984375
>   (274841497600 - 4269801472 - 134217728) * 0,984375 = 266211892800
> This is in fact not the actual amount of memory available for instances. The 
> difference in our situation is a little less than 1GB. On this particular 
> hypervisor Dom0+Xen uses about 9GB.
> As the comment above the definition of XsMemoryUsed already stated, it's time 
> to review this logic: 
> "//Hypervisor specific params with generic value, may need to be overridden 
> for specific versions"
> The effect of this bug is that when you put a hypervisor in maintenance it 
> might try to move instances (usually small instances (<1GB)) to a host that 
> in fact does not have enough free memory.
> This exception is thrown:
> ERROR [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-09aca6e9 
> work-8981) Terminating HAWork[8981-Migration-4482-Running-Migrating]
> com.cloud.utils.exception.CloudRuntimeException: Unable to migrate due to 
> Catch Exception com.cloud.utils.exception.CloudRuntimeException: Migration 
> failed due to com.cloud.utils.exception.CloudRuntim
> eExcep

[jira] [Commented] (CLOUDSTACK-7853) Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert

2014-11-10 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204744#comment-14204744
 ] 

Joris van Lieshout commented on CLOUDSTACK-7853:


What I just saw in our management log is that 3 minutes before the management 
server found the host behind on ping, the cluster was put in Unmanage mode 
(XenServer patching maintenance).

I also noticed that the AgentTaskPool threads that would do the investigation 
you mention were not triggered for this host. I don't know if this is because 
it was busy or because the agent thread was destroyed after the cluster was put 
in Unmanage. 

This is how I now believe it went:
1. Cluster put in Unmanage
2. Host rebooted (the brand of physical boxes we use needs at least 10 minutes 
to reboot)
3. Host got behind on ping in the meantime
4. Host state transitioned from Disconnected to Alert via PingTimeout
5. On the next AgentMonitor cycle a transition was attempted from Alert via 
PingTimeout; this is an unknown transition, so an exception was thrown
6. Host returned from reboot and the cluster was set to Manage again
7. Due to this invalid state transition the host never transitioned from Alert 
to anything else

> Hosts that are temporary Disconnected and get behind on ping (PingTimeout) 
> turn up in permanent state Alert
> ---
>
> Key: CLOUDSTACK-7853
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7853
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
>Reporter: Joris van Lieshout
>Priority: Critical
>
> If for some reason (I've been unable to determine why, but my suspicion is 
> that the management server is busy processing other agent requests and/or 
> xapi is temporarily unavailable) a host that is Disconnected gets behind on 
> ping (PingTimeout), it is transitioned to a permanent state of Alert.
> INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-9551e174) Found the 
> following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Ping timeout for host 421, 
> do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Transition:[Resource state 
> = Enabled, Agent event = PingTimeout, Host id = 421, name = xx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Agent status update: [id = 
> 421; name = xx1; old status = Disconnected; event = PingTimeout; new 
> status = Alert; old update count = 111; new update count = 112]
> / next cycle / -
> INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Found the 
> following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Ping timeout for host 421, 
> do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Transition:[Resource state 
> = Enabled, Agent event = PingTimeout, Host id = 421, name = xx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Cannot transit agent 
> status with event PingTimeout for host 421, name=xx1, mangement server id 
> is 345052370017
> ERROR [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Caught the 
> following exception: 
> com.cloud.utils.exception.CloudRuntimeException: Cannot transit agent status 
> with event PingTimeout for host 421, mangement server id is 
> 345052370017,Unable to transition to a new state from Alert via PingTimeout
> at 
> com.cloud.agent.manager.AgentManagerImpl.agentStatusTransitTo(AgentManagerImpl.java:1334)
> at 
> com.cloud.agent.manager.AgentManagerImpl.disconnectAgent(AgentManagerImpl.java:1349)
> at 
> com.cloud.agent.manager.AgentManagerImpl.disconnectInternal(AgentManagerImpl.java:1378)
> at 
> com.cloud.agent.manager.AgentManagerImpl.disconnectWithInvestigation(AgentManagerImpl.java:1384)
> at 
> com.cloud.agent.manager.AgentManagerImpl$MonitorTask.runInContext(AgentManagerImpl.java:1466)
> at 
> org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
> at 
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
> at 
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
> at 
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
> at 
> org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at 
> java.util.concurrent.FutureTask$Sync.in

[jira] [Created] (CLOUDSTACK-7857) CitrixResourceBase wrongly calculates total memory on hosts with a lot of memory and large Dom0

2014-11-06 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-7857:
--

 Summary: CitrixResourceBase wrongly calculates total memory on 
hosts with a lot of memory and large Dom0
 Key: CLOUDSTACK-7857
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7857
 Project: CloudStack
  Issue Type: Bug
  Security Level: Public (Anyone can view this level - this is the default.)
Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
Reporter: Joris van Lieshout
Priority: Blocker


We have hosts with 256GB memory and 4GB dom0. During startup ACS calculates 
available memory using this formula:
CitrixResourceBase.java
protected void fillHostInfo
ram = (long) ((ram - dom0Ram - _xs_memory_used) * 
_xs_virtualization_factor);
In our situation:
ram = 274841497600
dom0Ram = 4269801472
_xs_memory_used = 128 * 1024 * 1024L = 134217728
_xs_virtualization_factor = 63.0/64.0 = 0,984375
(274841497600 - 4269801472 - 134217728) * 0,984375 = 266211892800

This is in fact not the actual amount of memory available for instances. The 
difference in our situation is a little less than 1GB. On this particular 
hypervisor Dom0+Xen uses about 9GB.
As the comment above the definition of XsMemoryUsed already stated, it's time 
to review this logic: 
"//Hypervisor specific params with generic value, may need to be overridden for 
specific versions"

The effect of this bug is that when you put a hypervisor in maintenance it 
might try to move instances (usually small instances (<1GB)) to a host that in 
fact does not have enough free memory.
This exception is thrown:

ERROR [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-09aca6e9 work-8981) 
Terminating HAWork[8981-Migration-4482-Running-Migrating]
com.cloud.utils.exception.CloudRuntimeException: Unable to migrate due to Catch 
Exception com.cloud.utils.exception.CloudRuntimeException: Migration failed due 
to com.cloud.utils.exception.CloudRuntim
eException: Unable to migrate VM(r-4482-VM) from 
host(6805d06c-4d5b-4438-a245-7915e93041d9) due to Task failed! Task record:
    uuid: 645b63c8-1426-b412-7b6a-13d61ee7ab2e
    nameLabel: Async.VM.pool_migrate
    nameDescription: 
    allowedOperations: []
    currentOperations: {}
    created: Thu Nov 06 13:44:14 CET 2014
    finished: Thu Nov 06 13:44:14 CET 2014
    status: failure
    residentOn: com.xensource.xenapi.Host@b42882c6
    progress: 1.0
    type: 
    result: 
    errorInfo: [HOST_NOT_ENOUGH_FREE_MEMORY, 272629760, 263131136]
    otherConfig: {}
    subtaskOf: com.xensource.xenapi.Task@aaf13f6f
    subtasks: []

at 
com.cloud.vm.VirtualMachineManagerImpl.migrate(VirtualMachineManagerImpl.java:1840)
at 
com.cloud.vm.VirtualMachineManagerImpl.migrateAway(VirtualMachineManagerImpl.java:2214)
at 
com.cloud.ha.HighAvailabilityManagerImpl.migrate(HighAvailabilityManagerImpl.java:610)
at 
com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.runWithContext(HighAvailabilityManagerImpl.java:865)
at 
com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.access$000(HighAvailabilityManagerImpl.java:822)
at 
com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread$1.run(HighAvailabilityManagerImpl.java:834)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at 
com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.run(HighAvailabilityManagerImpl.java:831)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CLOUDSTACK-7853) Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert

2014-11-06 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-7853:
--

 Summary: Hosts that are temporary Disconnected and get behind on 
ping (PingTimeout) turn up in permanent state Alert
 Key: CLOUDSTACK-7853
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7853
 Project: CloudStack
  Issue Type: Bug
  Security Level: Public (Anyone can view this level - this is the default.)
Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
Reporter: Joris van Lieshout
Priority: Critical


If for some reason (I've been unable to determine why, but my suspicion is that 
the management server is busy processing other agent requests and/or xapi is 
temporarily unavailable) a host that is Disconnected gets behind on ping 
(PingTimeout), it is transitioned to a permanent state of Alert.

INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-9551e174) Found the 
following agents behind on ping: [421, 427, 425]
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Ping timeout for host 421, 
do invstigation
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Transition:[Resource state = 
Enabled, Agent event = PingTimeout, Host id = 421, name = xx1]
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Agent status update: [id = 
421; name = xx1; old status = Disconnected; event = PingTimeout; new status 
= Alert; old update count = 111; new update count = 112]

/ next cycle / -

INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Found the 
following agents behind on ping: [421, 427, 425]
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Ping timeout for host 421, 
do invstigation
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Transition:[Resource state = 
Enabled, Agent event = PingTimeout, Host id = 421, name = xx1]
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Cannot transit agent status 
with event PingTimeout for host 421, name=xx1, mangement server id is 
345052370017
ERROR [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Caught the 
following exception: 
com.cloud.utils.exception.CloudRuntimeException: Cannot transit agent status 
with event PingTimeout for host 421, mangement server id is 345052370017,Unable 
to transition to a new state from Alert via PingTimeout
at 
com.cloud.agent.manager.AgentManagerImpl.agentStatusTransitTo(AgentManagerImpl.java:1334)
at 
com.cloud.agent.manager.AgentManagerImpl.disconnectAgent(AgentManagerImpl.java:1349)
at 
com.cloud.agent.manager.AgentManagerImpl.disconnectInternal(AgentManagerImpl.java:1378)
at 
com.cloud.agent.manager.AgentManagerImpl.disconnectWithInvestigation(AgentManagerImpl.java:1384)
at 
com.cloud.agent.manager.AgentManagerImpl$MonitorTask.runInContext(AgentManagerImpl.java:1466)
at 
org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at 
org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at 
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)

I think the bug occurs because there is no valid state transition from Alert 
via PingTimeout to something recoverable.

Status.java
s_fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
s_fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
s_fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
s_fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
s_fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
s_fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
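For illustration, one hypothetical way to make this transition defined (this is 
not the upstream fix, and whether Alert is the right target state would need 
review) would be to add a PingTimeout transition for the Alert state:

// Hypothetical addition to Status.java: keep a host that is already in Alert
// in Alert on a ping timeout instead of throwing on an undefined transition.
s_fsm.addTransition(Status.Alert, Event.PingTimeout, Status.Alert);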

As a workaround to get out of this situation we put the cluster in Unmanage, 
wait 10 minutes and then put the cluster back in Manage.



--
This message was sent by Atlassian

[jira] [Commented] (CLOUDSTACK-7839) Unable to live migrate an instance to another host in a cluster from which the template has been deleted

2014-11-04 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196092#comment-14196092
 ] 

Joris van Lieshout commented on CLOUDSTACK-7839:


Additional information:

The public boolean "storagePoolHasEnoughSpace" in StorageManagerImpl.java has a 
loop that goes through all volumes. The second if statement in the loop is 
where the NullPointerException is thrown, because _templateDao.findById returns 
null for the removed template:

for (Volume volume : volumes) {
    if (volume.getTemplateId() != null) {
        VMTemplateVO tmpl = _templateDao.findById(volume.getTemplateId());
        if (tmpl.getFormat() != ImageFormat.ISO) {
            allocatedSizeWithtemplate = _capacityMgr.getAllocatedPoolCapacity(poolVO, tmpl);
        }
    }
    if (volume.getState() != Volume.State.Ready) {
        totalAskingSize = totalAskingSize + getVolumeSizeIncludingHvSsReserve(volume, pool);
    }
}

This SQL statement shows that the removed field of vm_template is not null, 
causing findById to return nothing:
select vm_template.name, vm_template.removed from vm_instance join vm_template 
on vm_instance.vm_template_id=vm_template.id where vm_instance.name like 
'%testinstancefromtmpl1%';

vm_template.name, vm_template.removed
'testinstancetmp','2014-11-04 09:21:34'
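For illustration, one possible guard (not necessarily the fix applied upstream) 
would be to skip the template capacity lookup when findById resolves nothing:

// Hypothetical null guard inside the volume loop of storagePoolHasEnoughSpace():
// a removed template makes findById return null, so don't dereference it.
if (volume.getTemplateId() != null) {
    VMTemplateVO tmpl = _templateDao.findById(volume.getTemplateId());
    if (tmpl != null && tmpl.getFormat() != ImageFormat.ISO) {
        allocatedSizeWithtemplate = _capacityMgr.getAllocatedPoolCapacity(poolVO, tmpl);
    }
}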

> Unable to live migrate an instance to another host in a cluster from which 
> the template has been deleted
> 
>
> Key: CLOUDSTACK-7839
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7839
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Template
>Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
>Reporter: Joris van Lieshout
>Priority: Critical
>
> ACS throws a null pointer exception when you try to live migrate an instance 
> to another host in a cluster and the template of that instance has been 
> deleted.
> I have pasted the exception below.
> Steps to reproduce the issue:
> 1. create an instance from iso
> 2. stop the instance
> 3. create a template from the root volume
> 4. create a new instance from that template
> 5. leave the instance running
> 6. delete the template
> 7. try to live migrate the instance to another host in the cluster
> The migrate button in the web interface will not respond.
> The exception below can be found in the management-server log 
> 2014-11-04 14:08:45,509 ERROR [cloud.api.ApiServer] 
> (TP-Processor49:ctx-35286d62 ctx-3de77f98) unhandled exception executing api 
> command: findHostsForMigration
> java.lang.NullPointerException
> at 
> com.cloud.storage.StorageManagerImpl.storagePoolHasEnoughSpace(StorageManagerImpl.java:1561)
> at 
> org.apache.cloudstack.storage.allocator.AbstractStoragePoolAllocator.filter(AbstractStoragePoolAllocator.java:199)
> at 
> org.apache.cloudstack.storage.allocator.ClusterScopeStoragePoolAllocator.select(ClusterScopeStoragePoolAllocator.java:110)
> at 
> org.apache.cloudstack.storage.allocator.AbstractStoragePoolAllocator.allocateToPool(AbstractStoragePoolAllocator.java:109)
> at 
> com.cloud.server.ManagementServerImpl.findSuitablePoolsForVolumes(ManagementServerImpl.java:1250)
> at 
> com.cloud.server.ManagementServerImpl.listHostsForMigrationOfVM(ManagementServerImpl.java:1150)
> at sun.reflect.GeneratedMethodAccessor643.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:622)
> at 
> org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
> at 
> org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
> at 
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
> at 
> org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
> at 
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
> at 
> org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
> at com.sun.proxy.$Proxy193.listHostsForMigrationOfVM(Unknown Source)
> at 
> org.apache.cloudstack.api.command.admin.host.FindHostsForMigrationCmd.execute(FindHostsForMigrationCmd.java:75)
> at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:161)
> at com.cloud.api.ApiServer.queueCommand(Ap

[jira] [Created] (CLOUDSTACK-7839) Unable to live migrate an instance to another host in a cluster from which the template has been deleted

2014-11-04 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-7839:
--

 Summary: Unable to live migrate an instance to another host in a 
cluster from which the template has been deleted
 Key: CLOUDSTACK-7839
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7839
 Project: CloudStack
  Issue Type: Bug
  Security Level: Public (Anyone can view this level - this is the default.)
  Components: Template
Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
Reporter: Joris van Lieshout
Priority: Critical


ACS throws a null pointer exception when you try to live migrate an instance 
to another host in a cluster and the template of that instance has been deleted.
I have pasted the exception below.

Steps to reproduce the issue:
1. create an instance from iso
2. stop the instance
3. create a template from the root volume
4. create a new instance from that template
5. leave the instance running
6. delete the template
7. try to live migrate the instance to another host in the cluster
The migrate button in the web interface will not respond.
The exception below can be found in the management-server log:


2014-11-04 14:08:45,509 ERROR [cloud.api.ApiServer] 
(TP-Processor49:ctx-35286d62 ctx-3de77f98) unhandled exception executing api 
command: findHostsForMigration
java.lang.NullPointerException
at 
com.cloud.storage.StorageManagerImpl.storagePoolHasEnoughSpace(StorageManagerImpl.java:1561)
at 
org.apache.cloudstack.storage.allocator.AbstractStoragePoolAllocator.filter(AbstractStoragePoolAllocator.java:199)
at 
org.apache.cloudstack.storage.allocator.ClusterScopeStoragePoolAllocator.select(ClusterScopeStoragePoolAllocator.java:110)
at 
org.apache.cloudstack.storage.allocator.AbstractStoragePoolAllocator.allocateToPool(AbstractStoragePoolAllocator.java:109)
at 
com.cloud.server.ManagementServerImpl.findSuitablePoolsForVolumes(ManagementServerImpl.java:1250)
at 
com.cloud.server.ManagementServerImpl.listHostsForMigrationOfVM(ManagementServerImpl.java:1150)
at sun.reflect.GeneratedMethodAccessor643.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at 
org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
at 
org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
at 
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
at 
org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
at 
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at 
org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at com.sun.proxy.$Proxy193.listHostsForMigrationOfVM(Unknown Source)
at 
org.apache.cloudstack.api.command.admin.host.FindHostsForMigrationCmd.execute(FindHostsForMigrationCmd.java:75)
at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:161)
at com.cloud.api.ApiServer.queueCommand(ApiServer.java:531)
at com.cloud.api.ApiServer.handleRequest(ApiServer.java:374)
at com.cloud.api.ApiServlet.processRequestInContext(ApiServlet.java:323)
at com.cloud.api.ApiServlet.access$000(ApiServlet.java:53)
at com.cloud.api.ApiServlet$1.run(ApiServlet.java:115)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at com.cloud.api.ApiServlet.processRequest(ApiServlet.java:112)
at com.cloud.api.ApiServlet.doGet(ApiServlet.java:74)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.valves.AccessLogValve.invoke(Acce

[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-09-08 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125514#comment-14125514
 ] 

Joris van Lieshout commented on CLOUDSTACK-7184:


Hi, I am currently out of office and will be back Tuesday the 23rd of 
September. During this time I will have limited access to e-mail and might not 
be able to take your call. For urgent matters regarding ASR please contact 
int-...@schubergphilis.com instead. For Cloud IaaS matters please contact 
int-cl...@schubergphilis.com.

Kind regards,
Joris van Lieshout


Schuberg Philis
schubergphilis.com

+31207506672
+31651428188


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

2014-08-21 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105718#comment-14105718
 ] 

Joris van Lieshout commented on CLOUDSTACK-7184:


Hi, I am currently out of office and will be back Wednesday the 27th of August. 
During this time I will have limited access to e-mail and might not be able to 
take your call. For urgent matters regarding ASR please contact 
int-...@schubergphilis.com instead. For other urgent matters please contact one 
of my colleagues.

Kind regards,
Joris van Lieshout


Schuberg Philis
schubergphilis.com

+31207506672
+31651428188


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> 
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Hypervisor Controller, Management Server, XenServer
>Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>Reporter: Remi Bergsma
>Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up: 2014-07-25  05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-7319) Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to copy incremental snapshots

2014-08-12 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093971#comment-14093971
 ] 

Joris van Lieshout commented on CLOUDSTACK-7319:


We believe Hot-fix 4 for XS62 sp1 contains a similar fix but for the sparse dd 
process used for the first copy of a chain.

http://support.citrix.com/article/CTX140417

== begin quote ==
Copying a virtual disk between SRs uses the unbuffered I/O to avoid polluting 
the pagecache in the Control Domain (dom0). This reduces the dom0 vCPU overhead 
and allows the pagecache to work more effectively for other operations.
== end quote ==

> Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to 
> copy incremental snapshots
> ---
>
> Key: CLOUDSTACK-7319
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7319
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Snapshot, XenServer
>Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, Future, 4.2.1, 
> 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1
>Reporter: Joris van Lieshout
>Priority: Critical
>
> We noticed that the dd process was way too aggressive on Dom0, causing all 
> kinds of problems on a XenServer with medium workloads. 
> ACS uses the dd command to copy incremental snapshots to secondary storage. 
> This process is too heavy on Dom0 resources, impacts DomU performance, and can 
> even lead to domain freezes (including Dom0) of more than a minute. We've 
> found that this is because the Dom0 kernel caches the read and write 
> operations of dd.
> Some of the issues we have seen as a consequence of this are:
> - DomU performance degradation/freezes
> - OVS freezing and not forwarding any traffic, including LACPDUs, resulting 
> in the bond going down
> - keepalived heartbeat packets between RRVMs not being sent/received, 
> resulting in flapping RRVM master state
> - breaking snapshot copy processes
> - the XenServer heartbeat script reaching its timeout and fencing the server
> - poolmaster connection loss
> - ACS marking the host as down and fencing the instances even though they are 
> still running on the original host, resulting in the same instance running on 
> two hosts in one cluster
> - VHD corruption as a result of some of the issues mentioned above
> We've developed a patch for the XenServer script 
> /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input 
> and output files (iflag=direct oflag=direct).
> Our tests have shown that Dom0 load during snapshot copy is much lower.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CLOUDSTACK-7319) Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to copy incremental snapshots

2014-08-12 Thread Joris van Lieshout (JIRA)

 [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris van Lieshout updated CLOUDSTACK-7319:
---

Description: 
We noticed that the dd process was way too aggressive on Dom0, causing all 
kinds of problems on a XenServer with medium workloads. 
ACS uses the dd command to copy incremental snapshots to secondary storage. 
This process is too heavy on Dom0 resources, impacts DomU performance, and can 
even lead to domain freezes (including Dom0) of more than a minute. We've found 
that this is because the Dom0 kernel caches the read and write operations of dd.
Some of the issues we have seen as a consequence of this are:
- DomU performance degradation/freezes
- OVS freezing and not forwarding any traffic, including LACPDUs, resulting in 
the bond going down
- keepalived heartbeat packets between RRVMs not being sent/received, resulting 
in flapping RRVM master state
- breaking snapshot copy processes
- the XenServer heartbeat script reaching its timeout and fencing the server
- poolmaster connection loss
- ACS marking the host as down and fencing the instances even though they are 
still running on the original host, resulting in the same instance running on 
two hosts in one cluster
- VHD corruption as a result of some of the issues mentioned above
We've developed a patch for the XenServer script 
/etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input and 
output files (iflag=direct oflag=direct).
Our tests have shown that Dom0 load during snapshot copy is much lower.

  was:
We noticed that the dd process was way too aggressive on Dom0, causing all 
kinds of problems on a XenServer with medium workloads. 
ACS uses the dd command to copy incremental snapshots to secondary storage. 
This process is too heavy on Dom0 resources, impacts DomU performance, and can 
even lead to domain freezes (including Dom0) of more than a minute. We've found 
that this is because the Dom0 kernel caches the read and write operations of dd.
We've developed a patch for the XenServer script 
/etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input and 
output files.
Our tests have shown that Dom0 load during snapshot copy is much lower. I will 
upload the patch on review.


> Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to 
> copy incremental snapshots
> ---
>
> Key: CLOUDSTACK-7319
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7319
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Snapshot, XenServer
>Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, Future, 4.2.1, 
> 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1
>Reporter: Joris van Lieshout
>Priority: Critical
>
> We noticed that the dd process was way too aggressive on Dom0, causing all 
> kinds of problems on a XenServer with medium workloads. 
> ACS uses the dd command to copy incremental snapshots to secondary storage. 
> This process is too heavy on Dom0 resources, impacts DomU performance, and can 
> even lead to domain freezes (including Dom0) of more than a minute. We've 
> found that this is because the Dom0 kernel caches the read and write 
> operations of dd.
> Some of the issues we have seen as a consequence of this are:
> - DomU performance degradation/freezes
> - OVS freezing and not forwarding any traffic, including LACPDUs, resulting 
> in the bond going down
> - keepalived heartbeat packets between RRVMs not being sent/received, 
> resulting in flapping RRVM master state
> - breaking snapshot copy processes
> - the XenServer heartbeat script reaching its timeout and fencing the server
> - poolmaster connection loss
> - ACS marking the host as down and fencing the instances even though they are 
> still running on the original host, resulting in the same instance running on 
> two hosts in one cluster
> - VHD corruption as a result of some of the issues mentioned above
> We've developed a patch for the XenServer script 
> /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input 
> and output files (iflag=direct oflag=direct).
> Our tests have shown that Dom0 load during snapshot copy is much lower.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CLOUDSTACK-7319) Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to copy incremental snapshots

2014-08-12 Thread Joris van Lieshout (JIRA)

 [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris van Lieshout updated CLOUDSTACK-7319:
---

Summary: Copy Snapshot command too heavy on XenServer Dom0 resources when 
using dd to copy incremental snapshots  (was: Copy Snapshot command to heavy on 
XenServer Dom0 resources when using dd to copy incremental snapshots)

> Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to 
> copy incremental snapshots
> ---
>
> Key: CLOUDSTACK-7319
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7319
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Snapshot, XenServer
>Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, Future, 4.2.1, 
> 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1
>Reporter: Joris van Lieshout
>Priority: Critical
>
> We noticed that the dd process was way too aggressive on Dom0, causing all 
> kinds of problems on a XenServer with medium workloads. 
> ACS uses the dd command to copy incremental snapshots to secondary storage. 
> This process is too heavy on Dom0 resources, impacts DomU performance, and can 
> even lead to domain freezes (including Dom0) of more than a minute. We've 
> found that this is because the Dom0 kernel caches the read and write 
> operations of dd.
> We've developed a patch for the XenServer script 
> /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input 
> and output files.
> Our tests have shown that Dom0 load during snapshot copy is much lower. I will 
> upload the patch on review.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-7319) Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to copy incremental snapshots

2014-08-12 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093962#comment-14093962
 ] 

Joris van Lieshout commented on CLOUDSTACK-7319:


review https://reviews.apache.org/r/24598/ 

> Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to 
> copy incremental snapshots
> ---
>
> Key: CLOUDSTACK-7319
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7319
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Snapshot, XenServer
>Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, Future, 4.2.1, 
> 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1
>Reporter: Joris van Lieshout
>Priority: Critical
>
> We noticed that the dd process was way too aggressive on Dom0, causing all 
> kinds of problems on a XenServer with medium workloads. 
> ACS uses the dd command to copy incremental snapshots to secondary storage. 
> This process is too heavy on Dom0 resources, impacts DomU performance, and can 
> even lead to domain freezes (including Dom0) of more than a minute. We've 
> found that this is because the Dom0 kernel caches the read and write 
> operations of dd.
> Some of the issues we have seen as a consequence of this are:
> - DomU performance degradation/freezes
> - OVS freezing and not forwarding any traffic, including LACPDUs, resulting 
> in the bond going down
> - keepalived heartbeat packets between RRVMs not being sent/received, 
> resulting in flapping RRVM master state
> - breaking snapshot copy processes
> - the XenServer heartbeat script reaching its timeout and fencing the server
> - poolmaster connection loss
> - ACS marking the host as down and fencing the instances even though they are 
> still running on the original host, resulting in the same instance running on 
> two hosts in one cluster
> - VHD corruption as a result of some of the issues mentioned above
> We've developed a patch for the XenServer script 
> /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input 
> and output files (iflag=direct oflag=direct).
> Our tests have shown that Dom0 load during snapshot copy is much lower.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (CLOUDSTACK-7319) Copy Snapshot command to heavy on XenServer Dom0 resources when using dd to copy incremental snapshots

2014-08-12 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-7319:
--

 Summary: Copy Snapshot command to heavy on XenServer Dom0 
resources when using dd to copy incremental snapshots
 Key: CLOUDSTACK-7319
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7319
 Project: CloudStack
  Issue Type: Bug
  Security Level: Public (Anyone can view this level - this is the default.)
  Components: Snapshot, XenServer
Affects Versions: 4.2.0, 4.1.0, 4.0.2, 4.0.1, 4.0.0, 4.1.1, Future, 4.2.1, 
4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1
Reporter: Joris van Lieshout
Priority: Critical


We noticed that the dd process was way too aggressive on Dom0, causing all 
kinds of problems on a XenServer with medium workloads. 
ACS uses the dd command to copy incremental snapshots to secondary storage. 
This process is too heavy on Dom0 resources, impacts DomU performance, and can 
even lead to domain freezes (including Dom0) of more than a minute. We've found 
that this is because the Dom0 kernel caches the read and write operations of dd.
We've developed a patch for the XenServer script 
/etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input and 
output files.
Our tests have shown that Dom0 load during snapshot copy is much lower. I will 
upload the patch on review.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (CLOUDSTACK-7103) Disable in-band management of OVS on cloud_link_local_network on XenServer

2014-07-14 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-7103:
--

 Summary: Disable in-band management of OVS on 
cloud_link_local_network on XenServer
 Key: CLOUDSTACK-7103
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7103
 Project: CloudStack
  Issue Type: Improvement
  Security Level: Public (Anyone can view this level - this is the default.)
  Components: XenServer
Affects Versions: 4.2.0, 4.1.0, 4.0.2, 4.0.1, 4.0.0, 4.1.1, 4.2.1, 4.3.0, 
4.4.0, 4.5.0, 4.3.1
Reporter: Joris van Lieshout


By default XenServer uses Open vSwitch and has in-band management enabled on any 
new network. This is not desirable for the cloud_link_local_network. It can be 
disabled by setting the network's other-config parameter 
vswitch-disable-in-band to true.
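For illustration, a minimal sketch of setting that flag through the XenAPI Java 
bindings; the authenticated Connection and the resolved link-local Network are 
assumed to come from elsewhere, and an equivalent xe network-param-set call 
would work just as well:

import com.xensource.xenapi.Connection;
import com.xensource.xenapi.Network;

public class DisableInBandSketch {
    public static void disableInBandManagement(Connection conn, Network linkLocalNetwork) throws Exception {
        // add_to_other_config fails on a duplicate key, so clear any existing value first
        linkLocalNetwork.removeFromOtherConfig(conn, "vswitch-disable-in-band");
        linkLocalNetwork.addToOtherConfig(conn, "vswitch-disable-in-band", "true");
    }
}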



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-6308) when executing createNetwork as ROOT for a subdomain/account it checks for network overlap in all subdomains/accounts

2014-05-23 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006931#comment-14006931
 ] 

Joris van Lieshout commented on CLOUDSTACK-6308:


I've tried reproducing this issue in 4.3 but have not been able to, so it seems 
resolved. I'll close this bug for now and reopen if needed.

> when executing createNetwork as ROOT for a subdomain/account it checks for 
> network overlap in all subdomains/accounts
> -
>
> Key: CLOUDSTACK-6308
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6308
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: API
>Affects Versions: 4.2.0, Future, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1
>Reporter: Joris van Lieshout
> Fix For: 4.3.0
>
>
> When executing createNetwork with an account from the ROOT domain with a 
> domainid and account specified of a subdomain/account the error below is 
> thrown when the ip range overlaps with a network of another subdomain.
> errorCode: 431, errorText:The IP range has already been added with gateway 
> 192.168.150.1 ,and netmask 255.255.255.0, Please specify the gateway/netmask 
> if you want to extend ip range
> scenario:
> ROOT/ACME has network 192.168.150.0/24 gw 192.168.150.1
> exec createNetwork for ROOT/MEGACORP account johndoe network 192.168.150.0/24 
> gw 192.168.150.1 with ROOT domain credentials.
> workaround for now:
> execute createNetwork with credentials from domain MEGACORP and account 
> johndoe.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (CLOUDSTACK-6308) when executing createNetwork as ROOT for a subdomain/account it checks for network overlap in all subdomains/accounts

2014-05-23 Thread Joris van Lieshout (JIRA)

 [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris van Lieshout closed CLOUDSTACK-6308.
--

   Resolution: Cannot Reproduce
Fix Version/s: 4.3.0

> when executing createNetwork as ROOT for a subdomain/account it checks for 
> network overlap in all subdomains/accounts
> -
>
> Key: CLOUDSTACK-6308
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6308
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: API
>Affects Versions: 4.2.0, Future, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1
>Reporter: Joris van Lieshout
> Fix For: 4.3.0
>
>
> When executing createNetwork with an account from the ROOT domain with a 
> domainid and account specified of a subdomain/account the error below is 
> thrown when the ip range overlaps with a network of another subdomain.
> errorCode: 431, errorText:The IP range has already been added with gateway 
> 192.168.150.1 ,and netmask 255.255.255.0, Please specify the gateway/netmask 
> if you want to extend ip range
> scenario:
> ROOT/ACME has network 192.168.150.0/24 gw 192.168.150.1
> exec createNetwork for ROOT/MEGACORP account johndoe network 192.168.150.0/24 
> gw 192.168.150.1 with ROOT domain credentials.
> workaround for now:
> execute createNetwork with credentials from domain MEGACORP and account 
> johndoe.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CLOUDSTACK-6308) when executing createNetwork as ROOT for a subdomain/account it checks for network overlap in all subdomains/accounts

2014-05-22 Thread Joris van Lieshout (JIRA)

 [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris van Lieshout updated CLOUDSTACK-6308:
---

Priority: Major  (was: Critical)

> when executing createNetwork as ROOT for a subdomain/account it checks for 
> network overlap in all subdomains/accounts
> -
>
> Key: CLOUDSTACK-6308
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6308
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: API
>Affects Versions: 4.2.0, Future, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1
>Reporter: Joris van Lieshout
>
> When executing createNetwork with an account from the ROOT domain with a 
> domainid and account specified of a subdomain/account the error below is 
> thrown when the ip range overlaps with a network of another subdomain.
> errorCode: 431, errorText:The IP range has already been added with gateway 
> 192.168.150.1 ,and netmask 255.255.255.0, Please specify the gateway/netmask 
> if you want to extend ip range
> scenario:
> ROOT/ACME has network 192.168.150.0/24 gw 192.168.150.1
> exec createNetwork for ROOT/MEGACORP account johndoe network 192.168.150.0/24 
> gw 192.168.150.1 with ROOT domain credentials.
> workaround for now:
> execute createNetwork with credentials from domain MEGACORP and account 
> johndoe.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-6308) when executing createNetwork as ROOT for a subdomain/account it checks for network overlap in all subdomains/accounts

2014-05-22 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006920#comment-14006920
 ] 

Joris van Lieshout commented on CLOUDSTACK-6308:


This issue still exists and as far as I know has not yet been fixed. I will 
poke the dev list to see if anyone can have a look.

> when executing createNetwork as ROOT for a subdomain/account it checks for 
> network overlap in all subdomains/accounts
> -
>
> Key: CLOUDSTACK-6308
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6308
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: API
>Affects Versions: 4.2.0, Future, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1
>Reporter: Joris van Lieshout
>Priority: Critical
>
> When executing createNetwork with an account from the ROOT domain with a 
> domainid and account specified of a subdomain/account the error below is 
> thrown when the ip range overlaps with a network of another subdomain.
> errorCode: 431, errorText:The IP range has already been added with gateway 
> 192.168.150.1 ,and netmask 255.255.255.0, Please specify the gateway/netmask 
> if you want to extend ip range
> scenario:
> ROOT/ACME has network 192.168.150.0/24 gw 192.168.150.1
> exec createNetwork for ROOT/MEGACORP account johndoe network 192.168.150.0/24 
> gw 192.168.150.1 with ROOT domain credentials.
> workaround for now:
> execute createNetwork with credentials from domain MEGACORP and account 
> johndoe.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (CLOUDSTACK-6751) conntrackd stats logging is enabled by default and fills up /var

2014-05-22 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-6751:
--

 Summary: conntrackd stats logging is enabled by default and fills 
up /var
 Key: CLOUDSTACK-6751
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6751
 Project: CloudStack
  Issue Type: Bug
  Security Level: Public (Anyone can view this level - this is the default.)
  Components: SystemVM
Affects Versions: 4.3.0
Reporter: Joris van Lieshout


The conntrackd package has a bug where the comment in the default config file 
states that stats logging is disabled by default, but the config parameter is 
set to on. The consequence for ACS is that a conntrackd-stats.log file is 
created during the build of the SVM. This logfile gets rotated by logrotate, 
which has a post action that restarts conntrackd, even if the SVM is not a 
redundant router. On VPC routers, for instance, the stats log file can grow 
quickly and fill up the /var volume, killing the VM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-6716) /usr has been sized too small and ends up being 100% full on SSVM and CVM

2014-05-20 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003374#comment-14003374
 ] 

Joris van Lieshout commented on CLOUDSTACK-6716:


Created review request https://reviews.apache.org/r/21696/ 

> /usr has been sized too small and ends up being 100% full on SSVM and CVM
> 
>
> Key: CLOUDSTACK-6716
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6716
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: SystemVM
>Affects Versions: Future, 4.3.0, 4.4.0
>Reporter: Joris van Lieshout
>
> The systemvmtemplate for 4.3 and 4.4 has a /usr volume that is too small and 
> ends up 100% full on Secondary Storage VMs and Console VMs.
> root@v-xxx-VM:~# df -h
> Filesystem                                               Size  Used Avail Use% Mounted on
> rootfs                                                   276M  144M  118M  55% /
> udev                                                      10M     0   10M   0% /dev
> tmpfs                                                    100M  156K  100M   1% /run
> /dev/disk/by-uuid/0721ecee-214a-4143-8d88-a4075cc2cd89   276M  144M  118M  55% /
> tmpfs                                                    5.0M     0  5.0M   0% /run/lock
> tmpfs                                                    314M     0  314M   0% /run/shm
> /dev/xvda1                                                45M   22M   21M  51% /boot
> /dev/xvda6                                                98M  5.6M   88M   6% /home
> /dev/xvda8                                               368M   11M  339M   3% /opt
> /dev/xvda10                                               63M  5.3M   55M   9% /tmp
> /dev/xvda7                                               610M  584M     0 100% /usr
> /dev/xvda9                                               415M  316M   78M  81% /var



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-6716) /usr has been sized too small and ends up being 100% full on SSVM and CVM

2014-05-20 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003050#comment-14003050
 ] 

Joris van Lieshout commented on CLOUDSTACK-6716:


I already have a solution for this. Will submit the patch on review board today.

> /usr has been sized too small and ends up being 100% full on SSVM and CVM
> 
>
> Key: CLOUDSTACK-6716
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6716
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: SystemVM
>Affects Versions: Future, 4.3.0, 4.4.0
>Reporter: Joris van Lieshout
>
> The systemvmtemplate for 4.3 and 4.4 has a /usr volume that is too small and 
> ends up 100% full on Secondary Storage VMs and Console VMs.
> root@v-xxx-VM:~# df -h
> Filesystem                                               Size  Used Avail Use% Mounted on
> rootfs                                                   276M  144M  118M  55% /
> udev                                                      10M     0   10M   0% /dev
> tmpfs                                                    100M  156K  100M   1% /run
> /dev/disk/by-uuid/0721ecee-214a-4143-8d88-a4075cc2cd89   276M  144M  118M  55% /
> tmpfs                                                    5.0M     0  5.0M   0% /run/lock
> tmpfs                                                    314M     0  314M   0% /run/shm
> /dev/xvda1                                                45M   22M   21M  51% /boot
> /dev/xvda6                                                98M  5.6M   88M   6% /home
> /dev/xvda8                                               368M   11M  339M   3% /opt
> /dev/xvda10                                               63M  5.3M   55M   9% /tmp
> /dev/xvda7                                               610M  584M     0 100% /usr
> /dev/xvda9                                               415M  316M   78M  81% /var



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (CLOUDSTACK-6716) /usr has been sized too small and ends up being 100% full on SSVM and CVM

2014-05-20 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-6716:
--

 Summary: /usr has been sized too small and ends up being 100% full 
on SSVM and CVM
 Key: CLOUDSTACK-6716
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6716
 Project: CloudStack
  Issue Type: Bug
  Security Level: Public (Anyone can view this level - this is the default.)
  Components: SystemVM
Affects Versions: Future, 4.3.0, 4.4.0
Reporter: Joris van Lieshout


The systemvmtemplate for 4.3 and 4.4 has a /usr volume that is too small and 
ends up 100% full on Secondary Storage VMs and Console VMs.

root@v-xxx-VM:~# df -h
Filesystem                                               Size  Used Avail Use% Mounted on
rootfs                                                   276M  144M  118M  55% /
udev                                                      10M     0   10M   0% /dev
tmpfs                                                    100M  156K  100M   1% /run
/dev/disk/by-uuid/0721ecee-214a-4143-8d88-a4075cc2cd89   276M  144M  118M  55% /
tmpfs                                                    5.0M     0  5.0M   0% /run/lock
tmpfs                                                    314M     0  314M   0% /run/shm
/dev/xvda1                                                45M   22M   21M  51% /boot
/dev/xvda6                                                98M  5.6M   88M   6% /home
/dev/xvda8                                               368M   11M  339M   3% /opt
/dev/xvda10                                               63M  5.3M   55M   9% /tmp
/dev/xvda7                                               610M  584M     0 100% /usr
/dev/xvda9                                               415M  316M   78M  81% /var



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (CLOUDSTACK-6308) when executing createNetwork as ROOT for a subdomain/account it checks for network overlap in all subdomains/accounts

2014-03-31 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-6308:
--

 Summary: when executing createNetwork as ROOT for a 
subdomain/account it checks for network overlap in all subdomains/accounts
 Key: CLOUDSTACK-6308
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6308
 Project: CloudStack
  Issue Type: Bug
  Security Level: Public (Anyone can view this level - this is the default.)
  Components: API
Affects Versions: 4.2.0, Future, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1
Reporter: Joris van Lieshout
Priority: Critical


When executing createNetwork with an account from the ROOT domain, with the 
domainid and account of a subdomain/account specified, the error below is 
thrown when the ip range overlaps with a network of another subdomain.

errorCode: 431, errorText:The IP range has already been added with gateway 
192.168.150.1 ,and netmask 255.255.255.0, Please specify the gateway/netmask if 
you want to extend ip range

scenario:
ROOT/ACME has network 192.168.150.0/24 gw 192.168.150.1
exec createNetwork for ROOT/MEGACORP account johndoe network 192.168.150.0/24 
gw 192.168.150.1 with ROOT domain credentials.

workaround for now:
execute createNetwork with credentials from domain MEGACORP and account johndoe.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (CLOUDSTACK-6223) removeNicFromVirtualMachine fails if another instance in another domain has a nic with the same ip and a forwarding rule configured on it

2014-03-11 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-6223:
--

 Summary: removeNicFromVirtualMachine fails if another instance in 
another domain has a nic with the same ip and a forwarding rule configured on it
 Key: CLOUDSTACK-6223
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6223
 Project: CloudStack
  Issue Type: Bug
  Security Level: Public (Anyone can view this level - this is the default.)
Affects Versions: 4.2.1
Reporter: Joris van Lieshout
Priority: Blocker


When removeNicFromVirtualMachine is called for a nic on an instance, the code 
below is evaluated. This piece of code searches for port forwarding rules 
across all domains. If another instance exists that has a nic with the same ip 
and a forwarding rule, the search returns >1 and the removeNicFromVirtualMachine 
call fails. A sketch of a network-scoped check follows the stack trace below.

server/src/com/cloud/network/rules/RulesManagerImpl.java
@Override
public List<FirewallRuleVO> listAssociatedRulesForGuestNic(Nic nic) {
    List<FirewallRuleVO> result = new ArrayList<FirewallRuleVO>();
    // add PF rules
    result.addAll(_portForwardingDao.listByDestIpAddr(nic.getIp4Address()));
    // add static NAT rules

Stack trace:
2014-03-11 15:24:04,944 ERROR [cloud.async.AsyncJobManagerImpl] (Job-Executor-102:job-193607 = [ 30e81de3-2a00-49f2-8d80-545a765e4c1e ]) Unexpected exception while executing org.apache.cloudstack.api.command.user.vm.RemoveNicFromVMCmd
com.cloud.utils.exception.CloudRuntimeException: Failed to remove nic from VM[User|zzz1] in Ntwk[994|Guest|14], nic has associated Port forwarding or Load balancer or Static NAT rules.
    at com.cloud.vm.VirtualMachineManagerImpl.removeNicFromVm(VirtualMachineManagerImpl.java:3058)
    at com.cloud.vm.UserVmManagerImpl.removeNicFromVirtualMachine(UserVmManagerImpl.java:1031)
    at org.apache.cloudstack.api.command.user.vm.RemoveNicFromVMCmd.execute(RemoveNicFromVMCmd.java:103)
    at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:158)
    at com.cloud.async.AsyncJobManagerImpl$1.run(AsyncJobManagerImpl.java:531)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
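
A minimal, self-contained sketch of the direction a fix could take: only count 
rules that live in the same guest network as the nic, so an identical IP in 
another domain no longer blocks the removal. The types below are hypothetical 
stand-ins, not the actual CloudStack value objects:

import java.util.ArrayList;
import java.util.List;

public class NicRuleFilter {
    // Hypothetical stand-ins for the objects used in the excerpt above.
    interface Rule { long getNetworkId(); }
    interface Nic  { long getNetworkId(); }

    /**
     * Network-scoped variant of the check quoted above: of all rules that point
     * at the nic's IP (e.g. the result of listByDestIpAddr), only those in the
     * nic's own guest network should block removeNicFromVirtualMachine.
     */
    static List<Rule> rulesBlockingNicRemoval(Nic nic, List<? extends Rule> rulesWithSameIp) {
        List<Rule> blocking = new ArrayList<Rule>();
        for (Rule rule : rulesWithSameIp) {
            if (rule.getNetworkId() == nic.getNetworkId()) {
                blocking.add(rule);
            }
        }
        return blocking;
    }
}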



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-6195) an ACS db upgraded from Pre-4.0 version is missing unique key constraint on host_details

2014-03-04 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919341#comment-13919341
 ] 

Joris van Lieshout commented on CLOUDSTACK-6195:


Hi Wei Zhou,

We were really looking forward to 4.x, I guess. :) Anyway, this explains the 
issue. We've already fixed the constraint and will be doing a schema compare to 
make sure this was the only discrepancy. I've created this ticket as a courtesy 
just in case anyone else runs into this. Good to hear we're probably the 
only one. :)
For me, the case is closed as a "non-issue".

Thanks again!
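
For anyone else who runs into this, a minimal sketch of what restoring the 
constraint on an upgraded database could look like; the JDBC URL and 
credentials are placeholders, and this is not the exact statement we ran:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FixHostDetailsUniqueKey {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC URL and credentials for the 'cloud' database.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/cloud", "cloud", "secret");
             Statement stmt = conn.createStatement()) {
            // Drop duplicate (host_id, name) rows, keeping the lowest id,
            // otherwise adding the unique key below will fail.
            stmt.executeUpdate(
                "DELETE d1 FROM host_details d1 JOIN host_details d2 "
                + "ON d1.host_id = d2.host_id AND d1.name = d2.name AND d1.id > d2.id");
            // Add the unique key that fresh 4.x installs already have.
            stmt.executeUpdate(
                "ALTER TABLE host_details ADD UNIQUE KEY uk_host_id_name (host_id, name)");
        }
    }
}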

> an ACS db upgraded from Pre-4.0 version is missing unique key constraint on 
> host_details
> 
>
> Key: CLOUDSTACK-6195
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6195
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Upgrade
>Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, 4.2.1, 4.1.2
> Environment: Pre-4.0 db upgraded to 4.x. We have confirmed this bug 
> in a db that started out as 2.2.14. 
>Reporter: Joris van Lieshout
>
> This is the table in our 4.2.1 env that has been upgraded from 2.2.14.
> CREATE TABLE `host_details` (
>   `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
>   `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
>   `name` varchar(255) NOT NULL,
>   `value` varchar(255) NOT NULL,
>   PRIMARY KEY (`id`),
>   KEY `fk_host_details__host_id` (`host_id`),
>   CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES 
> `host` (`id`) ON DELETE CASCADE
> ) ENGINE=InnoDB AUTO_INCREMENT=752966 DEFAULT CHARSET=utf8;
> And this is the table of a fresh 4.x install:
> CREATE TABLE `host_details` (
>   `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
>   `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
>   `name` varchar(255) NOT NULL,
>   `value` varchar(255) NOT NULL,
>   PRIMARY KEY (`id`),
>   UNIQUE KEY `uk_host_id_name` (`host_id`,`name`),
>   KEY `fk_host_details__host_id` (`host_id`),
>   CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES 
> `host` (`id`) ON DELETE CASCADE
> ) ENGINE=InnoDB AUTO_INCREMENT=242083 DEFAULT CHARSET=utf8;
> The effect of this missing constraint is a lot of duplicate entries in the 
> host_details table. The duplicate information in the host_details table 
> causes the api call listHosts to return the same host tag multiple times (to 
> be exact: the number of duplicate entries in the host_details table for that 
> host).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-6195) an ACS db upgraded from Pre-4.0 version is missing unique key constraint on host_details

2014-03-04 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919294#comment-13919294
 ] 

Joris van Lieshout commented on CLOUDSTACK-6195:


Hi Wei Zhou,

Thank you for having a look. If I check the schema-create script of 2.2.14 
(https://github.com/CloudStack-extras/CloudStack-archive/blob/2.2.14/setup/db/create-schema.sql)
 I see that the constraint is not there. I will check the scripts of 3.0.0, 
3.0.1 and 3.0.2 as well and update this ticket.
Our upgrade path up until 4.0 is the same as yours.

1   2.2.14.20120210102939   2012-03-20 19:46:38 Complete
2   3.0.0   2012-06-22 12:48:19 Complete
3   3.0.1   2012-06-22 12:48:19 Complete
4   3.0.2   2012-06-22 12:48:19 Complete
7   4.0.0   2012-08-21 13:00:14 Complete
9   4.0.1   2013-02-13 12:36:24 Complete
11  4.0.2   2013-04-23 07:21:08 Complete
13  4.1.0   2013-07-16 09:43:23 Complete
15  4.1.1   2013-07-16 09:43:23 Complete
17  4.2.0   2013-12-18 09:38:25 Complete
19  4.2.1   2013-12-18 09:38:25 Complete

> an ACS db upgraded from Pre-4.0 version is missing unique key constraint on 
> host_details
> 
>
> Key: CLOUDSTACK-6195
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6195
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Upgrade
>Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, 4.2.1, 4.1.2
> Environment: Pre-4.0 db upgraded to 4.x. We have confirmed this bug 
> in a db that started out as 2.2.14. 
>Reporter: Joris van Lieshout
>
> This is the table in our 4.2.1 env that has been upgraded from 2.2.14.
> CREATE TABLE `host_details` (
>   `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
>   `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
>   `name` varchar(255) NOT NULL,
>   `value` varchar(255) NOT NULL,
>   PRIMARY KEY (`id`),
>   KEY `fk_host_details__host_id` (`host_id`),
>   CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES 
> `host` (`id`) ON DELETE CASCADE
> ) ENGINE=InnoDB AUTO_INCREMENT=752966 DEFAULT CHARSET=utf8;
> And this is the table of a fresh 4.x install:
> CREATE TABLE `host_details` (
>   `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
>   `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
>   `name` varchar(255) NOT NULL,
>   `value` varchar(255) NOT NULL,
>   PRIMARY KEY (`id`),
>   UNIQUE KEY `uk_host_id_name` (`host_id`,`name`),
>   KEY `fk_host_details__host_id` (`host_id`),
>   CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES 
> `host` (`id`) ON DELETE CASCADE
> ) ENGINE=InnoDB AUTO_INCREMENT=242083 DEFAULT CHARSET=utf8;
> The effect of this missing constraint is a lot of duplicate entries in the 
> host_details table. The duplicate information in the host_details table 
> causes the api call listHosts to return the same host tag multiple times (to 
> be exact: the number of duplicate entries in the host_details table for that 
> host).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CLOUDSTACK-6195) an ACS db upgraded from Pre-4.0 version is missing unique key constraint on host_details

2014-03-03 Thread Joris van Lieshout (JIRA)

 [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris van Lieshout updated CLOUDSTACK-6195:
---

Description: 
This is the table in our 4.2.1 env that has been upgraded from 2.2.14.

CREATE TABLE `host_details` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
  `name` varchar(255) NOT NULL,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `fk_host_details__host_id` (`host_id`),
  CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES 
`host` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=752966 DEFAULT CHARSET=utf8;

And this is the table of a fresh 4.x install:

CREATE TABLE `host_details` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
  `name` varchar(255) NOT NULL,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_host_id_name` (`host_id`,`name`),
  KEY `fk_host_details__host_id` (`host_id`),
  CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES 
`host` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=242083 DEFAULT CHARSET=utf8;

The effect of this missing constraint is a lot of duplicate entries in the 
host_details table. The duplicate information in the host_details table causes 
the api call listHosts to return the same host tag multiple times (to be exact: 
the number of duplicate entries in the host_details table for that host).

  was:
This is the table in our 4.2.1 env that has been upgraded from 2.2.14.

CREATE TABLE `host_details` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
  `name` varchar(255) NOT NULL,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `fk_host_details__host_id` (`host_id`),
  CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES 
`host` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=752966 DEFAULT CHARSET=utf8;

And this is the table of a fresh 4.x install:

CREATE TABLE `host_details` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
  `name` varchar(255) NOT NULL,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_host_id_name` (`host_id`,`name`),
  KEY `fk_host_details__host_id` (`host_id`),
  CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES 
`host` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=242083 DEFAULT CHARSET=utf8;


> an ACS db upgraded from Pre-4.0 version is missing unique key constraint on 
> host_details
> 
>
> Key: CLOUDSTACK-6195
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6195
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Upgrade
>Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, 4.2.1, 4.1.2
> Environment: Pre-4.0 db upgraded to 4.x. We have confirmed this bug 
> in a db that started out as 2.2.14. 
>Reporter: Joris van Lieshout
>
> This is the table in our 4.2.1 env that has been upgraded from 2.2.14.
> CREATE TABLE `host_details` (
>   `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
>   `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
>   `name` varchar(255) NOT NULL,
>   `value` varchar(255) NOT NULL,
>   PRIMARY KEY (`id`),
>   KEY `fk_host_details__host_id` (`host_id`),
>   CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES 
> `host` (`id`) ON DELETE CASCADE
> ) ENGINE=InnoDB AUTO_INCREMENT=752966 DEFAULT CHARSET=utf8;
> And this is the table of a fresh 4.x install:
> CREATE TABLE `host_details` (
>   `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
>   `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
>   `name` varchar(255) NOT NULL,
>   `value` varchar(255) NOT NULL,
>   PRIMARY KEY (`id`),
>   UNIQUE KEY `uk_host_id_name` (`host_id`,`name`),
>   KEY `fk_host_details__host_id` (`host_id`),
>   CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES 
> `host` (`id`) ON DELETE CASCADE
> ) ENGINE=InnoDB AUTO_INCREMENT=242083 DEFAULT CHARSET=utf8;
> The effect of this missing constraint is a lot of duplicate entries in the 
> host_details table. The duplicate information in the host_details table 
> causes the api call listHosts to return the same host tag multiple times (to 
> be exact: the number of duplicate entries in the host_details table for that 
> host).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (CLOUDSTACK-6195) an ACS db upgraded from Pre-4.0 version is missing unique key constraint on host_details

2014-03-03 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-6195:
--

 Summary: an ACS db upgraded from Pre-4.0 version is missing unique 
key constraint on host_details
 Key: CLOUDSTACK-6195
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6195
 Project: CloudStack
  Issue Type: Bug
  Security Level: Public (Anyone can view this level - this is the default.)
  Components: Upgrade
Affects Versions: 4.2.0, 4.1.0, 4.0.2, 4.0.1, 4.0.0, 4.1.1, 4.2.1, 4.1.2
 Environment: Pre-4.0 db upgraded to 4.x. We have confirmed this bug in 
a db that started out as 2.2.14. 
Reporter: Joris van Lieshout


This is the table in our 4.2.1 env that has been upgraded from 2.2.14.

CREATE TABLE `host_details` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
  `name` varchar(255) NOT NULL,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `fk_host_details__host_id` (`host_id`),
  CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES 
`host` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=752966 DEFAULT CHARSET=utf8;

And this is the table of a fresh 4.x install:

CREATE TABLE `host_details` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
  `name` varchar(255) NOT NULL,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_host_id_name` (`host_id`,`name`),
  KEY `fk_host_details__host_id` (`host_id`),
  CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES 
`host` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=242083 DEFAULT CHARSET=utf8;



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CLOUDSTACK-6023) Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits

2014-02-05 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892005#comment-13892005
 ] 

Joris van Lieshout commented on CLOUDSTACK-6023:


Today we will be installing on our test env a custom build of 4.2.1 that has a 
max of 16. I should be able to answer your question in a couple of days. 
Theoretically, however, looking at the current size of the POST and the number 
of instances with vcpumax=32, setting it to 16 will make a big difference.
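
For illustration, a minimal sketch of the behaviour our custom build aims for; 
the class, constant and method names are hypothetical and this is not the 
upstream fix. VCPUsMax defaults to the offering's vcpus and is never allowed to 
exceed the documented XenServer per-VM limit of 16:

public class XenVcpuLimits {
    // Documented per-VM vCPU limit for XenServer 6.0/6.1/6.2.
    static final long XS_MAX_VCPUS_PER_VM = 16L;

    /** VCPUsMax to hand to xapi: the offering's value (or its vcpus count) capped at the limit. */
    static long vcpusMax(long offeringVcpus, Long offeringVcpusMaxOrNull) {
        long requested = (offeringVcpusMaxOrNull != null) ? offeringVcpusMaxOrNull : offeringVcpus;
        return Math.min(requested, XS_MAX_VCPUS_PER_VM);
    }

    public static void main(String[] args) {
        // A 1-vCPU router VM: VCPUsMax becomes 1 instead of the hardcoded 32.
        System.out.println(vcpusMax(1, null));   // 1
        // An offering that explicitly asks for more than the hypervisor allows is capped.
        System.out.println(vcpusMax(4, 32L));    // 16
    }
}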

> Non windows instances are created on XenServer with a vcpu-max above 
> supported xenserver limits
> ---
>
> Key: CLOUDSTACK-6023
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6023
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: XenServer
>Affects Versions: Future, 4.2.1, 4.3.0
>Reporter: Joris van Lieshout
>Priority: Blocker
> Attachments: xentop.png
>
>
> CitrixResourceBase.java contains a hardcoded value for vcpusmax for non 
> windows instances:
> if (guestOsTypeName.toLowerCase().contains("windows")) {
> vmr.VCPUsMax = (long) vmSpec.getCpus();
> } else {
> vmr.VCPUsMax = 32L;
> }
> For all currently available versions of XenServer the limit is 16vcpus:
> http://support.citrix.com/servlet/KbServlet/download/28909-102-664115/XenServer-6.0-Configuration-Limits.pdf
> http://support.citrix.com/servlet/KbServlet/download/32312-102-704653/CTX134789%20-%20XenServer%206.1.0_Configuration%20Limits.pdf
> http://support.citrix.com/servlet/KbServlet/download/34966-102-706122/CTX137837_XenServer%206_2_0_Configuration%20Limits.pdf
> In addition there seems to be a limit to the total number of assigned vcpus 
> on a XenServer.
> The impact of this bug is that xapi becomes unstable and keeps losing its 
> master_connection because the POST to /remote_db_access is bigger than 
> its limit of 200K. This basically renders a pool slave unmanageable. 
> If you would look at the running instances using xentop you will see hosts 
> reporting with 32 vcpus
> Below the relevant portion of the xensource.log that shows the effect of the 
> bug:
> [20140204T13:52:17.264Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: Using commandline: 
> /usr/sbin/stunnel -fd f3b8bb12-4e03-b47a-0dc5-85ad5aef79e6
> [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: stunnel has pidty: (FEFork 
> (43,30540))
> [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: stunnel start
> [20140204T13:52:17.269Z| info|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel connected pid=30540 fd=40
> [20140204T13:52:17.346Z|error|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] Received HTTP error 500 ({ method = POST; 
> uri = /remote_db_access; query = [  ]; content_length = [ 315932 ]; transfer 
> encoding = ; version = 1.1; cookie = [ 
> pool_secret=386bbf39-8710-4d2d-f452-9725d79c2393/aa7bcda9-8ebb-0cef-bb77-c6b496c5d859/1f928d82-7a20-9117-dd30-f96c7349b16e
>  ]; task = ; subtask_of = ; content-type = ; user_agent = xapi/1.9 }) from 
> master. This suggests our master address is wrong. Sleeping for 60s and then 
> restarting.
> [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Caught Master_connection.Goto_handler
> [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Connection to master died. I will continue 
> to retry indefinitely (supressing future logging of this message).
> [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Connection to master died. I will continue 
> to retry indefinitely (supressing future logging of this message).
> [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Sleeping 2.00 seconds before retrying 
> master connection...
> [20140204T13:53:20.627Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] stunnel: Using commandline: 
> /usr/sbin/stunnel -fd 3c8aed8e-1fce-be7c-09f8-b45cdc40a1f5
> [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] stunnel: stunnel has pidty: (FEFork 
> (23,31207))
> [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] stunnel: stunnel start
> [20140204T13:53:20.632Z| info|xense

[jira] [Commented] (CLOUDSTACK-6023) Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits

2014-02-05 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891982#comment-13891982
 ] 

Joris van Lieshout commented on CLOUDSTACK-6023:


That is a good idea. Nice solution.

> Non windows instances are created on XenServer with a vcpu-max above 
> supported xenserver limits
> ---
>
> Key: CLOUDSTACK-6023
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6023
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: XenServer
>Affects Versions: Future, 4.2.1, 4.3.0
>Reporter: Joris van Lieshout
>Priority: Blocker
> Attachments: xentop.png
>
>
> CitrixResourceBase.java contains a hardcoded value for vcpusmax for non 
> windows instances:
> if (guestOsTypeName.toLowerCase().contains("windows")) {
> vmr.VCPUsMax = (long) vmSpec.getCpus();
> } else {
> vmr.VCPUsMax = 32L;
> }
> For all currently available versions of XenServer the limit is 16vcpus:
> http://support.citrix.com/servlet/KbServlet/download/28909-102-664115/XenServer-6.0-Configuration-Limits.pdf
> http://support.citrix.com/servlet/KbServlet/download/32312-102-704653/CTX134789%20-%20XenServer%206.1.0_Configuration%20Limits.pdf
> http://support.citrix.com/servlet/KbServlet/download/34966-102-706122/CTX137837_XenServer%206_2_0_Configuration%20Limits.pdf
> In addition there seems to be a limit to the total number of assigned vcpus 
> on a XenServer.
> The impact of this bug is that xapi becomes unstable and keeps losing its 
> master_connection because the POST to /remote_db_access is bigger than 
> its limit of 200K. This basically renders a pool slave unmanageable. 
> If you would look at the running instances using xentop you will see hosts 
> reporting with 32 vcpus
> Below the relevant portion of the xensource.log that shows the effect of the 
> bug:
> [20140204T13:52:17.264Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: Using commandline: 
> /usr/sbin/stunnel -fd f3b8bb12-4e03-b47a-0dc5-85ad5aef79e6
> [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: stunnel has pidty: (FEFork 
> (43,30540))
> [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: stunnel start
> [20140204T13:52:17.269Z| info|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel connected pid=30540 fd=40
> [20140204T13:52:17.346Z|error|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] Received HTTP error 500 ({ method = POST; 
> uri = /remote_db_access; query = [  ]; content_length = [ 315932 ]; transfer 
> encoding = ; version = 1.1; cookie = [ 
> pool_secret=386bbf39-8710-4d2d-f452-9725d79c2393/aa7bcda9-8ebb-0cef-bb77-c6b496c5d859/1f928d82-7a20-9117-dd30-f96c7349b16e
>  ]; task = ; subtask_of = ; content-type = ; user_agent = xapi/1.9 }) from 
> master. This suggests our master address is wrong. Sleeping for 60s and then 
> restarting.
> [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Caught Master_connection.Goto_handler
> [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Connection to master died. I will continue 
> to retry indefinitely (supressing future logging of this message).
> [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Connection to master died. I will continue 
> to retry indefinitely (supressing future logging of this message).
> [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Sleeping 2.00 seconds before retrying 
> master connection...
> [20140204T13:53:20.627Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] stunnel: Using commandline: 
> /usr/sbin/stunnel -fd 3c8aed8e-1fce-be7c-09f8-b45cdc40a1f5
> [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] stunnel: stunnel has pidty: (FEFork 
> (23,31207))
> [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] stunnel: stunnel start
> [20140204T13:53:20.632Z| info|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] stunnel connected pid=31207 fd=20
> [20140204T13:53:28.874Z|error|xenserverhost1|4 
> unix-RPC|session.login_with_password D:2e7664ad69ed|master_connection] Caught 
> Master_connection.Goto_ha

[jira] [Updated] (CLOUDSTACK-6023) Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits

2014-02-05 Thread Joris van Lieshout (JIRA)

 [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris van Lieshout updated CLOUDSTACK-6023:
---

Attachment: xentop.png

> Non windows instances are created on XenServer with a vcpu-max above 
> supported xenserver limits
> ---
>
> Key: CLOUDSTACK-6023
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6023
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: XenServer
>Affects Versions: Future, 4.2.1, 4.3.0
>Reporter: Joris van Lieshout
>Priority: Blocker
> Attachments: xentop.png
>
>
> CitrixResourceBase.java contains a hardcoded value for vcpusmax for non 
> windows instances:
> if (guestOsTypeName.toLowerCase().contains("windows")) {
> vmr.VCPUsMax = (long) vmSpec.getCpus();
> } else {
> vmr.VCPUsMax = 32L;
> }
> For all currently available versions of XenServer the limit is 16vcpus:
> http://support.citrix.com/servlet/KbServlet/download/28909-102-664115/XenServer-6.0-Configuration-Limits.pdf
> http://support.citrix.com/servlet/KbServlet/download/32312-102-704653/CTX134789%20-%20XenServer%206.1.0_Configuration%20Limits.pdf
> http://support.citrix.com/servlet/KbServlet/download/34966-102-706122/CTX137837_XenServer%206_2_0_Configuration%20Limits.pdf
> In addition there seems to be a limit to the total number of assigned vcpus 
> on a XenServer.
> The impact of this bug is that xapi becomes unstable and keeps losing its 
> master_connection because the POST to /remote_db_access is bigger than 
> its limit of 200K. This basically renders a pool slave unmanageable. 
> If you would look at the running instances using xentop you will see hosts 
> reporting with 32 vcpus
> Below the relevant portion of the xensource.log that shows the effect of the 
> bug:
> [20140204T13:52:17.264Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: Using commandline: 
> /usr/sbin/stunnel -fd f3b8bb12-4e03-b47a-0dc5-85ad5aef79e6
> [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: stunnel has pidty: (FEFork 
> (43,30540))
> [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: stunnel start
> [20140204T13:52:17.269Z| info|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel connected pid=30540 fd=40
> [20140204T13:52:17.346Z|error|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] Received HTTP error 500 ({ method = POST; 
> uri = /remote_db_access; query = [  ]; content_length = [ 315932 ]; transfer 
> encoding = ; version = 1.1; cookie = [ 
> pool_secret=386bbf39-8710-4d2d-f452-9725d79c2393/aa7bcda9-8ebb-0cef-bb77-c6b496c5d859/1f928d82-7a20-9117-dd30-f96c7349b16e
>  ]; task = ; subtask_of = ; content-type = ; user_agent = xapi/1.9 }) from 
> master. This suggests our master address is wrong. Sleeping for 60s and then 
> restarting.
> [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Caught Master_connection.Goto_handler
> [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Connection to master died. I will continue 
> to retry indefinitely (supressing future logging of this message).
> [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Connection to master died. I will continue 
> to retry indefinitely (supressing future logging of this message).
> [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Sleeping 2.00 seconds before retrying 
> master connection...
> [20140204T13:53:20.627Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] stunnel: Using commandline: 
> /usr/sbin/stunnel -fd 3c8aed8e-1fce-be7c-09f8-b45cdc40a1f5
> [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] stunnel: stunnel has pidty: (FEFork 
> (23,31207))
> [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] stunnel: stunnel start
> [20140204T13:53:20.632Z| info|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] stunnel connected pid=31207 fd=20
> [20140204T13:53:28.874Z|error|xenserverhost1|4 
> unix-RPC|session.login_with_password D:2e7664ad69ed|master_connection] Caught 
> Master_connection.Goto_handler
> [20140204T13:53:28.874Z|debug|xenserverhost1|4 
> unix-RPC

[jira] [Comment Edited] (CLOUDSTACK-6023) Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits

2014-02-05 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891934#comment-13891934
 ] 

Joris van Lieshout edited comment on CLOUDSTACK-6023 at 2/5/14 9:03 AM:


Hi Hrikrishna,

We came to this conclusion by using tcpdump to capture the POST that got 
returned with an http 500 error from the pool master. This post, which exceeded 
the 300k limit of xapi rpc, contained for each vm the stats for each of the 32 
vcpus (even though the instances were just using 1 vcpu), thus making this post 
exceed the 300K limit. We are encountering this issue on a host running just 59 
instances (inc 36 router vms that use just 1 vcpu but have a vcpumax of 32). 

My suggestion to resolve this issue would be to make the vcpu-max a 
configurable variable of a service/compute offering, with a default of 
vcpusmax=vcpus unless otherwise configured in the offering.

In addition, I do wonder why there is a discrepancy between the XenServer 
Configuration Limits documentation and the documents you are referring to. In 
the end we are actively experiencing this issue. I've attached a screen print 
of xentop on one of our xenserver 6.0.2 hosts with this issue.

If it helps, I can attach the packet capture with the POST.


was (Author: jvanliesh...@schubergphilis.com):
Hi Hrikrishna,

We came to this conclusion by using tcpdump to capture the POST that got 
returned with an http 500 error from the pool master. This post, which exceeded 
the 300k limit of xapi rpc, contained for each vm the stats for each of the 32 
vcpus (even though the instances were just using 1 vcpu), thus making this post 
exceed the 300K limit. We are encountering this issue on a host running just 59 
instances (inc 36 router vms that use just 1 vcpu but have a vcpumax of 32). 

My suggestion to resolve this issue would be to make the vcpu-max a 
configurable variable of a service/compute offering, with a default of 
vcpusmax=vcpus unless otherwise configured in the offering.

In addition, I do wonder why there is a discrepancy between the XenServer 
Configuration Limits documentation and the documents you are referring to. In 
the end we are actively experiencing this issue. I've attached a screen print 
of xentop on one of our xenserver 6.0.2 hosts with this issue.


> Non windows instances are created on XenServer with a vcpu-max above 
> supported xenserver limits
> ---
>
> Key: CLOUDSTACK-6023
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6023
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: XenServer
>Affects Versions: Future, 4.2.1, 4.3.0
>Reporter: Joris van Lieshout
>Priority: Blocker
> Attachments: xentop.png
>
>
> CitrixResourceBase.java contains a hardcoded value for vcpusmax for non 
> windows instances:
> if (guestOsTypeName.toLowerCase().contains("windows")) {
> vmr.VCPUsMax = (long) vmSpec.getCpus();
> } else {
> vmr.VCPUsMax = 32L;
> }
> For all currently available versions of XenServer the limit is 16vcpus:
> http://support.citrix.com/servlet/KbServlet/download/28909-102-664115/XenServer-6.0-Configuration-Limits.pdf
> http://support.citrix.com/servlet/KbServlet/download/32312-102-704653/CTX134789%20-%20XenServer%206.1.0_Configuration%20Limits.pdf
> http://support.citrix.com/servlet/KbServlet/download/34966-102-706122/CTX137837_XenServer%206_2_0_Configuration%20Limits.pdf
> In addition there seems to be a limit to the total number of assigned vcpus 
> on a XenServer.
> The impact of this bug is that xapi becomes unstable and keeps losing its 
> master_connection because the POST to /remote_db_access is bigger than 
> its limit of 200K. This basically renders a pool slave unmanageable. 
> If you would look at the running instances using xentop you will see hosts 
> reporting with 32 vcpus
> Below the relevant portion of the xensource.log that shows the effect of the 
> bug:
> [20140204T13:52:17.264Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: Using commandline: 
> /usr/sbin/stunnel -fd f3b8bb12-4e03-b47a-0dc5-85ad5aef79e6
> [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: stunnel has pidty: (FEFork 
> (43,30540))
> [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: stunnel start
> [20140204T13:52:17.269Z| info|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel connected pid=30540 fd=40
> [20140204T13:52:17.346Z|error|xenserve

[jira] [Commented] (CLOUDSTACK-6023) Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits

2014-02-05 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891934#comment-13891934
 ] 

Joris van Lieshout commented on CLOUDSTACK-6023:


Hi Hrikrishna,

We came to this conclusion by using tcpdump to capture the POST that got 
returned with an http 500 error from the pool master. This post, which exceeded 
the 300k limit of xapi rpc, contained for each vm the stats for each of the 32 
vcpus (even though the instances were just using 1 vcpu), thus making this post 
exceed the 300K limit. We are encountering this issue on a host running just 59 
instances (inc 36 router vms that use just 1 vcpu but have a vcpumax of 32). 

My suggestion to resolve this issue would be to make the vcpu-max a 
configurable variable of a service/compute offering, with a default of 
vcpusmax=vcpus unless otherwise configured in the offering.

In addition, I do wonder why there is a discrepancy between the XenServer 
Configuration Limits documentation and the documents you are referring to. In 
the end we are actively experiencing this issue. I've attached a screen print 
of xentop on one of our xenserver 6.0.2 hosts with this issue.


> Non windows instances are created on XenServer with a vcpu-max above 
> supported xenserver limits
> ---
>
> Key: CLOUDSTACK-6023
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6023
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: XenServer
>Affects Versions: Future, 4.2.1, 4.3.0
>Reporter: Joris van Lieshout
>Priority: Blocker
>
> CitrixResourceBase.java contains a hardcoded value for vcpusmax for non 
> windows instances:
> if (guestOsTypeName.toLowerCase().contains("windows")) {
> vmr.VCPUsMax = (long) vmSpec.getCpus();
> } else {
> vmr.VCPUsMax = 32L;
> }
> For all currently available versions of XenServer the limit is 16vcpus:
> http://support.citrix.com/servlet/KbServlet/download/28909-102-664115/XenServer-6.0-Configuration-Limits.pdf
> http://support.citrix.com/servlet/KbServlet/download/32312-102-704653/CTX134789%20-%20XenServer%206.1.0_Configuration%20Limits.pdf
> http://support.citrix.com/servlet/KbServlet/download/34966-102-706122/CTX137837_XenServer%206_2_0_Configuration%20Limits.pdf
> In addition there seems to be a limit to the total number of assigned vcpus 
> on a XenServer.
> The impact of this bug is that xapi becomes unstable and keeps losing its 
> master_connection because the POST to /remote_db_access is bigger than 
> its limit of 200K. This basically renders a pool slave unmanageable. 
> If you would look at the running instances using xentop you will see hosts 
> reporting with 32 vcpus
> Below the relevant portion of the xensource.log that shows the effect of the 
> bug:
> [20140204T13:52:17.264Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: Using commandline: 
> /usr/sbin/stunnel -fd f3b8bb12-4e03-b47a-0dc5-85ad5aef79e6
> [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: stunnel has pidty: (FEFork 
> (43,30540))
> [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel: stunnel start
> [20140204T13:52:17.269Z| info|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] stunnel connected pid=30540 fd=40
> [20140204T13:52:17.346Z|error|xenserverhost1|144 inet-RPC|host.call_plugin 
> R:e58e985539ab|master_connection] Received HTTP error 500 ({ method = POST; 
> uri = /remote_db_access; query = [  ]; content_length = [ 315932 ]; transfer 
> encoding = ; version = 1.1; cookie = [ 
> pool_secret=386bbf39-8710-4d2d-f452-9725d79c2393/aa7bcda9-8ebb-0cef-bb77-c6b496c5d859/1f928d82-7a20-9117-dd30-f96c7349b16e
>  ]; task = ; subtask_of = ; content-type = ; user_agent = xapi/1.9 }) from 
> master. This suggests our master address is wrong. Sleeping for 60s and then 
> restarting.
> [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Caught Master_connection.Goto_handler
> [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Connection to master died. I will continue 
> to retry indefinitely (supressing future logging of this message).
> [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update 
> D:5c5376f0da6c|master_connection] Connection to master died. I will continue 
> to retry indefinitely (supressing future logging of this message).
> [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 

[jira] [Created] (CLOUDSTACK-6024) template copy to primary storage uses a random source secstorage from any zone

2014-02-04 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-6024:
--

 Summary: template copy to primary storage uses a random source 
secstorage from any zone
 Key: CLOUDSTACK-6024
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6024
 Project: CloudStack
  Issue Type: Bug
  Security Level: Public (Anyone can view this level - this is the default.)
Affects Versions: 4.2.1, 4.3.0
 Environment: Multiple zones where the secstorage of a zone is not 
accessible to hosts from the other zone.
Reporter: Joris van Lieshout
Priority: Critical


2014-02-04 15:19:07,674 DEBUG [cloud.storage.VolumeManagerImpl] 
(Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) 
Checking if we need to prepare 1 volumes for VM[User|xx-app01]

2014-02-04 15:19:07,693 DEBUG [storage.image.TemplateDataFactoryImpl] 
(Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) 
template 467 is already in store:117, type:Image

// store 117 is not accessible from the zone where this hypervisor lives

2014-02-04 15:19:07,705 DEBUG [storage.datastore.PrimaryDataStoreImpl] 
(Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) Not 
found (templateId:467poolId:208) in template_spool_ref, persisting it

2014-02-04 15:19:07,718 DEBUG [storage.image.TemplateDataFactoryImpl] 
(Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) 
template 467 is already in store:208, type:Primary

2014-02-04 15:19:07,722 DEBUG [storage.volume.VolumeServiceImpl] 
(Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) Found 
template 467-2-6c05b599-95ed-34c3-b8f0-fd9c30bac938 in storage pool 208 with 
VMTemplateStoragePool id: 36433

2014-02-04 15:19:07,732 DEBUG [storage.volume.VolumeServiceImpl] 
(Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) Acquire 
lock on VMTemplateStoragePool 36433 with timeout 3600 seconds

2014-02-04 15:19:07,737 INFO  [storage.volume.VolumeServiceImpl] 
(Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) lock is 
acquired for VMTemplateStoragePool 36433

2014-02-04 15:19:07,748 DEBUG [storage.motion.AncientDataMotionStrategy] 
(Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) 
copyAsync inspecting src type TEMPLATE copyAsync inspecting dest type TEMPLATE

2014-02-04 15:19:07,775 DEBUG [agent.manager.ClusteredAgentAttache] 
(Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) Seq 
93-1862347354: Forwarding Seq 93-1862347354:  { Cmd , MgmtId: 345052370018, 
via: 93, Ver: v1, Flags: 100111, 
[{"org.apache.cloudstack.storage.command.CopyCommand":{"srcTO":{"org.apache.cloudstack.storage.to.TemplateObjectTO":{"path":"template/tmpl/2/467/c263eb76-3d72-3732-8cc6-42b0dad55c4d.vhd","origUrl":"http://x.x.com/image/centos64x64-daily-v1b104.vhd","uuid":"ca5e3f26-e9b6-41c8-a85b-df900be5673c","id":467,"format":"VHD","accountId":2,"checksum":"604a8327bd83850ed621ace2ea84402a","hvm":true,"displayText":"centos
 template created by hans.pl from machine name 
centos-daily-b104","imageDataStore":{"com.cloud.agent.api.to.NfsTO":{"_url":"nfs://.storage..xx.xxx/volumes/pool0/--1-1","_role":"Image"}},"name":"467-2-6c05b599-95ed-34c3-b8f0-fd9c30bac938","hypervisorType":"XenServer"}},"destTO":{"org.apache.cloudstack.storage.to.TemplateObjectTO":{"origUrl":"http://xx.xx.com/image/centos64x64-daily-v1b104.vhd","uuid":"ca5e3f26-e9b6-41c8-a85b-df900be5673c","id":467,"format":"VHD","accountId":2,"checksum":"604a8327bd83850ed621ace2ea84402a","hvm":true,"displayText":"centos
 template created by hans.pl from machine name 
centos-daily-b104","imageDataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"b290385b-466d-3243-a939-3d242164e034","id":208,"poolType":"NetworkFilesystem","host":"..x.net","path":"/volumes/pool0/xx-XEN-1","port":2049}},"name":"467-2-6c05b599-95ed-34c3-b8f0-fd9c30bac938","hypervisorType":"XenServer"}},"executeInSequence":true,"wait":10800}}]
 } to 345052370017







===FILE: server/src/com/cloud/storage/VolumeManagerImpl.java

public void prepare(VirtualMachineProfile vm, DeployDestination dest)
        throws StorageUnavailableException, InsufficientStorageCapacityException,
        ConcurrentOperationException {

    if (dest == null) {
        if (s_logger.isDebugEnabled()) {
            s_logger.debug("DeployDestination cannot be null, cannot prepare Volumes for the vm: "
                    + vm);
        }
        throw new CloudRuntimeException(
                "Unable to prepare Volume for vm because DeployDestination is null, vm:"
                        + vm);
    }

    List vols = _volsDao.findUsableVolumesForInstance(vm.getId());

    if (s_logger.isDebugEnabled()) {
        s_logger.debug("C

[jira] [Created] (CLOUDSTACK-6023) Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits

2014-02-04 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-6023:
--

 Summary: Non windows instances are created on XenServer with a 
vcpu-max above supported xenserver limits
 Key: CLOUDSTACK-6023
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6023
 Project: CloudStack
  Issue Type: Bug
  Security Level: Public (Anyone can view this level - this is the default.)
  Components: XenServer
Affects Versions: 4.2.1
Reporter: Joris van Lieshout
Priority: Blocker


CitrixResourceBase.java contains a hardcoded value for VCPUsMax for non-Windows 
instances:
if (guestOsTypeName.toLowerCase().contains("windows")) {
    vmr.VCPUsMax = (long) vmSpec.getCpus();
} else {
    vmr.VCPUsMax = 32L;
}

For all currently available versions of XenServer the per-VM limit is 16 vCPUs:
http://support.citrix.com/servlet/KbServlet/download/28909-102-664115/XenServer-6.0-Configuration-Limits.pdf
http://support.citrix.com/servlet/KbServlet/download/32312-102-704653/CTX134789%20-%20XenServer%206.1.0_Configuration%20Limits.pdf
http://support.citrix.com/servlet/KbServlet/download/34966-102-706122/CTX137837_XenServer%206_2_0_Configuration%20Limits.pdf

In addition there seems to be a limit on the total number of vCPUs that can be 
assigned on a single XenServer host. A sketch of one possible guard is included below.
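
Purely as a sketch of that guard (XS_MAX_VCPUS_PER_VM is an assumed constant 
derived from the configuration-limit documents above, not an existing CloudStack 
field; a fuller fix would derive the limit from the host's XenServer version):

private static final long XS_MAX_VCPUS_PER_VM = 16L; // per the XenServer configuration limits linked above

if (guestOsTypeName.toLowerCase().contains("windows")) {
    vmr.VCPUsMax = (long) vmSpec.getCpus();
} else {
    // was: vmr.VCPUsMax = 32L; -- never advertise more vCPUs than the hypervisor supports
    vmr.VCPUsMax = XS_MAX_VCPUS_PER_VM;
}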

The impact of this bug is that xapi becomes unstable and keeps losing its 
master_connection because the POST to /remote_db_access is bigger than its 
limit of 200K. This basically renders a pool slave unmanageable. 

If you look at the running instances using xentop you will see them reported 
with 32 vcpus.

Below is the relevant portion of the xensource.log that shows the effect of the 
bug:
[20140204T13:52:17.264Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
R:e58e985539ab|master_connection] stunnel: Using commandline: /usr/sbin/stunnel 
-fd f3b8bb12-4e03-b47a-0dc5-85ad5aef79e6
[20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
R:e58e985539ab|master_connection] stunnel: stunnel has pidty: (FEFork 
(43,30540))
[20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin 
R:e58e985539ab|master_connection] stunnel: stunnel start
[20140204T13:52:17.269Z| info|xenserverhost1|144 inet-RPC|host.call_plugin 
R:e58e985539ab|master_connection] stunnel connected pid=30540 fd=40
[20140204T13:52:17.346Z|error|xenserverhost1|144 inet-RPC|host.call_plugin 
R:e58e985539ab|master_connection] Received HTTP error 500 ({ method = POST; uri 
= /remote_db_access; query = [  ]; content_length = [ 315932 ]; transfer 
encoding = ; version = 1.1; cookie = [ 
pool_secret=386bbf39-8710-4d2d-f452-9725d79c2393/aa7bcda9-8ebb-0cef-bb77-c6b496c5d859/1f928d82-7a20-9117-dd30-f96c7349b16e
 ]; task = ; subtask_of = ; content-type = ; user_agent = xapi/1.9 }) from 
master. This suggests our master address is wrong. Sleeping for 60s and then 
restarting.
[20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update 
D:5c5376f0da6c|master_connection] Caught Master_connection.Goto_handler
[20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update 
D:5c5376f0da6c|master_connection] Connection to master died. I will continue to 
retry indefinitely (supressing future logging of this message).
[20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update 
D:5c5376f0da6c|master_connection] Connection to master died. I will continue to 
retry indefinitely (supressing future logging of this message).
[20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update 
D:5c5376f0da6c|master_connection] Sleeping 2.00 seconds before retrying 
master connection...
[20140204T13:53:20.627Z|debug|xenserverhost1|10|dom0 networking update 
D:5c5376f0da6c|master_connection] stunnel: Using commandline: /usr/sbin/stunnel 
-fd 3c8aed8e-1fce-be7c-09f8-b45cdc40a1f5
[20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update 
D:5c5376f0da6c|master_connection] stunnel: stunnel has pidty: (FEFork 
(23,31207))
[20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update 
D:5c5376f0da6c|master_connection] stunnel: stunnel start
[20140204T13:53:20.632Z| info|xenserverhost1|10|dom0 networking update 
D:5c5376f0da6c|master_connection] stunnel connected pid=31207 fd=20
[20140204T13:53:28.874Z|error|xenserverhost1|4 
unix-RPC|session.login_with_password D:2e7664ad69ed|master_connection] Caught 
Master_connection.Goto_handler
[20140204T13:53:28.874Z|debug|xenserverhost1|4 
unix-RPC|session.login_with_password D:2e7664ad69ed|master_connection] 
Connection to master died. I will continue to retry indefinitely (supressing 
future logging of this message).
[20140204T13:53:28.874Z|error|xenserverhost1|4 
unix-RPC|session.login_with_password D:2e7664ad69ed|master_connection] 
Connection to master died. I will continue to retry indefinitely (supressing 
future logging of this message).
[20140204T13:53:28.8

[jira] [Created] (CLOUDSTACK-6020) createPortForwardingRule fails for vmguestip above 127.255.255.255

2014-02-04 Thread Joris van Lieshout (JIRA)
Joris van Lieshout created CLOUDSTACK-6020:
--

 Summary: createPortForwardingRule fails for vmguestip above 
127.255.255.255
 Key: CLOUDSTACK-6020
 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6020
 Project: CloudStack
  Issue Type: Bug
  Security Level: Public (Anyone can view this level - this is the default.)
  Components: API
Affects Versions: 4.2.0, 4.1.0, 4.0.2, 4.0.1, 4.0.0, pre-4.0.0, 4.1.1, 
Future, 4.2.1, 4.1.2, 4.3.0, 4.4.0
Reporter: Joris van Lieshout


command=createPortForwardingRule&response=json&sessionkey=FmHQb9oGmgKlM4ihB%2Fb2ik7p35E%3D&ipaddressid=d29bebfe-edc1-406f-b4ed-7a49c6e7ee1f&privateport=80&privateendport=80&publicport=80&publicendport=80&protocol=tcp&virtualmachineid=cc5c9dc4-3eeb-4533-994a-0e2636a48a60&openfirewall=false&vmguestip=192.168.1.30&networkid=5e56227c-83c0-4b85-8a27-53343e806d12&_=1391510423905

vmguestip=192.168.1.30

api/src/org/apache/cloudstack/api/command/user/firewall/CreatePortForwardingRuleCmd.java
@Parameter(name = ApiConstants.VM_GUEST_IP, type = CommandType.STRING, required = false,
        description = "VM guest nic Secondary ip address for the port forwarding rule")
private String vmSecondaryIp;

@Override
public void create() {
    // cidr list parameter is deprecated
    if (cidrlist != null) {
        throw new InvalidParameterValueException("Parameter cidrList is deprecated; if you need to open firewall rule for the specific cidr, please refer to createFirewallRule command");
    }

    Ip privateIp = getVmSecondaryIp();
    if (privateIp != null) {
        if (!privateIp.isIp4()) {
            throw new InvalidParameterValueException("Invalid vm ip address");
        }
    }

    try {
        PortForwardingRule result = _rulesService.createPortForwardingRule(this, virtualMachineId, privateIp, getOpenFirewall());
        setEntityId(result.getId());
        setEntityUuid(result.getUuid());
    } catch (NetworkRuleConflictException ex) {
        s_logger.info("Network rule conflict: ", ex);
        s_logger.trace("Network Rule Conflict: ", ex);
        throw new ServerApiException(ApiErrorCode.NETWORK_RULE_CONFLICT_ERROR, ex.getMessage());
    }
}

utils/src/com/cloud/utils/net/Ip.java
public boolean isIp4() {
    return ip < Integer.MAX_VALUE;
}

public Ip(String ip) {
    this.ip = NetUtils.ip2Long(ip);
}

=== ip2Long for 192.168.1.30 => 3232235806

=== Integer.MAX_VALUE => 2^31 - 1 = 2147483647

3232235806 (192.168.1.30) is therefore larger than Integer.MAX_VALUE, so isIp4() 
returns FALSE and an InvalidParameterValueException is thrown. This affects every 
guest IP above 127.255.255.255, i.e. any address whose long value exceeds 2^31 - 1.
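
A minimal sketch of a corrected check (assuming, as the constructor above shows, 
that the ip field is a long holding the unsigned 32-bit value produced by 
NetUtils.ip2Long):

public boolean isIp4() {
    // An IPv4 address occupies 32 unsigned bits: 0 .. 4294967295 (255.255.255.255).
    // Comparing against the signed Integer.MAX_VALUE rejects everything above 127.255.255.255.
    return ip >= 0 && ip <= 0xFFFFFFFFL;
}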



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (CLOUDSTACK-692) The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in the process of being created.

2013-09-17 Thread Joris van Lieshout (JIRA)

 [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris van Lieshout updated CLOUDSTACK-692:
--

Summary: The CleanupSnapshotBackup process on SSVM deletes snapshots that 
are still in the process of being created.  (was: The StorageManager-Scavenger 
deletes snapshots that are still in the process of being created.)

> The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in 
> the process of being created.
> ---
>
> Key: CLOUDSTACK-692
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-692
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Snapshot
>Reporter: Joris van Lieshout
>Priority: Minor
>
> Hi there,
> I think we ran into a bug due to a concurrence of circumstances regarding 
> snapshotting and the cleanup of snapshots.
> The CleanupSnapshotBackup process on the SSVM deletes vhd files that are not 
> known in the database but when, especially long running snapshot, are being 
> copied to secondary storeage there is a gap between the start and finish of 
> the VDI-copy, where the uuid of the destination vhd is not registered in the 
> database. If the CleanupSnapshotBackup deletes the destinaion vhd during this 
> window it results in hanging sparse_dd process on the XenServer hypervisor 
> pointing to a tapdisk2 process with no file behind it.
> ===Secondary storage vm (2 hour time difference due to time zone). second to 
> last line you see the vhd being deleted.
> 2013-09-04 03:14:45,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) 
> Request:Seq 261-1870805144:  { Cmd , MgmtId: 345052370018, via: 261, Ver: v1, 
> Flags: 100011, 
> [{"CleanupSnapshotBackupCommand":{"secondaryStoragePoolURL":"nfs://mccpnas7.storage.mccp.mcinfra.net/volumes/pool0/MCCP-SHARED-1-1","dcId":1,"accountId":45,"volumeId":5863,"validBackupUUIDs":["1a56760b-d1c0-4620-8cf7-271951500d70","b6157bc9-085b-4ed6-95c2-4341f31c64bf","1ff967e3-3606-4112-9155-b1145b2ef576","12fbe4e3-1fdd-4c35-a961-0fce07cff584","278e9915-4f94-40c8-bef4-9c6bc82d4653","6fba1dd7-4736-47b3-9eed-148304c0e192","b9d8c9d8-6445-463b-b4e1-ab3b3f3a67a2","40ba5d72-c69a-46c2-973b-0570c1cabeac","774f2b0e-cdaf-4594-a9f9-4f872dcaad6e","8269f50b-6bec-427c-8186-540df6a75dbf","7b0c6e75-40cf-4dd7-826a-09b39f3da7b5","df7eac9c-137a-4655-9d21-d781916351f1","11ec2db1-a2fc-4221-ae1a-c1ab2bd59509","dfc348e1-af50-4d77-b4a0-6e86fc954e1c","98f64c0f-7498-4c70-8b70-beaefd723b45","c42f9dd5-079d-4b77-86dc-c19b7fbed817"],"wait":0}}]
>  }
> 2013-09-04 03:14:45,722 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) 
> Processing command: com.cloud.agent.api.CleanupSnapshotBackupCommand
> 2013-09-04 03:14:45,723 DEBUG [storage.resource.NfsSecondaryStorageResource] 
> (agentRequest-Handler-2:) Executing: mount 
> 2013-09-04 03:14:45,732 DEBUG [storage.resource.NfsSecondaryStorageResource] 
> (agentRequest-Handler-2:) Execution is successful.
> 2013-09-04 03:14:45,772 WARN  [storage.resource.NfsSecondaryStorageResource] 
> (agentRequest-Handler-2:) snapshot 8ca9fea4-8a98-4cc3-bba7-cc1dcf32bb24.vhd 
> is not recorded in DB, remove it
> 2013-09-04 03:14:45,772 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) 
> Seq 261-1870805144:  { Ans: , MgmtId: 345052370018, via: 261, Ver: v1, Flags: 
> 10, [{"Answer":{"result":true,"wait":0}}] }
>  management-server.log. here you see the snapshot being created, the 
> copyToSecStorage process starting, eventually timing out due to the hanging 
> vdi-copy, failing on retrying because vdi in use (although not existing any 
> more the vdi is still know on xen), retrying some more on another HV and 
> eventuall giving up because it tries to create a duplicate SR.
> 2013-09-04 04:27:10,931 DEBUG [cloud.async.AsyncJobManagerImpl] 
> (Job-Executor-69:job-95137) Executing 
> com.cloud.api.commands.CreateSnapshotCmd for job-95137
> 2013-09-04 04:27:10,971 DEBUG [agent.transport.Request] 
> (Job-Executor-69:job-95137) Seq 91-780303147: Sending  { Cmd , MgmtId: 
> 345052370017, via: 91, Ver: v1, Flags: 100011, 
> [{"ManageSnapshotCommand":{"_commandSwitch":"-c","_volumePath":"9cb7af90-ca88-4b34-aa6f-bc21c3d4a3aa","_pool":{"id":208,"uuid":"b290385b-466d-3243-a939-3d242164e034","host":"mccpnas3-4-vip1.mccp.mcinfra.net","path":"/volumes/pool0/MCCP-S-SBP1-1_MCCP-XEN-1","port":2049,"type":"NetworkFilesystem"},"_snapshotName":"vlstws3_ROOT-2736_20130904022710","_snapshotId":71889,"_vmName":"i-45-2736-VM","wait":0}}]
>  }
> 2013-09-04 04:27:10,971 DEBUG [agent.transport.Request] 
> (Job-Executor-69:job-95137) Seq 91-780303147: Executing:  { Cmd , MgmtId: 
> 345052370017, via: 91, Ver: v1, Flags: 100011, 
> [{"ManageSnapshotComma

[jira] [Updated] (CLOUDSTACK-692) The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in the process of being copied to secondary storage.

2013-09-17 Thread Joris van Lieshout (JIRA)

 [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris van Lieshout updated CLOUDSTACK-692:
--

Summary: The CleanupSnapshotBackup process on SSVM deletes snapshots that 
are still in the process of being copied to secondary storage.  (was: The 
CleanupSnapshotBackup process on SSVM deletes snapshots that are still in the 
process of being created.)

> The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in 
> the process of being copied to secondary storage.
> ---
>
> Key: CLOUDSTACK-692
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-692
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Snapshot
>Reporter: Joris van Lieshout
>Priority: Minor
>
> Hi there,
> I think we ran into a bug due to a concurrence of circumstances regarding 
> snapshotting and the cleanup of snapshots.
> The CleanupSnapshotBackup process on the SSVM deletes vhd files that are not 
> known in the database but when, especially long running snapshot, are being 
> copied to secondary storeage there is a gap between the start and finish of 
> the VDI-copy, where the uuid of the destination vhd is not registered in the 
> database. If the CleanupSnapshotBackup deletes the destinaion vhd during this 
> window it results in hanging sparse_dd process on the XenServer hypervisor 
> pointing to a tapdisk2 process with no file behind it.
> ===Secondary storage vm (2 hour time difference due to time zone). second to 
> last line you see the vhd being deleted.
> 2013-09-04 03:14:45,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) 
> Request:Seq 261-1870805144:  { Cmd , MgmtId: 345052370018, via: 261, Ver: v1, 
> Flags: 100011, 
> [{"CleanupSnapshotBackupCommand":{"secondaryStoragePoolURL":"nfs://mccpnas7.storage.mccp.mcinfra.net/volumes/pool0/MCCP-SHARED-1-1","dcId":1,"accountId":45,"volumeId":5863,"validBackupUUIDs":["1a56760b-d1c0-4620-8cf7-271951500d70","b6157bc9-085b-4ed6-95c2-4341f31c64bf","1ff967e3-3606-4112-9155-b1145b2ef576","12fbe4e3-1fdd-4c35-a961-0fce07cff584","278e9915-4f94-40c8-bef4-9c6bc82d4653","6fba1dd7-4736-47b3-9eed-148304c0e192","b9d8c9d8-6445-463b-b4e1-ab3b3f3a67a2","40ba5d72-c69a-46c2-973b-0570c1cabeac","774f2b0e-cdaf-4594-a9f9-4f872dcaad6e","8269f50b-6bec-427c-8186-540df6a75dbf","7b0c6e75-40cf-4dd7-826a-09b39f3da7b5","df7eac9c-137a-4655-9d21-d781916351f1","11ec2db1-a2fc-4221-ae1a-c1ab2bd59509","dfc348e1-af50-4d77-b4a0-6e86fc954e1c","98f64c0f-7498-4c70-8b70-beaefd723b45","c42f9dd5-079d-4b77-86dc-c19b7fbed817"],"wait":0}}]
>  }
> 2013-09-04 03:14:45,722 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) 
> Processing command: com.cloud.agent.api.CleanupSnapshotBackupCommand
> 2013-09-04 03:14:45,723 DEBUG [storage.resource.NfsSecondaryStorageResource] 
> (agentRequest-Handler-2:) Executing: mount 
> 2013-09-04 03:14:45,732 DEBUG [storage.resource.NfsSecondaryStorageResource] 
> (agentRequest-Handler-2:) Execution is successful.
> 2013-09-04 03:14:45,772 WARN  [storage.resource.NfsSecondaryStorageResource] 
> (agentRequest-Handler-2:) snapshot 8ca9fea4-8a98-4cc3-bba7-cc1dcf32bb24.vhd 
> is not recorded in DB, remove it
> 2013-09-04 03:14:45,772 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) 
> Seq 261-1870805144:  { Ans: , MgmtId: 345052370018, via: 261, Ver: v1, Flags: 
> 10, [{"Answer":{"result":true,"wait":0}}] }
>  management-server.log. here you see the snapshot being created, the 
> copyToSecStorage process starting, eventually timing out due to the hanging 
> vdi-copy, failing on retrying because vdi in use (although not existing any 
> more the vdi is still know on xen), retrying some more on another HV and 
> eventuall giving up because it tries to create a duplicate SR.
> 2013-09-04 04:27:10,931 DEBUG [cloud.async.AsyncJobManagerImpl] 
> (Job-Executor-69:job-95137) Executing 
> com.cloud.api.commands.CreateSnapshotCmd for job-95137
> 2013-09-04 04:27:10,971 DEBUG [agent.transport.Request] 
> (Job-Executor-69:job-95137) Seq 91-780303147: Sending  { Cmd , MgmtId: 
> 345052370017, via: 91, Ver: v1, Flags: 100011, 
> [{"ManageSnapshotCommand":{"_commandSwitch":"-c","_volumePath":"9cb7af90-ca88-4b34-aa6f-bc21c3d4a3aa","_pool":{"id":208,"uuid":"b290385b-466d-3243-a939-3d242164e034","host":"mccpnas3-4-vip1.mccp.mcinfra.net","path":"/volumes/pool0/MCCP-S-SBP1-1_MCCP-XEN-1","port":2049,"type":"NetworkFilesystem"},"_snapshotName":"vlstws3_ROOT-2736_20130904022710","_snapshotId":71889,"_vmName":"i-45-2736-VM","wait":0}}]
>  }
> 2013-09-04 04:27:10,971 DEBUG [agent.transport.Request] 
> (Job-Executor-69:job-95137) Seq 91-780303147: Executing:  { Cmd , MgmtId: 

[jira] [Commented] (CLOUDSTACK-692) The StorageManager-Scavenger deletes snapshots that are still in the process of being created.

2013-09-17 Thread Joris van Lieshout (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769529#comment-13769529
 ] 

Joris van Lieshout commented on CLOUDSTACK-692:
---

How to clean up on XenServer after you have hit this bug:

1.  Find the sparse_dd process
ps -ef | grep sparse_dd
2.  Find the vbd of the destination sparse_dd device
xe vbd-list device=${dest device f.i. xvbd} vm-uuid=${UUID of Dom0}
3.  Find tapdisk2 process for this vbd
xe vbd-param-get uuid=${UUID VBD step2} param-name=vdi-uuid
tap-ctl list | grep ${uuid of VDI}
ls ${path of vhd from tap-ctl list}
4.  Also get the uuid and name of the SR for later use
xe vdi-param-get uuid=${ uuid of VDI } param-name=sr-uuid
xe vdi-param-get uuid=${ uuid of VDI } param-name=sr-name-label
5.  ONLY continue if the vhd does not exist
6.  Create a dummy file to make the cleanup process go smooth
touch ${path of vhd from tap-ctl list but with .raw instead of .vhd}
7.  Kill the sparse_dd process
kill -9 ${PID of sparse_dd process step 1}
8.  !!! It can take up to 10 minutes for this process to be killed. Only 
continue when the process is gone !!!
ps -ef | grep ${PID of sparse_dd process step 1}
9.  Close, detach and free the tapdisk2 process. Get your info from the 
previous tap-ctl list
tap-ctl close -m ${TAPMINOR} -p ${TAPPID}
tap-ctl detach -m ${TAPMINOR} -p ${TAPPID}
tap-ctl free -m ${TAPMINOR}
10. Now unplug the vbd but put it in background because the process 
sometimes hangs
xe vbd-unplug uuid=${uuid of VBD} &
11. If the vbd unplug hangs, check /var/log/xensource.log to see if it hangs on 
"watching xenstore paths: [ /local/domain/0/backend/vbd/0/51712/shutdown-done; 
/local/domain/0/error/device/vbd/51712/error ] with timeout 1200.00 seconds" 
by searching for the last line containing VBD.unplug. If so, AND ONLY IF SO, 
execute:
xenstore-write /local/domain/0/backend/vbd/0/${get this from the 
xensource.log}/shutdown-done Ok
12. It's now safe to forget all the VDIs, unplug the PBD and forget the SR. 
The script below will also do this on other HVs in the cluster if CS has tried 
snapshotting there.
DESTSRs=`xe sr-list name-label=${name-label of sr (looks like uuid) from step 
4, not the uuid of the sr.} --minimal | tr "," "\n"`
for SRloop in $DESTSRs
do
  PBD=`xe sr-param-get uuid=$SRloop param-name=PBDs`
  VDIs=`xe sr-param-get uuid=$SRloop param-name=VDIs | sed 's/;\ */\n/g'`
  for VDIloop in $VDIs
  do
echo "  Forgetting VDI $VDIloop"
xe vdi-forget uuid=$VDIloop
  done
echo "  Unplugging PBD $PBD"
xe pbd-unplug uuid=$PBD
echo "  Forgetting SR $SRloop"
xe sr-forget uuid=$SRloop
done
13. And now everything is ready for another snapshot attempt. Let's hope 
the Storage Cleanup process keeps its cool. ;)


> The StorageManager-Scavenger deletes snapshots that are still in the process 
> of being created.
> --
>
> Key: CLOUDSTACK-692
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-692
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Snapshot
>Reporter: Joris van Lieshout
>Priority: Minor
>
> Hi there,
> I think we ran into a bug due to a concurrence of circumstances regarding 
> snapshotting and the cleanup of snapshots.
> The CleanupSnapshotBackup process on the SSVM deletes vhd files that are not 
> known in the database but when, especially long running snapshot, are being 
> copied to secondary storeage there is a gap between the start and finish of 
> the VDI-copy, where the uuid of the destination vhd is not registered in the 
> database. If the CleanupSnapshotBackup deletes the destinaion vhd during this 
> window it results in hanging sparse_dd process on the XenServer hypervisor 
> pointing to a tapdisk2 process with no file behind it.
> ===Secondary storage vm (2 hour time difference due to time zone). second to 
> last line you see the vhd being deleted.
> 2013-09-04 03:14:45,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) 
> Request:Seq 261-1870805144:  { Cmd , MgmtId: 345052370018, via: 261, Ver: v1, 
> Flags: 100011, 
> [{"CleanupSnapshotBackupCommand":{"secondaryStoragePoolURL":"nfs://mccpnas7.storage.mccp.mcinfra.net/volumes/pool0/MCCP-SHARED-1-1","dcId":1,"accountId":45,"volumeId":5863,"validBackupUUIDs":["1a56760b-d1c0-4620-8cf7-271951500d70","b6157bc9-085b-4ed6-95c2-4341f31c64bf","1ff967e3-3606-4112-9155-b1145b2ef576","12fbe4e3-1fdd-4c35-a961-0fce07cff584","278e9915-4f94-40c8-bef4-9c6bc82d4653","6fba1dd7-4736-47b3-9eed-148304c0e192","b9d8c9d8-6445-463b-b4e1-ab3b3f3a67a2","40ba5d

[jira] [Updated] (CLOUDSTACK-692) The StorageManager-Scavenger deletes snapshots that are still in the process of being created.

2013-09-17 Thread Joris van Lieshout (JIRA)

 [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris van Lieshout updated CLOUDSTACK-692:
--

Description: 
Hi there,

I think we ran into a bug due to a concurrence of circumstances regarding 
snapshotting and the cleanup of snapshots.

The CleanupSnapshotBackup process on the SSVM deletes vhd files that are not 
known in the database. But while a snapshot, especially a long-running one, is 
being copied to secondary storage there is a gap between the start and the finish 
of the VDI-copy during which the uuid of the destination vhd is not yet registered 
in the database. If CleanupSnapshotBackup deletes the destination vhd during this 
window it results in a hanging sparse_dd process on the XenServer hypervisor, 
pointing to a tapdisk2 process with no file behind it.
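
The race boils down to "not yet in the DB" being treated as "safe to delete". As a 
purely illustrative sketch (the class, its names and the one-hour grace period below 
are assumptions of this example, not the actual NfsSecondaryStorageResource code), 
the cleanup could additionally skip recently written files so an in-flight VDI-copy 
is not pulled away mid-write:

import java.io.File;
import java.util.Set;

// Illustrative sketch only: skip deleting snapshot vhds that are unknown to the
// DB but were written recently, because they may belong to an in-flight copy.
public class SnapshotCleanupGuard {

    private static final long GRACE_PERIOD_MS = 60L * 60L * 1000L; // assumed 1 hour

    // Returns true only if the file is not referenced in the DB AND old enough
    // to be considered abandoned.
    public static boolean safeToDelete(File vhd, Set<String> validBackupUuids) {
        String uuid = vhd.getName().replace(".vhd", "");
        if (validBackupUuids.contains(uuid)) {
            return false; // still known to the management server
        }
        long ageMs = System.currentTimeMillis() - vhd.lastModified();
        return ageMs > GRACE_PERIOD_MS; // a freshly written file may still be copying
    }
}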

===Secondary storage vm (2 hour time difference due to time zone). On the second 
to last line you see the vhd being deleted.
2013-09-04 03:14:45,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) 
Request:Seq 261-1870805144:  { Cmd , MgmtId: 345052370018, via: 261, Ver: v1, 
Flags: 100011, 
[{"CleanupSnapshotBackupCommand":{"secondaryStoragePoolURL":"nfs://mccpnas7.storage.mccp.mcinfra.net/volumes/pool0/MCCP-SHARED-1-1","dcId":1,"accountId":45,"volumeId":5863,"validBackupUUIDs":["1a56760b-d1c0-4620-8cf7-271951500d70","b6157bc9-085b-4ed6-95c2-4341f31c64bf","1ff967e3-3606-4112-9155-b1145b2ef576","12fbe4e3-1fdd-4c35-a961-0fce07cff584","278e9915-4f94-40c8-bef4-9c6bc82d4653","6fba1dd7-4736-47b3-9eed-148304c0e192","b9d8c9d8-6445-463b-b4e1-ab3b3f3a67a2","40ba5d72-c69a-46c2-973b-0570c1cabeac","774f2b0e-cdaf-4594-a9f9-4f872dcaad6e","8269f50b-6bec-427c-8186-540df6a75dbf","7b0c6e75-40cf-4dd7-826a-09b39f3da7b5","df7eac9c-137a-4655-9d21-d781916351f1","11ec2db1-a2fc-4221-ae1a-c1ab2bd59509","dfc348e1-af50-4d77-b4a0-6e86fc954e1c","98f64c0f-7498-4c70-8b70-beaefd723b45","c42f9dd5-079d-4b77-86dc-c19b7fbed817"],"wait":0}}]
 }
2013-09-04 03:14:45,722 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) 
Processing command: com.cloud.agent.api.CleanupSnapshotBackupCommand
2013-09-04 03:14:45,723 DEBUG [storage.resource.NfsSecondaryStorageResource] 
(agentRequest-Handler-2:) Executing: mount 
2013-09-04 03:14:45,732 DEBUG [storage.resource.NfsSecondaryStorageResource] 
(agentRequest-Handler-2:) Execution is successful.
2013-09-04 03:14:45,772 WARN  [storage.resource.NfsSecondaryStorageResource] 
(agentRequest-Handler-2:) snapshot 8ca9fea4-8a98-4cc3-bba7-cc1dcf32bb24.vhd is 
not recorded in DB, remove it
2013-09-04 03:14:45,772 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) Seq 
261-1870805144:  { Ans: , MgmtId: 345052370018, via: 261, Ver: v1, Flags: 10, 
[{"Answer":{"result":true,"wait":0}}] }


 management-server.log. Here you see the snapshot being created and the 
copyToSecStorage process starting, eventually timing out due to the hanging 
vdi-copy, failing on retry because the vdi is in use (although it no longer 
exists, the vdi is still known to xen), retrying some more on another HV and 
eventually giving up because it tries to create a duplicate SR.
2013-09-04 04:27:10,931 DEBUG [cloud.async.AsyncJobManagerImpl] 
(Job-Executor-69:job-95137) Executing com.cloud.api.commands.CreateSnapshotCmd 
for job-95137
2013-09-04 04:27:10,971 DEBUG [agent.transport.Request] 
(Job-Executor-69:job-95137) Seq 91-780303147: Sending  { Cmd , MgmtId: 
345052370017, via: 91, Ver: v1, Flags: 100011, 
[{"ManageSnapshotCommand":{"_commandSwitch":"-c","_volumePath":"9cb7af90-ca88-4b34-aa6f-bc21c3d4a3aa","_pool":{"id":208,"uuid":"b290385b-466d-3243-a939-3d242164e034","host":"mccpnas3-4-vip1.mccp.mcinfra.net","path":"/volumes/pool0/MCCP-S-SBP1-1_MCCP-XEN-1","port":2049,"type":"NetworkFilesystem"},"_snapshotName":"vlstws3_ROOT-2736_20130904022710","_snapshotId":71889,"_vmName":"i-45-2736-VM","wait":0}}]
 }
2013-09-04 04:27:10,971 DEBUG [agent.transport.Request] 
(Job-Executor-69:job-95137) Seq 91-780303147: Executing:  { Cmd , MgmtId: 
345052370017, via: 91, Ver: v1, Flags: 100011, 
[{"ManageSnapshotCommand":{"_commandSwitch":"-c","_volumePath":"9cb7af90-ca88-4b34-aa6f-bc21c3d4a3aa","_pool":{"id":208,"uuid":"b290385b-466d-3243-a939-3d242164e034","host":"mccpnas3-4-vip1.mccp.mcinfra.net","path":"/volumes/pool0/MCCP-S-SBP1-1_MCCP-XEN-1","port":2049,"type":"NetworkFilesystem"},"_snapshotName":"vlstws3_ROOT-2736_20130904022710","_snapshotId":71889,"_vmName":"i-45-2736-VM","wait":0}}]
 }
2013-09-04 04:27:12,949 DEBUG [agent.transport.Request] 
(Job-Executor-69:job-95137) Seq 91-780303147: Received:  { Ans: , MgmtId: 
345052370017, via: 91, Ver: v1, Flags: 10, { ManageSnapshotAnswer } }
2013-09-04 04:27:12,991 DEBUG [agent.transport.Request] 
(Job-Executor-69:job-95137) Seq 91-780303148: Sending  { Cmd , MgmtId: 
345052370017, via: 91, Ver: v1, Flags: 100011, 
[{"BackupSnapshotCommand":{"isVolumeInactive":false,"vmName":"i-45-2736-VM","snapshotId":71889,"pool":{"id":208,"uuid":"b2903

[jira] [Updated] (CLOUDSTACK-692) The StorageManager-Scavenger deletes snapshots that are still in the process of being created.

2013-09-17 Thread Joris van Lieshout (JIRA)

 [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris van Lieshout updated CLOUDSTACK-692:
--

Summary: The StorageManager-Scavenger deletes snapshots that are still in 
the process of being created.  (was: The StorageManager-Scavenger deletes 
snapshots that are still in the process of being created at that time when the 
volume has older snapshots that do need scavenging)

> The StorageManager-Scavenger deletes snapshots that are still in the process 
> of being created.
> --
>
> Key: CLOUDSTACK-692
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-692
> Project: CloudStack
>  Issue Type: Bug
>  Security Level: Public(Anyone can view this level - this is the 
> default.) 
>  Components: Snapshot
>Reporter: Joris van Lieshout
>Priority: Minor
>
> Hi there,
> I think we ran into a bug due to a concurrence of circumstances regarding 
> snapshotting and the cleanup of snapshots.
> The StorageManager-Scavenger instructs the StorageVM to delete a snapshot 
> that is still in the process of being created on a hypervisor at that time 
> when the volume has older snapshots that do need scavenging.
>  The SR gets mounted for the snapshot to be created on.
> 2012-12-16 08:02:53,831 DEBUG [xen.resource.CitrixResourceBase] 
> (DirectAgent-293:null) Host 192.168.###.42 
> OpaqueRef:fae7f8be-8cf1-7b84-3d30-7202e172b530: Created a SR; UUID is 
> 1f7530d8-4615-c220-7f37-0
> 5862ddbfe3b device config is 
> {serverpath=/pool0/-###-dc-1-sec1/snapshots/163/1161, 
> server=192.168.###.14}
>  The SMlog on the xenserver show that at this time the snapshot is still 
> being created.
> 2012-12-16 08:37:08,768 DEBUG [agent.transport.Request] 
> (StorageManager-Scavenger-1:null) Seq 159-1958616345: Sending  { Cmd , 
> MgmtId: 345052433504, via: 159, Ver: v1, Flags: 100011, [{"CleanupSnapshot
> BackupCommand":{"secondaryStoragePoolURL":"nfs://192.168.###.14/pool0/-###-dc-1-sec1","dcId":2,"accountId":163,"volumeId":1161,"validBackupUUIDs":["b714a0ee-406e-4100-a75d-bc594391dca9","209bc1dd-f6
> 1a-486c-aecf-335590a907eb"],"wait":0}}] }
>  At this time we start seeing tapdisk errors on the XenServer indicating 
> that the vhd file is gone.
> Dec 16 08:37:08 vm8 tapdisk[26553]: ERROR: errno -116 at vhd_complete: 
> /var/run/sr-mount/1f7530d8-4615-c220-7f37-05862ddbfe3b/073893a6-e9cb-4cf6-8070-c6cf771db5d7.vhd:
>  op: 2, lsec: 448131408, secs:
> 88, nbytes: 45056, blk: 109407, blk_offset: 330368935
> Dec 16 08:37:08 vm8 tapdisk[26553]: ERROR: errno -116 at vhd_complete: 
> /var/run/sr-mount/1f7530d8-4615-c220-7f37-05862ddbfe3b/073893a6-e9cb-4cf6-8070-c6cf771db5d7.vhd:
>  op: 2, lsec: 448131496, secs: 40, nbytes: 20480, blk: 109407, blk_offset: 
> 330368935
> Dec 16 08:37:08 vm8 tapdisk[26553]: ERROR: errno -116 at vhd_complete: 
> /var/run/sr-mount/1f7530d8-4615-c220-7f37-05862ddbfe3b/073893a6-e9cb-4cf6-8070-c6cf771db5d7.vhd:
>  op: 4, lsec: 448131072, secs: 1, nbytes: 512, blk: 109407, blk_offset: 
> 330368935
> Dec 16 08:37:08 vm8 tapdisk[26553]: ERROR: errno -116 at 
> __tapdisk_vbd_complete_td_request: req tap-77.0: write 0x0058 secs @ 
> 0x1ab5f150 - Stale NFS file handle
> Dec 16 08:37:08 vm8 tapdisk[26553]: ERROR: errno -116 at 
> __tapdisk_vbd_complete_td_request: req tap-77.1: write 0x0028 secs @ 
> 0x1ab5f1a8 - Stale NFS file handle

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira