[jira] [Commented] (CLOUDSTACK-7857) CitrixResourceBase wrongly calculates total memory on hosts with a lot of memory and large Dom0
[ https://issues.apache.org/jira/browse/CLOUDSTACK-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216355#comment-14216355 ] Joris van Lieshout commented on CLOUDSTACK-7857:

I'm not too familiar with memory overhead on other hypervisors; you would think the formula would be somewhat the same. I understand that ACS has to be as flexible as possible, but what if the logic for calculating free memory is moved to the hypervisor plugin, so that the calculation can be hypervisor-specific while the outcome is used by the generic processes in the same way? I'm not a developer, so my apologies if my comment does not make sense. In the end any hypervisor should be able to provide some information about available memory, either by calculation or with a direct metric. Perhaps this will always be something hypervisor-specific...?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CLOUDSTACK-7857) CitrixResourceBase wrongly calculates total memory on hosts with a lot of memory and large Dom0
[ https://issues.apache.org/jira/browse/CLOUDSTACK-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214498#comment-14214498 ] Joris van Lieshout commented on CLOUDSTACK-7857:

Hi Anthony, I agree that there is no reliable way to do this beforehand, so isn't it better to do it whenever an instance is started on or migrated to a host, or to recalculate the free-memory metric every couple of minutes (for instance as part of the stats collection cycle)? The formula that XenCenter uses for this seems pretty easy and spot on. This would also reduce the number of times a retry mechanism has to kick in for other actions. On that note, the retry mechanism you are referring to does not seem to apply to HA workers created by the process that puts a host in maintenance. It also feels to me that this is more of a workaround than a clean solution, mostly because host free memory can be recalculated quickly and easily when needed. And concerning the allocation threshold: if I'm not mistaken, this does not apply to HA workers, which are used whenever you put a host into maintenance. Additionally, the instance being migrated is already in the cluster, so this threshold is not hit during PrepareForMaintenance.
[jira] [Commented] (CLOUDSTACK-7857) CitrixResourceBase wrongly calculates total memory on hosts with a lot of memory and large Dom0
[ https://issues.apache.org/jira/browse/CLOUDSTACK-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209545#comment-14209545 ] Joris van Lieshout commented on CLOUDSTACK-7857:

Hi Rohit, I did some digging around in the XenCenter code and found a possible solution there. But there is a challenge, I think: the overhead is dynamic, based on the instances running on the host, while at the moment ACS calculates this overhead only at host thread startup. This is what I found in the XenCenter code:
https://github.com/xenserver/xenadmin/blob/a0d31920c5ac62eda9713228043a834ba7829986/XenModel/XenAPI-Extensions/Host.cs#L1071

    public long xen_memory_calc
    {
        get
        {
            if (!Helpers.MidnightRideOrGreater(Connection))
            {
                Host_metrics host_metrics = Connection.Resolve(this.metrics);
                if (host_metrics == null)
                    return 0;
                long totalused = 0;
                foreach (VM vm in Connection.ResolveAll(resident_VMs))
                {
                    VM_metrics vmMetrics = vm.Connection.Resolve(vm.metrics);
                    if (vmMetrics != null)
                        totalused += vmMetrics.memory_actual;
                }
                return host_metrics.memory_total - totalused - host_metrics.memory_free;
            }
            long xen_mem = memory_overhead;
            foreach (VM vm in Connection.ResolveAll(resident_VMs))
            {
                xen_mem += vm.memory_overhead;
                if (vm.is_control_domain)
                {
                    VM_metrics vmMetrics = vm.Connection.Resolve(vm.metrics);
                    if (vmMetrics != null)
                        xen_mem += vmMetrics.memory_actual;
                }
            }
            return xen_mem;
        }
    }

We can skip the first part because, if I'm not mistaken, ACS only supports XS 5.6 and up (XS 5.6 = MidnightRide). In short, the formula is something like this:

    xen_mem = host_memory_overhead + residentVMs_memory_overhead + dom0_memory_actual

Here is a list of xe commands that will get you the correct numbers to summarize:

host_mem_overhead:
    xe host-list name-label=$HOSTNAME params=memory-overhead --minimal
residentVMs_memory_overhead:
    xe vm-list resident-on=$(xe host-list name-label=$HOSTNAME --minimal) params=memory-overhead --minimal
dom0_memory_actual:
    xe vm-list resident-on=$(xe host-list name-label=$HOSTNAME --minimal) is-control-domain=true params=memory-actual --minimal
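The simplified formula above can be sketched in Java. This is a minimal illustration with hypothetical stand-in types, not CloudStack or XenAPI code; the field names mirror the XenAPI metrics quoted above:

```java
import java.util.List;

public class XenMemoryDemo {
    // Hypothetical stand-in for a resident VM's XenAPI metrics
    record Vm(long memoryOverhead, long memoryActual, boolean isControlDomain) {}

    // xen_mem = host_memory_overhead + sum(resident VM memory_overhead)
    //         + dom0 memory_actual (the post-MidnightRide branch above)
    static long xenMemory(long hostMemoryOverhead, List<Vm> residentVms) {
        long xenMem = hostMemoryOverhead;
        for (Vm vm : residentVms) {
            xenMem += vm.memoryOverhead();
            if (vm.isControlDomain()) {
                xenMem += vm.memoryActual(); // dom0's actual memory counts as overhead too
            }
        }
        return xenMem;
    }

    public static void main(String[] args) {
        // Illustrative numbers: 400MB host overhead, one guest with 16MB
        // overhead, dom0 with 8MB overhead and 4GB actual memory
        List<Vm> vms = List.of(
                new Vm(16L << 20, 2L << 30, false),
                new Vm(8L << 20, 4L << 30, true));
        System.out.println(xenMemory(400L << 20, vms));
    }
}
```

Because the guest set changes at runtime, this sum would have to be recomputed whenever it is consulted, which is the crux of the comment above.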
[jira] [Commented] (CLOUDSTACK-7853) Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert
[ https://issues.apache.org/jira/browse/CLOUDSTACK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204744#comment-14204744 ] Joris van Lieshout commented on CLOUDSTACK-7853:

What I just saw in our management log is that 3 minutes before the management server found the host behind on ping, the cluster was put in Unmanage mode (XenServer patching maintenance). I also noticed that the AgentTaskPool thread that would do the investigation you mention was not triggered for this host. I don't know if this is because it was busy or because the agent thread was destroyed after the cluster was put in Unmanage. This is how I now believe it went:
1. Cluster was set to Unmanage.
2. Host rebooted (the brand of physical boxes we use needs at least 10 minutes to reboot).
3. Host got behind on ping in the meantime.
4. Host state transitioned from Disconnected to Alert via PingTimeout.
5. On the next AgentMonitor cycle a transition was attempted from Alert via PingTimeout; that is an unknown transition, so an exception was thrown.
6. Host returned from reboot and the cluster was set to Manage again.
7. Due to this invalid state transition the host never transitioned from Alert to something else.
[jira] [Created] (CLOUDSTACK-7857) CitrixResourceBase wrongly calculates total memory on hosts with a lot of memory and large Dom0
Joris van Lieshout created CLOUDSTACK-7857:

Summary: CitrixResourceBase wrongly calculates total memory on hosts with a lot of memory and large Dom0
Key: CLOUDSTACK-7857
URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7857
Project: CloudStack
Issue Type: Bug
Security Level: Public (Anyone can view this level - this is the default.)
Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
Reporter: Joris van Lieshout
Priority: Blocker

We have hosts with 256GB memory and a 4GB dom0. During startup ACS calculates available memory using this formula (CitrixResourceBase.java, protected void fillHostInfo):

    ram = (long) ((ram - dom0Ram - _xs_memory_used) * _xs_virtualization_factor);

In our situation:
    ram = 274841497600
    dom0Ram = 4269801472
    _xs_memory_used = 128 * 1024 * 1024L = 134217728
    _xs_virtualization_factor = 63.0/64.0 = 0.984375
    (274841497600 - 4269801472 - 134217728) * 0.984375 = 266211892800

This is in fact not the actual amount of memory available for instances. The difference in our situation is a little less than 1GB. On this particular hypervisor Dom0+Xen uses about 9GB.

As the comment above the definition of XsMemoryUsed already states, it's time to review this logic:
"//Hypervisor specific params with generic value, may need to be overridden for specific versions"

The effect of this bug is that when you put a hypervisor in maintenance it might try to move instances (usually small instances (<1GB)) to a host that in fact does not have enough free memory. This exception is thrown:

ERROR [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-09aca6e9 work-8981) Terminating HAWork[8981-Migration-4482-Running-Migrating]
com.cloud.utils.exception.CloudRuntimeException: Unable to migrate due to Catch Exception com.cloud.utils.exception.CloudRuntimeException: Migration failed due to com.cloud.utils.exception.CloudRuntimeException: Unable to migrate VM(r-4482-VM) from host(6805d06c-4d5b-4438-a245-7915e93041d9) due to Task failed! Task record:
    uuid: 645b63c8-1426-b412-7b6a-13d61ee7ab2e
    nameLabel: Async.VM.pool_migrate
    nameDescription:
    allowedOperations: []
    currentOperations: {}
    created: Thu Nov 06 13:44:14 CET 2014
    finished: Thu Nov 06 13:44:14 CET 2014
    status: failure
    residentOn: com.xensource.xenapi.Host@b42882c6
    progress: 1.0
    type:
    result:
    errorInfo: [HOST_NOT_ENOUGH_FREE_MEMORY, 272629760, 263131136]
    otherConfig: {}
    subtaskOf: com.xensource.xenapi.Task@aaf13f6f
    subtasks: []
at com.cloud.vm.VirtualMachineManagerImpl.migrate(VirtualMachineManagerImpl.java:1840)
at com.cloud.vm.VirtualMachineManagerImpl.migrateAway(VirtualMachineManagerImpl.java:2214)
at com.cloud.ha.HighAvailabilityManagerImpl.migrate(HighAvailabilityManagerImpl.java:610)
at com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.runWithContext(HighAvailabilityManagerImpl.java:865)
at com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.access$000(HighAvailabilityManagerImpl.java:822)
at com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread$1.run(HighAvailabilityManagerImpl.java:834)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.run(HighAvailabilityManagerImpl.java:831)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
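The arithmetic in the report can be checked directly. A standalone sketch reproducing the fillHostInfo expression with the reported values (the constant names follow the excerpt above):

```java
public class XsMemoryCalc {
    // Values reported in the issue, in bytes
    static final long RAM = 274841497600L;
    static final long DOM0_RAM = 4269801472L;
    static final long XS_MEMORY_USED = 128 * 1024 * 1024L;      // 134217728
    static final double XS_VIRTUALIZATION_FACTOR = 63.0 / 64.0; // 0.984375

    // Same expression as CitrixResourceBase.fillHostInfo
    static long availableRam() {
        return (long) ((RAM - DOM0_RAM - XS_MEMORY_USED) * XS_VIRTUALIZATION_FACTOR);
    }

    public static void main(String[] args) {
        System.out.println(availableRam()); // 266211892800
    }
}
```

The result, about 248GB, exceeds what XenServer can actually hand to guests on this host, because the fixed 128MB constant and the 63/64 factor do not account for the roughly 9GB that Dom0+Xen really consume.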
[jira] [Created] (CLOUDSTACK-7853) Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert
Joris van Lieshout created CLOUDSTACK-7853:

Summary: Hosts that are temporarily Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert
Key: CLOUDSTACK-7853
URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7853
Project: CloudStack
Issue Type: Bug
Security Level: Public (Anyone can view this level - this is the default.)
Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
Reporter: Joris van Lieshout
Priority: Critical

If for some reason (I've been unable to determine why, but my suspicion is that the management server is busy processing other agent requests and/or xapi is temporarily unavailable) a host that is Disconnected gets behind on ping (PingTimeout), it is transitioned to a permanent state of Alert.

INFO [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-9551e174) Found the following agents behind on ping: [421, 427, 425]
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Ping timeout for host 421, do invstigation
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xx1]
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Agent status update: [id = 421; name = xx1; old status = Disconnected; event = PingTimeout; new status = Alert; old update count = 111; new update count = 112]

/ next cycle /

INFO [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Found the following agents behind on ping: [421, 427, 425]
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Ping timeout for host 421, do invstigation
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xx1]
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Cannot transit agent status with event PingTimeout for host 421, name=xx1, mangement server id is 345052370017
ERROR [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Caught the following exception:
com.cloud.utils.exception.CloudRuntimeException: Cannot transit agent status with event PingTimeout for host 421, mangement server id is 345052370017,Unable to transition to a new state from Alert via PingTimeout
at com.cloud.agent.manager.AgentManagerImpl.agentStatusTransitTo(AgentManagerImpl.java:1334)
at com.cloud.agent.manager.AgentManagerImpl.disconnectAgent(AgentManagerImpl.java:1349)
at com.cloud.agent.manager.AgentManagerImpl.disconnectInternal(AgentManagerImpl.java:1378)
at com.cloud.agent.manager.AgentManagerImpl.disconnectWithInvestigation(AgentManagerImpl.java:1384)
at com.cloud.agent.manager.AgentManagerImpl$MonitorTask.runInContext(AgentManagerImpl.java:1466)
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)

I think the bug occurs because there is no valid state transition from Alert via PingTimeout to something recoverable. Status.java:

    s_fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
    s_fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
    s_fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
    s_fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
    s_fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
    s_fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);

As a workaround to get out of this situation we put the cluster in Unmanage, wait 10 minutes, and put the cluster back in Manage.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
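The missing transition can be demonstrated with a small table lookup. This is a sketch of the lookup semantics only, not CloudStack's actual StateMachine class: the Alert rows quoted from Status.java have no PingTimeout entry, so the lookup yields nothing and agentStatusTransitTo throws.

```java
import java.util.HashMap;
import java.util.Map;

public class AlertFsmDemo {
    enum Status { Up, Connecting, Disconnected, Alert, Removed }
    enum Event { AgentConnected, Ping, Remove, ManagementServerDown,
                 AgentDisconnected, ShutdownRequested, PingTimeout }

    // Mirrors the Status.java excerpt: every registered transition out of Alert
    static final Map<Event, Status> FROM_ALERT = new HashMap<>();
    static {
        FROM_ALERT.put(Event.AgentConnected, Status.Connecting);
        FROM_ALERT.put(Event.Ping, Status.Up);
        FROM_ALERT.put(Event.Remove, Status.Removed);
        FROM_ALERT.put(Event.ManagementServerDown, Status.Alert);
        FROM_ALERT.put(Event.AgentDisconnected, Status.Alert);
        FROM_ALERT.put(Event.ShutdownRequested, Status.Disconnected);
        // No entry for Event.PingTimeout: a host already in Alert that times
        // out again has no next state, matching the exception in the log above
    }

    // null means "unable to transition to a new state from Alert via <event>"
    static Status next(Event e) {
        return FROM_ALERT.get(e);
    }
}
```

In this model the AgentMonitor cycle keeps firing PingTimeout against a host stuck in Alert, and every lookup misses, so the host can never leave Alert without outside intervention.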
[jira] [Commented] (CLOUDSTACK-7839) Unable to live migrate an instance to another host in a cluster from which the template has been deleted
[ https://issues.apache.org/jira/browse/CLOUDSTACK-7839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196092#comment-14196092 ] Joris van Lieshout commented on CLOUDSTACK-7839:

Additional information: the public boolean storagePoolHasEnoughSpace in StorageManagerImpl.java has a loop that goes through all volumes. The second if statement in the loop is where the NullPointerException is thrown, because _templateDao.findById returns no template:

    for (Volume volume : volumes) {
        if (volume.getTemplateId() != null) {
            VMTemplateVO tmpl = _templateDao.findById(volume.getTemplateId());
            if (tmpl.getFormat() != ImageFormat.ISO) {
                allocatedSizeWithtemplate = _capacityMgr.getAllocatedPoolCapacity(poolVO, tmpl);
            }
        }
        if (volume.getState() != Volume.State.Ready) {
            totalAskingSize = totalAskingSize + getVolumeSizeIncludingHvSsReserve(volume, pool);
        }
    }

This SQL statement will show that the removed field of vm_template is not null, causing findById to return nothing:

    select vm_template.name, vm_template.removed
    from vm_instance
    join vm_template on vm_instance.vm_template_id=vm_template.id
    where vm_instance.name like '%testinstancefromtmpl1%';

    vm_template.name, vm_template.removed
    'testinstancetmp', '2014-11-04 09:21:34'
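A defensive variant of that loop would simply skip volumes whose template row has been removed. A self-contained sketch with hypothetical stand-ins for the DAO and value types (not the project's actual fix):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RemovedTemplateDemo {
    // Hypothetical stand-ins for VMTemplateVO and VolumeVO
    record Template(String name, String format) {}
    record Volume(Long templateId) {}

    // Stand-in for _templateDao: findById-style lookup returns null
    // once the template's 'removed' column is set
    static final Map<Long, Template> TEMPLATES = new HashMap<>();
    static { TEMPLATES.put(1L, new Template("testinstancetmp", "VHD")); }

    // Count volumes backed by a still-existing non-ISO template, guarding
    // against the null that crashes storagePoolHasEnoughSpace
    static long countTemplateBacked(List<Volume> volumes) {
        long count = 0;
        for (Volume v : volumes) {
            if (v.templateId() != null) {
                Template tmpl = TEMPLATES.get(v.templateId());
                if (tmpl != null && !"ISO".equals(tmpl.format())) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

With the `tmpl != null` guard, a volume pointing at deleted template id 2 is skipped instead of dereferencing null, which is the behavior the loop quoted above lacks.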
[jira] [Created] (CLOUDSTACK-7839) Unable to live migrate an instance to another host in a cluster from which the template has been deleted
Joris van Lieshout created CLOUDSTACK-7839: -- Summary: Unable to live migrate an instance to another host in a cluster from which the template has been deleted Key: CLOUDSTACK-7839 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7839 Project: CloudStack Issue Type: Bug Security Level: Public (Anyone can view this level - this is the default.) Components: Template Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0 Reporter: Joris van Lieshout Priority: Critical ACS throws an null pointer exception when you try to live migrate an instance to another host in a cluster and the template of that instance has been deleted. I have pasted the exception below. Steps to reproduce the issue: 1. create an instance from iso 2. stop the instance 3. create a template from the root volume 4. create a new instance from that template 5. leave the instance running 6. delete the template 7. try the live migrate the instance to another host in the cluster The migrate button in the web interface will not respond. 
The exception below can be found in the management-server log 2014-11-04 14:08:45,509 ERROR [cloud.api.ApiServer] (TP-Processor49:ctx-35286d62 ctx-3de77f98) unhandled exception executing api command: findHostsForMigration java.lang.NullPointerException at com.cloud.storage.StorageManagerImpl.storagePoolHasEnoughSpace(StorageManagerImpl.java:1561) at org.apache.cloudstack.storage.allocator.AbstractStoragePoolAllocator.filter(AbstractStoragePoolAllocator.java:199) at org.apache.cloudstack.storage.allocator.ClusterScopeStoragePoolAllocator.select(ClusterScopeStoragePoolAllocator.java:110) at org.apache.cloudstack.storage.allocator.AbstractStoragePoolAllocator.allocateToPool(AbstractStoragePoolAllocator.java:109) at com.cloud.server.ManagementServerImpl.findSuitablePoolsForVolumes(ManagementServerImpl.java:1250) at com.cloud.server.ManagementServerImpl.listHostsForMigrationOfVM(ManagementServerImpl.java:1150) at sun.reflect.GeneratedMethodAccessor643.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:622) at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317) at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150) at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204) at com.sun.proxy.$Proxy193.listHostsForMigrationOfVM(Unknown Source) at org.apache.cloudstack.api.command.admin.host.FindHostsForMigrationCmd.execute(FindHostsForMigrationCmd.java:75) at 
com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:161) at com.cloud.api.ApiServer.queueCommand(ApiServer.java:531) at com.cloud.api.ApiServer.handleRequest(ApiServer.java:374) at com.cloud.api.ApiServlet.processRequestInContext(ApiServlet.java:323) at com.cloud.api.ApiServlet.access$000(ApiServlet.java:53) at com.cloud.api.ApiServlet$1.run(ApiServlet.java:115) at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56) at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103) at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53) at com.cloud.api.ApiServlet.processRequest(ApiServlet.java:112) at com.cloud.api.ApiServlet.doGet(ApiServlet.java:74) at javax.servlet.http.HttpServlet.service(HttpServlet.java:617) at javax.servlet.http.HttpServlet.service(HttpServlet.java:717) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.valves.AccessLogValve.invoke(Acce
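The trace above points at StorageManagerImpl.storagePoolHasEnoughSpace dereferencing a template record that no longer exists once the template has been deleted. A minimal sketch of the defensive shape of a fix (Python, illustrative names only, not the actual CloudStack Java code):

```python
def has_enough_space(pool, volumes, template_store):
    """Return True if `pool` can hold `volumes`, tolerating deleted templates."""
    required = 0
    for vol in volumes:
        template = template_store.get(vol.get("template_id"))  # may be None
        if template is None:
            # Template was deleted: fall back to the volume's own size
            # instead of dereferencing the missing record (the NPE path).
            required += vol["size"]
        else:
            required += max(vol["size"], template["size"])
    return pool["capacity"] - pool["used"] >= required
```

The real fix would have to decide what capacity to charge for a template-less volume; falling back to the volume's own size is an assumption made here for illustration.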
[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down
[ https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125514#comment-14125514 ] Joris van Lieshout commented on CLOUDSTACK-7184: Hi, I am currently out of office and will be back Tuesday the 23rd of September. During this time I will have limited access to e-mail and might not be able to take your call. For urgent matter regarding ASR please contact int-...@schubergphilis.com instead. For Cloud IaaS matters please contact int-cl...@schubergphilis.com. Kind regards, Joris van Lieshout Schuberg Philis schubergphilis.com +31207506672 +31651428188 > HA should wait for at least 'xen.heartbeat.interval' sec before starting HA > on vm's when host is marked down > > > Key: CLOUDSTACK-7184 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Hypervisor Controller, Management Server, XenServer >Affects Versions: 4.3.0, 4.4.0, 4.5.0 > Environment: CloudStack 4.3 with XenServer 6.2 hypervisors >Reporter: Remi Bergsma >Priority: Blocker > > Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did > discover this and marked the host as down, and immediately started HA. Just > 18 seconds later the hypervisor returned and we ended up with 5 vm's that > were running on two hypervisors at the same time. > This, of course, resulted in file system corruption and the loss of the vm's. > One side of the story is why XenServer allowed this to happen (will not > bother you with this one). The CloudStack side of the story: HA should only > start after at least xen.heartbeat.interval seconds. If the host is down long > enough, the Xen heartbeat script will fence the hypervisor and prevent > corruption. If it is not down long enough, nothing should happen. 
> Logs (short): > 2014-07-25 05:03:28,596 WARN [c.c.a.m.DirectAgentAttache] > (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX) > . > 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] > (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX. Starting HA on > the VMs > . > 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager > Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = > AgentDisconnected, Host id = 505, name = mccpvmXX] > cs marks host down: 2014-07-25 05:03:31,920 > cs marks host up: 2014-07-25 05:03:49,655 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
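The proposed guard amounts to a simple time comparison: with the timestamps from the logs above, the host was back 18 seconds after being marked down, so HA should never have started. A sketch (Python; the 60-second xen.heartbeat.interval value is an assumed example, not the deployment's actual setting):

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(seconds=60)  # assumed xen.heartbeat.interval

def may_start_ha(marked_down_at, now, interval=HEARTBEAT_INTERVAL):
    # Only start HA once the host has been unreachable long enough for the
    # XenServer heartbeat script to have fenced it; before that, the host
    # may still come back with its VMs running.
    return now - marked_down_at >= interval

down = datetime(2014, 7, 25, 5, 3, 31)  # "Host is down" log entry
back = datetime(2014, 7, 25, 5, 3, 49)  # host reconnected 18 s later
```

With this check, may_start_ha(down, back) is False, matching the behaviour the report asks for.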
[jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down
[ https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105718#comment-14105718 ] Joris van Lieshout commented on CLOUDSTACK-7184: Hi, I am currently out of office and will be back Wednesday the 27th of August. During this time I will have limited access to e-mail and might not be able to take your call. For urgent matter regarding ASR please contact int-...@schubergphilis.com instead. For other urgent matter please contact one of my colleagues. Kind regards, Joris van Lieshout Schuberg Philis schubergphilis.com +31207506672 +31651428188 > HA should wait for at least 'xen.heartbeat.interval' sec before starting HA > on vm's when host is marked down > > > Key: CLOUDSTACK-7184 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Hypervisor Controller, Management Server, XenServer >Affects Versions: 4.3.0, 4.4.0, 4.5.0 > Environment: CloudStack 4.3 with XenServer 6.2 hypervisors >Reporter: Remi Bergsma >Priority: Blocker > > Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did > discover this and marked the host as down, and immediately started HA. Just > 18 seconds later the hypervisor returned and we ended up with 5 vm's that > were running on two hypervisors at the same time. > This, of course, resulted in file system corruption and the loss of the vm's. > One side of the story is why XenServer allowed this to happen (will not > bother you with this one). The CloudStack side of the story: HA should only > start after at least xen.heartbeat.interval seconds. If the host is down long > enough, the Xen heartbeat script will fence the hypervisor and prevent > corruption. If it is not down long enough, nothing should happen. 
> Logs (short): > 2014-07-25 05:03:28,596 WARN [c.c.a.m.DirectAgentAttache] > (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX) > . > 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] > (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX. Starting HA on > the VMs > . > 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager > Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = > AgentDisconnected, Host id = 505, name = mccpvmXX] > cs marks host down: 2014-07-25 05:03:31,920 > cs marks host up: 2014-07-25 05:03:49,655 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CLOUDSTACK-7319) Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to copy incremental snapshots
[ https://issues.apache.org/jira/browse/CLOUDSTACK-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093971#comment-14093971 ] Joris van Lieshout commented on CLOUDSTACK-7319: We believe Hot-fix 4 for XS62 sp1 contains a similar fix but for the sparse dd process used for the first copy of a chain. http://support.citrix.com/article/CTX140417 == begin quote == Copying a virtual disk between SRs uses the unbuffered I/O to avoid polluting the pagecache in the Control Domain (dom0). This reduces the dom0 vCPU overhead and allows the pagecache to work more effectively for other operations. == end quote == > Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to > copy incremental snapshots > --- > > Key: CLOUDSTACK-7319 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7319 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Snapshot, XenServer >Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, Future, 4.2.1, > 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1 >Reporter: Joris van Lieshout >Priority: Critical > > We noticed that the dd process was way too aggressive on Dom0 causing all kinds > of problems on a xenserver with medium workloads. > ACS uses the dd command to copy incremental snapshots to secondary storage. > This process is too heavy on Dom0 resources and even impacts DomU performance, > and can even lead to domain freezes (including Dom0) of more than a minute. > We've found that this is because the Dom0 kernel caches the read and write > operations of dd. 
> Some of the issues we have seen as a consequence of this are: > - DomU performance/freezes > - OVS freeze and not forwarding any traffic > - Including LACPDUs resulting in the bond going down > - keepalived heartbeat packets between RRVMs not being sent/received > resulting in flapping RRVM master state > - Breaking snapshot copy processes > - the xenserver heartbeat script reaching its timeout and fencing the server > - poolmaster connection loss > - ACS marking the host as down and fencing the instances even though they are > still running on the original host resulting in the same instance running on > two hosts in one cluster > - vhd corruption as a result of some of the issues mentioned above > We've developed a patch on the xenserver scripts > /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input > and output files (iflag=direct oflag=direct). > Our tests have shown that Dom0 load during snapshot copy is way lower. -- This message was sent by Atlassian JIRA (v6.2#6252)
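The patch described above amounts to adding O_DIRECT on both ends of the dd copy so the dom0 pagecache stays out of the data path. A rough sketch of how the command line changes (Python, in the spirit of the vmopsSnapshot plugin; function and argument names are illustrative, not the plugin's actual code):

```python
def build_dd_command(src_vhd, dst_vhd, block_size="2M"):
    # iflag=direct / oflag=direct keep dd's reads and writes out of the
    # dom0 pagecache, which is what starved other domains during copies.
    return [
        "dd",
        "if=%s" % src_vhd,
        "of=%s" % dst_vhd,
        "bs=%s" % block_size,
        "iflag=direct",
        "oflag=direct",
    ]
```

The command list would then be handed to the plugin's process runner; the block size shown is an assumed example.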
[jira] [Updated] (CLOUDSTACK-7319) Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to copy incremental snapshots
[ https://issues.apache.org/jira/browse/CLOUDSTACK-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris van Lieshout updated CLOUDSTACK-7319: --- Description: We noticed that the dd process was way too aggressive on Dom0 causing all kinds of problems on a xenserver with medium workloads. ACS uses the dd command to copy incremental snapshots to secondary storage. This process is too heavy on Dom0 resources and even impacts DomU performance, and can even lead to domain freezes (including Dom0) of more than a minute. We've found that this is because the Dom0 kernel caches the read and write operations of dd. Some of the issues we have seen as a consequence of this are: - DomU performance/freezes - OVS freeze and not forwarding any traffic - Including LACPDUs resulting in the bond going down - keepalived heartbeat packets between RRVMs not being sent/received resulting in flapping RRVM master state - Breaking snapshot copy processes - the xenserver heartbeat script reaching its timeout and fencing the server - poolmaster connection loss - ACS marking the host as down and fencing the instances even though they are still running on the original host resulting in the same instance running on two hosts in one cluster - vhd corruption as a result of some of the issues mentioned above We've developed a patch on the xenserver scripts /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input and output files (iflag=direct oflag=direct). Our tests have shown that Dom0 load during snapshot copy is way lower. was: We noticed that the dd process was way too aggressive on Dom0 causing all kinds of problems on a xenserver with medium workloads. ACS uses the dd command to copy incremental snapshots to secondary storage. This process is too heavy on Dom0 resources and even impacts DomU performance, and can even lead to domain freezes (including Dom0) of more than a minute. 
We've found that this is because the Dom0 kernel caches the read and write operations of dd. We've developed a patch on the xenserver scripts /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input and output files. Our tests have shown that Dom0 load during snapshot copy is way lower. I will upload the patch on review. > Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to > copy incremental snapshots > --- > > Key: CLOUDSTACK-7319 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7319 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Snapshot, XenServer >Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, Future, 4.2.1, > 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1 >Reporter: Joris van Lieshout >Priority: Critical > > We noticed that the dd process was way too aggressive on Dom0 causing all kinds > of problems on a xenserver with medium workloads. > ACS uses the dd command to copy incremental snapshots to secondary storage. > This process is too heavy on Dom0 resources and even impacts DomU performance, > and can even lead to domain freezes (including Dom0) of more than a minute. 
> Some of the issues we have seen as a consequence of this are: > - DomU performance/freezes > - OVS freeze and not forwarding any traffic > - Including LACPDUs resulting in the bond going down > - keepalived heartbeat packets between RRVMs not being sent/received > resulting in flapping RRVM master state > - Breaking snapshot copy processes > - the xenserver heartbeat script reaching its timeout and fencing the server > - poolmaster connection loss > - ACS marking the host as down and fencing the instances even though they are > still running on the original host resulting in the same instance running on > two hosts in one cluster > - vhd corruption as a result of some of the issues mentioned above > We've developed a patch on the xenserver scripts > /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input > and output files (iflag=direct oflag=direct). > Our tests have shown that Dom0 load during snapshot copy is way lower. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CLOUDSTACK-7319) Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to copy incremental snapshots
[ https://issues.apache.org/jira/browse/CLOUDSTACK-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris van Lieshout updated CLOUDSTACK-7319: --- Summary: Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to copy incremental snapshots (was: Copy Snapshot command to heavy on XenServer Dom0 resources when using dd to copy incremental snapshots) > Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to > copy incremental snapshots > --- > > Key: CLOUDSTACK-7319 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7319 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Snapshot, XenServer >Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, Future, 4.2.1, > 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1 >Reporter: Joris van Lieshout >Priority: Critical > > We noticed that the dd process was way too aggressive on Dom0 causing all kinds > of problems on a xenserver with medium workloads. > ACS uses the dd command to copy incremental snapshots to secondary storage. > This process is too heavy on Dom0 resources and even impacts DomU performance, > and can even lead to domain freezes (including Dom0) of more than a minute. > We've found that this is because the Dom0 kernel caches the read and write > operations of dd. > We've developed a patch on the xenserver scripts > /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input > and output files. > Our tests have shown that Dom0 load during snapshot copy is way lower. I will > upload the patch on review. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CLOUDSTACK-7319) Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to copy incremental snapshots
[ https://issues.apache.org/jira/browse/CLOUDSTACK-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093962#comment-14093962 ] Joris van Lieshout commented on CLOUDSTACK-7319: review https://reviews.apache.org/r/24598/ > Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to > copy incremental snapshots > --- > > Key: CLOUDSTACK-7319 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7319 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Snapshot, XenServer >Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, Future, 4.2.1, > 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1 >Reporter: Joris van Lieshout >Priority: Critical > > We noticed that the dd process was way too aggressive on Dom0 causing all kinds > of problems on a xenserver with medium workloads. > ACS uses the dd command to copy incremental snapshots to secondary storage. > This process is too heavy on Dom0 resources and even impacts DomU performance, > and can even lead to domain freezes (including Dom0) of more than a minute. > We've found that this is because the Dom0 kernel caches the read and write > operations of dd. 
> Some of the issues we have seen as a consequence of this are: > - DomU performance/freezes > - OVS freeze and not forwarding any traffic > - Including LACPDUs resulting in the bond going down > - keepalived heartbeat packets between RRVMs not being sent/received > resulting in flapping RRVM master state > - Breaking snapshot copy processes > - the xenserver heartbeat script reaching its timeout and fencing the server > - poolmaster connection loss > - ACS marking the host as down and fencing the instances even though they are > still running on the original host resulting in the same instance running on > two hosts in one cluster > - vhd corruption as a result of some of the issues mentioned above > We've developed a patch on the xenserver scripts > /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input > and output files (iflag=direct oflag=direct). > Our tests have shown that Dom0 load during snapshot copy is way lower. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CLOUDSTACK-7319) Copy Snapshot command to heavy on XenServer Dom0 resources when using dd to copy incremental snapshots
Joris van Lieshout created CLOUDSTACK-7319: -- Summary: Copy Snapshot command to heavy on XenServer Dom0 resources when using dd to copy incremental snapshots Key: CLOUDSTACK-7319 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7319 Project: CloudStack Issue Type: Bug Security Level: Public (Anyone can view this level - this is the default.) Components: Snapshot, XenServer Affects Versions: 4.2.0, 4.1.0, 4.0.2, 4.0.1, 4.0.0, 4.1.1, Future, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1 Reporter: Joris van Lieshout Priority: Critical We noticed that the dd process was way too aggressive on Dom0 causing all kinds of problems on a xenserver with medium workloads. ACS uses the dd command to copy incremental snapshots to secondary storage. This process is too heavy on Dom0 resources and even impacts DomU performance, and can even lead to domain freezes (including Dom0) of more than a minute. We've found that this is because the Dom0 kernel caches the read and write operations of dd. We've developed a patch on the xenserver scripts /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both input and output files. Our tests have shown that Dom0 load during snapshot copy is way lower. I will upload the patch on review. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CLOUDSTACK-7103) Disable in-band management of OVS on cloud_link_local_network on XenServer
Joris van Lieshout created CLOUDSTACK-7103: -- Summary: Disable in-band management of OVS on cloud_link_local_network on XenServer Key: CLOUDSTACK-7103 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7103 Project: CloudStack Issue Type: Improvement Security Level: Public (Anyone can view this level - this is the default.) Components: XenServer Affects Versions: 4.2.0, 4.1.0, 4.0.2, 4.0.1, 4.0.0, 4.1.1, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1 Reporter: Joris van Lieshout By default XenServer uses Open vSwitch and has in-band management enabled on any new network. This is not desirable for the cloud_link_local_network. This can be disabled by setting the network's other-config parameter vswitch-disable-in-band to true. -- This message was sent by Atlassian JIRA (v6.2#6252)
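On a XenServer host the setting described above can be applied with the xe CLI. A small sketch (Python wrapper; the other-config key is taken from the report, the uuid is a placeholder, and the wrapper itself is illustrative):

```python
import subprocess

def disable_inband_management(network_uuid, dry_run=True):
    # Builds the xe call that sets other-config:vswitch-disable-in-band=true
    # on the given network (e.g. the cloud_link_local_network).
    cmd = [
        "xe", "network-param-set",
        "uuid=%s" % network_uuid,
        "other-config:vswitch-disable-in-band=true",
    ]
    if not dry_run:
        subprocess.check_call(cmd)  # only works on the XenServer host itself
    return cmd
```

In dry-run mode the function just returns the command list, which makes the sketch easy to inspect without a hypervisor at hand.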
[jira] [Commented] (CLOUDSTACK-6308) when executing createNetwork as ROOT for a subdomain/account it checks for network overlap in all subdomains/accounts
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006931#comment-14006931 ] Joris van Lieshout commented on CLOUDSTACK-6308: I've tried reproducing this issue in 4.3 but have not been able to, so it seems resolved. I'll close this bug for now and reopen if needed. > when executing createNetwork as ROOT for a subdomain/account it checks for > network overlap in all subdomains/accounts > - > > Key: CLOUDSTACK-6308 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6308 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: API >Affects Versions: 4.2.0, Future, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1 >Reporter: Joris van Lieshout > Fix For: 4.3.0 > > > When executing createNetwork with an account from the ROOT domain with a > domainid and account specified of a subdomain/account the error below is > thrown when the ip range overlaps with a network of another subdomain. > errorCode: 431, errorText:The IP range has already been added with gateway > 192.168.150.1 ,and netmask 255.255.255.0, Please specify the gateway/netmask > if you want to extend ip range > scenario: > ROOT/ACME has network 192.168.150.0/24 gw 192.168.150.1 > exec createNetwork for ROOT/MEGACORP account johndoe network 192.168.150.0/24 > gw 192.168.150.1 with ROOT domain credentials. > workaround for now: > execute createNetwork with credentials from domain MEGACORP and account > johndoe. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (CLOUDSTACK-6308) when executing createNetwork as ROOT for a subdomain/account it checks for network overlap in all subdomains/accounts
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris van Lieshout closed CLOUDSTACK-6308. -- Resolution: Cannot Reproduce Fix Version/s: 4.3.0 > when executing createNetwork as ROOT for a subdomain/account it checks for > network overlap in all subdomains/accounts > - > > Key: CLOUDSTACK-6308 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6308 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: API >Affects Versions: 4.2.0, Future, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1 >Reporter: Joris van Lieshout > Fix For: 4.3.0 > > > When executing createNetwork with an account from the ROOT domain with a > domainid and account specified of a subdomain/account the error below is > thrown when the ip range overlaps with a network of another subdomain. > errorCode: 431, errorText:The IP range has already been added with gateway > 192.168.150.1 ,and netmask 255.255.255.0, Please specify the gateway/netmask > if you want to extend ip range > scenario: > ROOT/ACME has network 192.168.150.0/24 gw 192.168.150.1 > exec createNetwork for ROOT/MEGACORP account johndoe network 192.168.150.0/24 > gw 192.168.150.1 with ROOT domain credentials. > workaround for now: > execute createNetwork with credentials from domain MEGACORP and account > johndoe. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CLOUDSTACK-6308) when executing createNetwork as ROOT for a subdomain/account it checks for network overlap in all subdomains/accounts
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris van Lieshout updated CLOUDSTACK-6308: --- Priority: Major (was: Critical) > when executing createNetwork as ROOT for a subdomain/account it checks for > network overlap in all subdomains/accounts > - > > Key: CLOUDSTACK-6308 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6308 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: API >Affects Versions: 4.2.0, Future, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1 >Reporter: Joris van Lieshout > > When executing createNetwork with an account from the ROOT domain with a > domainid and account specified of a subdomain/account the error below is > thrown when the ip range overlaps with a network of another subdomain. > errorCode: 431, errorText:The IP range has already been added with gateway > 192.168.150.1 ,and netmask 255.255.255.0, Please specify the gateway/netmask > if you want to extend ip range > scenario: > ROOT/ACME has network 192.168.150.0/24 gw 192.168.150.1 > exec createNetwork for ROOT/MEGACORP account johndoe network 192.168.150.0/24 > gw 192.168.150.1 with ROOT domain credentials. > workaround for now: > execute createNetwork with credentials from domain MEGACORP and account > johndoe. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CLOUDSTACK-6308) when executing createNetwork as ROOT for a subdomain/account it checks for network overlap in all subdomains/accounts
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006920#comment-14006920 ] Joris van Lieshout commented on CLOUDSTACK-6308: This issue still exists and as far as I know has not yet been fixed. I will poke the dev list to see if anyone can have a look. > when executing createNetwork as ROOT for a subdomain/account it checks for > network overlap in all subdomains/accounts > - > > Key: CLOUDSTACK-6308 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6308 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: API >Affects Versions: 4.2.0, Future, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1 >Reporter: Joris van Lieshout >Priority: Critical > > When executing createNetwork with an account from the ROOT domain with a > domainid and account specified of a subdomain/account the error below is > thrown when the ip range overlaps with a network of another subdomain. > errorCode: 431, errorText:The IP range has already been added with gateway > 192.168.150.1 ,and netmask 255.255.255.0, Please specify the gateway/netmask > if you want to extend ip range > scenario: > ROOT/ACME has network 192.168.150.0/24 gw 192.168.150.1 > exec createNetwork for ROOT/MEGACORP account johndoe network 192.168.150.0/24 > gw 192.168.150.1 with ROOT domain credentials. > workaround for now: > execute createNetwork with credentials from domain MEGACORP and account > johndoe. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CLOUDSTACK-6751) conntrackd stats logging is enabled by default and fills up /var
Joris van Lieshout created CLOUDSTACK-6751: -- Summary: conntrackd stats logging is enabled by default and fills up /var Key: CLOUDSTACK-6751 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6751 Project: CloudStack Issue Type: Bug Security Level: Public (Anyone can view this level - this is the default.) Components: SystemVM Affects Versions: 4.3.0 Reporter: Joris van Lieshout The conntrackd package has a bug: the comment in the default config file states that stats logging is disabled by default, but the config parameter is set to on. The consequence for ACS is that a conntrackd-stats.log file is created during the build of the svm. This logfile gets rotated by logrotate, which has a post action to restart conntrackd, even if the svm is not a redundant router. On vpc routers for instance the stats logging file can grow quickly and fill up the /var volume, killing the vm. -- This message was sent by Atlassian JIRA (v6.2#6252)
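The corresponding config fix is to flip LogFile to off inside the Stats section only, leaving general conntrackd logging alone. A sketch (Python; the Stats/LogFile names follow conntrackd.conf, but the regex is a simplification of real config parsing):

```python
import re

def disable_stats_logging(conf_text):
    # Replace "LogFile on" with "LogFile off", but only within the
    # Stats { ... } block, so the General section's LogFile is untouched.
    return re.sub(
        r"Stats\s*\{[^}]*\}",
        lambda m: m.group(0).replace("LogFile on", "LogFile off"),
        conf_text,
    )
```

A real deployment would patch the template's /etc/conntrackd/conntrackd.conf with this kind of transformation (or ship a corrected file outright).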
[jira] [Commented] (CLOUDSTACK-6716) /usr has been sized too small and ends up being 100% full on SSVM and CVM
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003374#comment-14003374 ] Joris van Lieshout commented on CLOUDSTACK-6716: Created review request https://reviews.apache.org/r/21696/ > /usr has been sized too small and ends up being 100% full on SSVM and CVM > > > Key: CLOUDSTACK-6716 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6716 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: SystemVM >Affects Versions: Future, 4.3.0, 4.4.0 >Reporter: Joris van Lieshout > > The systemvmtemplate for 4.3 and 4.4 has a too small /usr volume and ends up > 100% full on Secondary Storage VMs and Console VMs. > root@v-xxx-VM:~# df -h > Filesystem Size Used Avail Use% > Mounted on > rootfs 276M 144M 118M 55% > / > udev 10M 0 10M 0% > /dev > tmpfs 100M 156K 100M 1% > /run > /dev/disk/by-uuid/0721ecee-214a-4143-8d88-a4075cc2cd89 276M 144M 118M 55% > / > tmpfs 5.0M 0 5.0M 0% > /run/lock > tmpfs 314M 0 314M 0% > /run/shm > /dev/xvda1 45M 22M 21M 51% > /boot > /dev/xvda6 98M 5.6M 88M 6% > /home > /dev/xvda8 368M 11M 339M 3% > /opt > /dev/xvda10 63M 5.3M 55M 9% > /tmp > /dev/xvda7 610M 584M 0 100% > /usr > /dev/xvda9 415M 316M 78M 81% > /var -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CLOUDSTACK-6716) /usr has been sized too small and ends up being 100% full on SSVM and CVM
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003050#comment-14003050 ] Joris van Lieshout commented on CLOUDSTACK-6716: I already have a solution for this. Will submit the patch on review board today. > /usr has been sized too small and ends up being 100% full on SSVM and CVM > > > Key: CLOUDSTACK-6716 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6716 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: SystemVM >Affects Versions: Future, 4.3.0, 4.4.0 >Reporter: Joris van Lieshout > > The systemvmtemplate for 4.3 and 4.4 has a too small /usr volume and ends up > 100% full on Secondary Storage VMs and Console VMs. > root@v-xxx-VM:~# df -h > Filesystem Size Used Avail Use% > Mounted on > rootfs 276M 144M 118M 55% > / > udev 10M 0 10M 0% > /dev > tmpfs 100M 156K 100M 1% > /run > /dev/disk/by-uuid/0721ecee-214a-4143-8d88-a4075cc2cd89 276M 144M 118M 55% > / > tmpfs 5.0M 0 5.0M 0% > /run/lock > tmpfs 314M 0 314M 0% > /run/shm > /dev/xvda1 45M 22M 21M 51% > /boot > /dev/xvda6 98M 5.6M 88M 6% > /home > /dev/xvda8 368M 11M 339M 3% > /opt > /dev/xvda10 63M 5.3M 55M 9% > /tmp > /dev/xvda7 610M 584M 0 100% > /usr > /dev/xvda9 415M 316M 78M 81% > /var -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CLOUDSTACK-6716) /usr has been sized too small and ends up being 100% full on SSVM and CVM
Joris van Lieshout created CLOUDSTACK-6716:
----------------------------------------------

             Summary: /usr has been sized too small and ends up being 100% full on SSVM and CVM
                 Key: CLOUDSTACK-6716
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6716
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
          Components: SystemVM
    Affects Versions: Future, 4.3.0, 4.4.0
            Reporter: Joris van Lieshout


The systemvmtemplate for 4.3 and 4.4 has a too small /usr volume and ends up 100% full on Secondary Storage VMs and Console VMs.

root@v-xxx-VM:~# df -h
Filesystem                                              Size  Used Avail Use% Mounted on
rootfs                                                  276M  144M  118M  55% /
udev                                                     10M     0   10M   0% /dev
tmpfs                                                   100M  156K  100M   1% /run
/dev/disk/by-uuid/0721ecee-214a-4143-8d88-a4075cc2cd89  276M  144M  118M  55% /
tmpfs                                                   5.0M     0  5.0M   0% /run/lock
tmpfs                                                   314M     0  314M   0% /run/shm
/dev/xvda1                                               45M   22M   21M  51% /boot
/dev/xvda6                                               98M  5.6M   88M   6% /home
/dev/xvda8                                              368M   11M  339M   3% /opt
/dev/xvda10                                              63M  5.3M   55M   9% /tmp
/dev/xvda7                                              610M  584M     0 100% /usr
/dev/xvda9                                              415M  316M   78M  81% /var

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Created] (CLOUDSTACK-6308) when executing createNetwork as ROOT for a subdomain/account it checks for network overlap in all subdomains/accounts
Joris van Lieshout created CLOUDSTACK-6308:
----------------------------------------------

             Summary: when executing createNetwork as ROOT for a subdomain/account it checks for network overlap in all subdomains/accounts
                 Key: CLOUDSTACK-6308
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6308
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
          Components: API
    Affects Versions: 4.2.0, Future, 4.2.1, 4.3.0, 4.4.0, 4.5.0, 4.3.1
            Reporter: Joris van Lieshout
            Priority: Critical


When createNetwork is executed by an account from the ROOT domain, with the domainid and account of a subdomain/account specified, the error below is thrown whenever the ip range overlaps with a network of another subdomain:

errorCode: 431, errorText: The IP range has already been added with gateway 192.168.150.1 and netmask 255.255.255.0. Please specify the gateway/netmask if you want to extend the ip range.

Scenario:
- ROOT/ACME has network 192.168.150.0/24, gw 192.168.150.1
- exec createNetwork for ROOT/MEGACORP account johndoe, network 192.168.150.0/24, gw 192.168.150.1, with ROOT domain credentials

Workaround for now: execute createNetwork with credentials from domain MEGACORP and account johndoe.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
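The overlap test itself is simple; the bug reported above is the scope it is applied over. As a rough illustration (plain Java, not CloudStack's actual network code), a check like the one below could be run only against networks owned by the target domain/account instead of against every subdomain:

```java
// Illustrative sketch only: an IPv4 CIDR overlap test that a scoped
// createNetwork check could apply per owner, rather than globally.
final class CidrOverlap {

    // Returns true when two IPv4 CIDR blocks share at least one address.
    public static boolean overlaps(String cidrA, String cidrB) {
        long[] a = parse(cidrA);
        long[] b = parse(cidrB);
        // Two ranges overlap iff each starts before the other ends.
        return a[0] <= b[1] && b[0] <= a[1];
    }

    // Parses "192.168.150.0/24" into {firstAddress, lastAddress} as longs.
    private static long[] parse(String cidr) {
        String[] parts = cidr.split("/");
        String[] octets = parts[0].split("\\.");
        long ip = 0;
        for (String o : octets) {
            ip = (ip << 8) | Long.parseLong(o);
        }
        int prefix = Integer.parseInt(parts[1]);
        long mask = prefix == 0 ? 0 : (~0L << (32 - prefix)) & 0xFFFFFFFFL;
        long first = ip & mask;
        long last = first | (~mask & 0xFFFFFFFFL);
        return new long[] { first, last };
    }
}
```

With a per-owner scope, ROOT/ACME's 192.168.150.0/24 and ROOT/MEGACORP's 192.168.150.0/24 would never be compared against each other, which is exactly the behaviour the workaround (calling the API with MEGACORP credentials) already produces.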
[jira] [Created] (CLOUDSTACK-6223) removeNicFromVirtualMachine fails if another instance in another domain has a nic with the same ip and a forwarding rule configured on it
Joris van Lieshout created CLOUDSTACK-6223:
----------------------------------------------

             Summary: removeNicFromVirtualMachine fails if another instance in another domain has a nic with the same ip and a forwarding rule configured on it
                 Key: CLOUDSTACK-6223
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6223
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
    Affects Versions: 4.2.1
            Reporter: Joris van Lieshout
            Priority: Blocker


When removeNicFromVirtualMachine is called for a nic on an instance, the code below is evaluated. This piece of code searches for portforwarding rules across all domains. If another instance exists that has a nic with the same ip and a forwarding rule, the search returns more than one rule and the removeNicFromVirtualMachine call fails.

server/src/com/cloud/network/rules/RulesManagerImpl.java

    @Override
    public List<FirewallRuleVO> listAssociatedRulesForGuestNic(Nic nic) {
        List<FirewallRuleVO> result = new ArrayList<FirewallRuleVO>();
        // add PF rules
        result.addAll(_portForwardingDao.listByDestIpAddr(nic.getIp4Address()));
        // add static NAT rules

Stack trace:

2014-03-11 15:24:04,944 ERROR [cloud.async.AsyncJobManagerImpl] (Job-Executor-102:job-193607 = [ 30e81de3-2a00-49f2-8d80-545a765e4c1e ]) Unexpected exception while executing org.apache.cloudstack.api.command.user.vm.RemoveNicFromVMCmd
com.cloud.utils.exception.CloudRuntimeException: Failed to remove nic from VM[User|zzz1] in Ntwk[994|Guest|14], nic has associated Port forwarding or Load balancer or Static NAT rules.
	at com.cloud.vm.VirtualMachineManagerImpl.removeNicFromVm(VirtualMachineManagerImpl.java:3058)
	at com.cloud.vm.UserVmManagerImpl.removeNicFromVirtualMachine(UserVmManagerImpl.java:1031)
	at org.apache.cloudstack.api.command.user.vm.RemoveNicFromVMCmd.execute(RemoveNicFromVMCmd.java:103)
	at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:158)
	at com.cloud.async.AsyncJobManagerImpl$1.run(AsyncJobManagerImpl.java:531)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:679)

--
This message was sent by Atlassian JIRA
(v6.2#6252)
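The lookup in listAssociatedRulesForGuestNic keys on the destination ip alone. A sketch of the fix direction (hypothetical names, not the actual CloudStack DAO API) is to match on the nic's network as well, so an identical guest ip in another domain's isolated network no longer counts:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: filter forwarding rules by (destination ip, network id)
// instead of destination ip alone, as listByDestIpAddr effectively does.
final class NicRuleFilter {

    // Minimal stand-in for a port forwarding rule row.
    static final class Rule {
        final String destIp;
        final long networkId;

        Rule(String destIp, long networkId) {
            this.destIp = destIp;
            this.networkId = networkId;
        }
    }

    // Keeps only rules that target this nic's ip inside this nic's own
    // network, so rules in other domains' overlapping networks are ignored.
    static List<Rule> rulesForNic(List<Rule> allRules, String nicIp, long nicNetworkId) {
        List<Rule> result = new ArrayList<>();
        for (Rule r : allRules) {
            if (r.destIp.equals(nicIp) && r.networkId == nicNetworkId) {
                result.add(r);
            }
        }
        return result;
    }
}
```

Under this scoping, the nic in Ntwk[994] from the stack trace would only be blocked by rules configured in network 994, not by a rule on the same ip in some other tenant's network.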
[jira] [Commented] (CLOUDSTACK-6195) an ACS db upgraded from Pre-4.0 version is missing unique key constraint on host_details
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919341#comment-13919341 ]

Joris van Lieshout commented on CLOUDSTACK-6195:
------------------------------------------------

Hi Wei Zhou,

We were really looking forward to 4.x, I guess. :) Anyway, this explains the issue. We've already fixed the constraint and will be doing a schema compare to make sure this was the only discrepancy. I've created this ticket as a courtesy just in case anyone else would run into this. Good to hear we're probably the only one. :) For me this is case closed as "non-issue". Thanks again!

> an ACS db upgraded from Pre-4.0 version is missing unique key constraint on host_details
> ----------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-6195
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6195
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>          Components: Upgrade
>    Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, 4.2.1, 4.1.2
>         Environment: Pre-4.0 db upgraded to 4.x. We have confirmed this bug in a db that started out as 2.2.14.
>            Reporter: Joris van Lieshout
>
> This is the table in our 4.2.1 env that has been upgraded from 2.2.14.
> CREATE TABLE `host_details` (
>   `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
>   `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
>   `name` varchar(255) NOT NULL,
>   `value` varchar(255) NOT NULL,
>   PRIMARY KEY (`id`),
>   KEY `fk_host_details__host_id` (`host_id`),
>   CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES `host` (`id`) ON DELETE CASCADE
> ) ENGINE=InnoDB AUTO_INCREMENT=752966 DEFAULT CHARSET=utf8;
>
> And this is the table of a fresh 4.x install:
>
> CREATE TABLE `host_details` (
>   `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
>   `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
>   `name` varchar(255) NOT NULL,
>   `value` varchar(255) NOT NULL,
>   PRIMARY KEY (`id`),
>   UNIQUE KEY `uk_host_id_name` (`host_id`,`name`),
>   KEY `fk_host_details__host_id` (`host_id`),
>   CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES `host` (`id`) ON DELETE CASCADE
> ) ENGINE=InnoDB AUTO_INCREMENT=242083 DEFAULT CHARSET=utf8;
>
> The effect of this missing constraint is a lot of duplicate entries in the host_details table. The duplicate information in the host_details table causes the api call listHosts to return the same host tag multiple times (to be exact: the number of duplicate entries in the host_details table for that host).

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (CLOUDSTACK-6195) an ACS db upgraded from Pre-4.0 version is missing unique key constraint on host_details
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919294#comment-13919294 ]

Joris van Lieshout commented on CLOUDSTACK-6195:
------------------------------------------------

Hi Wei Zhou,

Thank you for having a look. If I check the schema-create script of 2.2.14 (https://github.com/CloudStack-extras/CloudStack-archive/blob/2.2.14/setup/db/create-schema.sql) I see that the constraint is not there. I will check the scripts of 3.0.0, 3.0.1 and 3.0.2 as well and update this ticket. Our upgrade path up until 4.0 is the same as yours.

1   2.2.14.20120210102939   2012-03-20 19:46:38   Complete
2   3.0.0                   2012-06-22 12:48:19   Complete
3   3.0.1                   2012-06-22 12:48:19   Complete
4   3.0.2                   2012-06-22 12:48:19   Complete
7   4.0.0                   2012-08-21 13:00:14   Complete
9   4.0.1                   2013-02-13 12:36:24   Complete
11  4.0.2                   2013-04-23 07:21:08   Complete
13  4.1.0                   2013-07-16 09:43:23   Complete
15  4.1.1                   2013-07-16 09:43:23   Complete
17  4.2.0                   2013-12-18 09:38:25   Complete
19  4.2.1                   2013-12-18 09:38:25   Complete

> an ACS db upgraded from Pre-4.0 version is missing unique key constraint on host_details
> ----------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-6195
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6195
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>          Components: Upgrade
>    Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, 4.2.1, 4.1.2
>         Environment: Pre-4.0 db upgraded to 4.x. We have confirmed this bug in a db that started out as 2.2.14.
>            Reporter: Joris van Lieshout
>
> This is the table in our 4.2.1 env that has been upgraded from 2.2.14.
[jira] [Updated] (CLOUDSTACK-6195) an ACS db upgraded from Pre-4.0 version is missing unique key constraint on host_details
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris van Lieshout updated CLOUDSTACK-6195: --- Description: This is the table in our 4.2.1 env that has been upgraded from 2.2.14. CREATE TABLE `host_details` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id', `name` varchar(255) NOT NULL, `value` varchar(255) NOT NULL, PRIMARY KEY (`id`), KEY `fk_host_details__host_id` (`host_id`), CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES `host` (`id`) ON DELETE CASCADE ) ENGINE=InnoDB AUTO_INCREMENT=752966 DEFAULT CHARSET=utf8; And this is the table of a fresh 4.x install: CREATE TABLE `host_details` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id', `name` varchar(255) NOT NULL, `value` varchar(255) NOT NULL, PRIMARY KEY (`id`), UNIQUE KEY `uk_host_id_name` (`host_id`,`name`), KEY `fk_host_details__host_id` (`host_id`), CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES `host` (`id`) ON DELETE CASCADE ) ENGINE=InnoDB AUTO_INCREMENT=242083 DEFAULT CHARSET=utf8; The effect of this missing bug is a lot of duplicate entries in the host_details table. The duplicate information on the host_details table causes the api call listHosts to return the same host tag multiple time (to be exact: the number of duplicate entries in the host_details table for that host). was: This is the table in our 4.2.1 env that has been upgraded from 2.2.14. 
CREATE TABLE `host_details` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id', `name` varchar(255) NOT NULL, `value` varchar(255) NOT NULL, PRIMARY KEY (`id`), KEY `fk_host_details__host_id` (`host_id`), CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES `host` (`id`) ON DELETE CASCADE ) ENGINE=InnoDB AUTO_INCREMENT=752966 DEFAULT CHARSET=utf8; And this is the table of a fresh 4.x install: CREATE TABLE `host_details` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id', `name` varchar(255) NOT NULL, `value` varchar(255) NOT NULL, PRIMARY KEY (`id`), UNIQUE KEY `uk_host_id_name` (`host_id`,`name`), KEY `fk_host_details__host_id` (`host_id`), CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES `host` (`id`) ON DELETE CASCADE ) ENGINE=InnoDB AUTO_INCREMENT=242083 DEFAULT CHARSET=utf8; > an ACS db upgraded from Pre-4.0 version is missing unique key constraint on > host_details > > > Key: CLOUDSTACK-6195 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6195 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Upgrade >Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, 4.2.1, 4.1.2 > Environment: Pre-4.0 db upgraded to 4.x. We have confirmed this bug > in a db that started out as 2.2.14. >Reporter: Joris van Lieshout > > This is the table in our 4.2.1 env that has been upgraded from 2.2.14. 
[jira] [Created] (CLOUDSTACK-6195) an ACS db upgraded from Pre-4.0 version is missing unique key constraint on host_details
Joris van Lieshout created CLOUDSTACK-6195:
----------------------------------------------

             Summary: an ACS db upgraded from Pre-4.0 version is missing unique key constraint on host_details
                 Key: CLOUDSTACK-6195
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6195
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
          Components: Upgrade
    Affects Versions: 4.2.0, 4.1.0, 4.0.2, 4.0.1, 4.0.0, 4.1.1, 4.2.1, 4.1.2
         Environment: Pre-4.0 db upgraded to 4.x. We have confirmed this bug in a db that started out as 2.2.14.
            Reporter: Joris van Lieshout


This is the table in our 4.2.1 env that has been upgraded from 2.2.14:

CREATE TABLE `host_details` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
  `name` varchar(255) NOT NULL,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `fk_host_details__host_id` (`host_id`),
  CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES `host` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=752966 DEFAULT CHARSET=utf8;

And this is the table of a fresh 4.x install:

CREATE TABLE `host_details` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `host_id` bigint(20) unsigned NOT NULL COMMENT 'host id',
  `name` varchar(255) NOT NULL,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_host_id_name` (`host_id`,`name`),
  KEY `fk_host_details__host_id` (`host_id`),
  CONSTRAINT `fk_host_details__host_id` FOREIGN KEY (`host_id`) REFERENCES `host` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=242083 DEFAULT CHARSET=utf8;

--
This message was sent by Atlassian JIRA
(v6.2#6252)
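The invariant the fresh-install schema enforces with uk_host_id_name is "at most one host_details row per (host_id, name) pair". As a small illustration (plain Java, not CloudStack code), this is the check an upgraded db would fail, and each violation it counts is one extra copy of the same host tag that listHosts ends up returning:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: count rows that violate the (host_id, name)
// uniqueness that uk_host_id_name guarantees on a fresh 4.x install.
final class HostDetailsCheck {

    // Each row is {host_id, name}; returns the number of surplus duplicates.
    static int countDuplicates(List<String[]> rows) {
        Set<String> seen = new HashSet<>();
        int duplicates = 0;
        for (String[] row : rows) {
            // '\u0000' cannot occur in either column, so it is a safe separator.
            if (!seen.add(row[0] + "\u0000" + row[1])) {
                duplicates++;
            }
        }
        return duplicates;
    }
}
```

Before the missing unique key can be added back to an upgraded db, rows counted here have to be de-duplicated first, or the index creation will fail.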
[jira] [Commented] (CLOUDSTACK-6023) Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892005#comment-13892005 ]

Joris van Lieshout commented on CLOUDSTACK-6023:
------------------------------------------------

We will be installing on our test env a custom build of 4.2.1 that has a max of 16 today. I should be able to answer your question in a couple of days. Theoretically, however, looking at the current size of the POST and the number of instances with vcpumax=32, setting it to 16 will make a big difference.

> Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits
> -----------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-6023
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6023
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>          Components: XenServer
>    Affects Versions: Future, 4.2.1, 4.3.0
>            Reporter: Joris van Lieshout
>            Priority: Blocker
>         Attachments: xentop.png
>
> CitrixResourceBase.java contains a hardcoded value for vcpusmax for non windows instances:
>
>     if (guestOsTypeName.toLowerCase().contains("windows")) {
>         vmr.VCPUsMax = (long) vmSpec.getCpus();
>     } else {
>         vmr.VCPUsMax = 32L;
>     }
>
> For all currently available versions of XenServer the limit is 16 vcpus:
> http://support.citrix.com/servlet/KbServlet/download/28909-102-664115/XenServer-6.0-Configuration-Limits.pdf
> http://support.citrix.com/servlet/KbServlet/download/32312-102-704653/CTX134789%20-%20XenServer%206.1.0_Configuration%20Limits.pdf
> http://support.citrix.com/servlet/KbServlet/download/34966-102-706122/CTX137837_XenServer%206_2_0_Configuration%20Limits.pdf
>
> In addition there seems to be a limit to the total amount of assigned vcpus on a XenServer.
> The impact of this bug is that xapi becomes unstable and keeps losing its master_connection because the POST to /remote_db_access is bigger than its limit of 200K. This basically renders a pool slave unmanageable.
> If you would look at the running instances using xentop you will see hosts > reporting with 32 vcpus > Below the relevant portion of the xensource.log that shows the effect of the > bug: > [20140204T13:52:17.264Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: Using commandline: > /usr/sbin/stunnel -fd f3b8bb12-4e03-b47a-0dc5-85ad5aef79e6 > [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: stunnel has pidty: (FEFork > (43,30540)) > [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: stunnel start > [20140204T13:52:17.269Z| info|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel connected pid=30540 fd=40 > [20140204T13:52:17.346Z|error|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] Received HTTP error 500 ({ method = POST; > uri = /remote_db_access; query = [ ]; content_length = [ 315932 ]; transfer > encoding = ; version = 1.1; cookie = [ > pool_secret=386bbf39-8710-4d2d-f452-9725d79c2393/aa7bcda9-8ebb-0cef-bb77-c6b496c5d859/1f928d82-7a20-9117-dd30-f96c7349b16e > ]; task = ; subtask_of = ; content-type = ; user_agent = xapi/1.9 }) from > master. This suggests our master address is wrong. Sleeping for 60s and then > restarting. > [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Caught Master_connection.Goto_handler > [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Connection to master died. I will continue > to retry indefinitely (supressing future logging of this message). > [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Connection to master died. I will continue > to retry indefinitely (supressing future logging of this message). 
> [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Sleeping 2.00 seconds before retrying > master connection... > [20140204T13:53:20.627Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] stunnel: Using commandline: > /usr/sbin/stunnel -fd 3c8aed8e-1fce-be7c-09f8-b45cdc40a1f5 > [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] stunnel: stunnel has pidty: (FEFork > (23,31207)) > [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] stunnel: stunnel start > [20140204T13:53:20.632Z| info|xense
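A sketch of the direction the report argues for (illustrative only, not the shipped patch): replace the hardcoded `vmr.VCPUsMax = 32L` with the documented per-VM limit of the hypervisor:

```java
// Illustrative sketch: clamp the non-Windows VCPUs-max to the hypervisor's
// documented limit instead of the hardcoded 32L in CitrixResourceBase.
final class VcpuMaxPolicy {

    // Per the Citrix configuration-limit documents cited in the report,
    // currently released XenServer versions support at most 16 vcpus per VM.
    static final long XENSERVER_VCPU_LIMIT = 16L;

    static long vcpusMax(boolean isWindowsGuest, int requestedCpus) {
        if (requestedCpus > XENSERVER_VCPU_LIMIT) {
            throw new IllegalArgumentException(
                "offering requests more vcpus than the hypervisor supports");
        }
        // Windows guests keep the original behaviour (no hot-add headroom);
        // other guests get headroom up to the hypervisor limit, never 32.
        return isWindowsGuest ? requestedCpus : XENSERVER_VCPU_LIMIT;
    }
}
```

This mirrors the "custom build of 4.2.1 that has a max of 16" mentioned in the comment above, and also shrinks the per-VM vcpu stats that inflate the /remote_db_access POST.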
[jira] [Commented] (CLOUDSTACK-6023) Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891982#comment-13891982 ] Joris van Lieshout commented on CLOUDSTACK-6023: That is a good idea. Nice solution. > Non windows instances are created on XenServer with a vcpu-max above > supported xenserver limits > --- > > Key: CLOUDSTACK-6023 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6023 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: XenServer >Affects Versions: Future, 4.2.1, 4.3.0 >Reporter: Joris van Lieshout >Priority: Blocker > Attachments: xentop.png > > > CitrixResourceBase.java contains a hardcoded value for vcpusmax for non > windows instances: > if (guestOsTypeName.toLowerCase().contains("windows")) { > vmr.VCPUsMax = (long) vmSpec.getCpus(); > } else { > vmr.VCPUsMax = 32L; > } > For all currently available versions of XenServer the limit is 16vcpus: > http://support.citrix.com/servlet/KbServlet/download/28909-102-664115/XenServer-6.0-Configuration-Limits.pdf > http://support.citrix.com/servlet/KbServlet/download/32312-102-704653/CTX134789%20-%20XenServer%206.1.0_Configuration%20Limits.pdf > http://support.citrix.com/servlet/KbServlet/download/34966-102-706122/CTX137837_XenServer%206_2_0_Configuration%20Limits.pdf > In addition there seems to be a limit to the total amount of assigned vpcus > on a XenServer. > The impact of this bug is that xapi becomes unstable and keeps losing it's > master_connection because the POST to the /remote_db_access is bigger then > it's limit of 200K. This basically renders a pool slave unmanageable. 
> If you would look at the running instances using xentop you will see hosts > reporting with 32 vcpus > Below the relevant portion of the xensource.log that shows the effect of the > bug: > [20140204T13:52:17.264Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: Using commandline: > /usr/sbin/stunnel -fd f3b8bb12-4e03-b47a-0dc5-85ad5aef79e6 > [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: stunnel has pidty: (FEFork > (43,30540)) > [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: stunnel start > [20140204T13:52:17.269Z| info|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel connected pid=30540 fd=40 > [20140204T13:52:17.346Z|error|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] Received HTTP error 500 ({ method = POST; > uri = /remote_db_access; query = [ ]; content_length = [ 315932 ]; transfer > encoding = ; version = 1.1; cookie = [ > pool_secret=386bbf39-8710-4d2d-f452-9725d79c2393/aa7bcda9-8ebb-0cef-bb77-c6b496c5d859/1f928d82-7a20-9117-dd30-f96c7349b16e > ]; task = ; subtask_of = ; content-type = ; user_agent = xapi/1.9 }) from > master. This suggests our master address is wrong. Sleeping for 60s and then > restarting. > [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Caught Master_connection.Goto_handler > [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Connection to master died. I will continue > to retry indefinitely (supressing future logging of this message). > [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Connection to master died. I will continue > to retry indefinitely (supressing future logging of this message). 
> [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Sleeping 2.00 seconds before retrying > master connection... > [20140204T13:53:20.627Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] stunnel: Using commandline: > /usr/sbin/stunnel -fd 3c8aed8e-1fce-be7c-09f8-b45cdc40a1f5 > [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] stunnel: stunnel has pidty: (FEFork > (23,31207)) > [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] stunnel: stunnel start > [20140204T13:53:20.632Z| info|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] stunnel connected pid=31207 fd=20 > [20140204T13:53:28.874Z|error|xenserverhost1|4 > unix-RPC|session.login_with_password D:2e7664ad69ed|master_connection] Caught > Master_connection.Goto_ha
[jira] [Updated] (CLOUDSTACK-6023) Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris van Lieshout updated CLOUDSTACK-6023: --- Attachment: xentop.png > Non windows instances are created on XenServer with a vcpu-max above > supported xenserver limits > --- > > Key: CLOUDSTACK-6023 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6023 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: XenServer >Affects Versions: Future, 4.2.1, 4.3.0 >Reporter: Joris van Lieshout >Priority: Blocker > Attachments: xentop.png > > > CitrixResourceBase.java contains a hardcoded value for vcpusmax for non > windows instances: > if (guestOsTypeName.toLowerCase().contains("windows")) { > vmr.VCPUsMax = (long) vmSpec.getCpus(); > } else { > vmr.VCPUsMax = 32L; > } > For all currently available versions of XenServer the limit is 16vcpus: > http://support.citrix.com/servlet/KbServlet/download/28909-102-664115/XenServer-6.0-Configuration-Limits.pdf > http://support.citrix.com/servlet/KbServlet/download/32312-102-704653/CTX134789%20-%20XenServer%206.1.0_Configuration%20Limits.pdf > http://support.citrix.com/servlet/KbServlet/download/34966-102-706122/CTX137837_XenServer%206_2_0_Configuration%20Limits.pdf > In addition there seems to be a limit to the total amount of assigned vpcus > on a XenServer. > The impact of this bug is that xapi becomes unstable and keeps losing it's > master_connection because the POST to the /remote_db_access is bigger then > it's limit of 200K. This basically renders a pool slave unmanageable. 
> If you would look at the running instances using xentop you will see hosts > reporting with 32 vcpus > Below the relevant portion of the xensource.log that shows the effect of the > bug: > [20140204T13:52:17.264Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: Using commandline: > /usr/sbin/stunnel -fd f3b8bb12-4e03-b47a-0dc5-85ad5aef79e6 > [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: stunnel has pidty: (FEFork > (43,30540)) > [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: stunnel start > [20140204T13:52:17.269Z| info|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel connected pid=30540 fd=40 > [20140204T13:52:17.346Z|error|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] Received HTTP error 500 ({ method = POST; > uri = /remote_db_access; query = [ ]; content_length = [ 315932 ]; transfer > encoding = ; version = 1.1; cookie = [ > pool_secret=386bbf39-8710-4d2d-f452-9725d79c2393/aa7bcda9-8ebb-0cef-bb77-c6b496c5d859/1f928d82-7a20-9117-dd30-f96c7349b16e > ]; task = ; subtask_of = ; content-type = ; user_agent = xapi/1.9 }) from > master. This suggests our master address is wrong. Sleeping for 60s and then > restarting. > [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Caught Master_connection.Goto_handler > [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Connection to master died. I will continue > to retry indefinitely (supressing future logging of this message). > [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Connection to master died. I will continue > to retry indefinitely (supressing future logging of this message). 
> [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Sleeping 2.00 seconds before retrying > master connection... > [20140204T13:53:20.627Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] stunnel: Using commandline: > /usr/sbin/stunnel -fd 3c8aed8e-1fce-be7c-09f8-b45cdc40a1f5 > [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] stunnel: stunnel has pidty: (FEFork > (23,31207)) > [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] stunnel: stunnel start > [20140204T13:53:20.632Z| info|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] stunnel connected pid=31207 fd=20 > [20140204T13:53:28.874Z|error|xenserverhost1|4 > unix-RPC|session.login_with_password D:2e7664ad69ed|master_connection] Caught > Master_connection.Goto_handler > [20140204T13:53:28.874Z|debug|xenserverhost1|4 > unix-RPC
[jira] [Comment Edited] (CLOUDSTACK-6023) Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891934#comment-13891934 ]

Joris van Lieshout edited comment on CLOUDSTACK-6023 at 2/5/14 9:03 AM:
------------------------------------------------------------------------

Hi Hrikrishna,

We came to this conclusion by using tcpdump to capture the POST that got returned with a http 500 error from the pool master. This post, which exceeded the 300k limit of xapi rpc, contained for each vm the stats for each of the 32 vcpus (even though the instances were just using 1 vcpu), thus making the post exceed the limit. We are encountering this issue on a host running just 59 instances (inc 36 router vms that use just 1 vcpu but have a vcpumax of 32). My suggestion to resolve this issue would be to make the vcpu-max a configurable variable of a service/compute offering, with a default of vcpusmax=vcpus unless otherwise configured in the offering. In addition, I do wonder why there is a discrepancy between the XenServer Configuration Limits documentation and the documents you are referring to. In the end, we are actively experiencing this issue. I've attached a screen print of xentop on one of our xenserver 6.0.2 hosts with this issue. If it helps I can attach the packet capture with the post?

was (Author: jvanliesh...@schubergphilis.com):

Hi Hrikrishna,

We came to this conclusion by using tcpdump to capture the POST that got returned with a http 500 error from the pool master. This post, which exceeded the 300k limit of xapi rpc, contained for each vm the stats for each of the 32 vcpus (even though the instances were just using 1 vcpu), thus making the post exceed the limit. We are encountering this issue on a host running just 59 instances (inc 36 router vms that use just 1 vcpu but have a vcpumax of 32).
My suggestion to resolve this issue would be to make the vcpu-max a configurable variable of a service/compute offering with a default of vcpusmax=vcpus unless otherwise configured in the offering. in addition I do wonder why there is is a descrepency between the XenServer Configuration Limits documentation and the documents you are refering to. In the end we are actively experiencing this issue. I've attached a screen print of xentop on one of our xenserver 6.0.2 host with this issue. > Non windows instances are created on XenServer with a vcpu-max above > supported xenserver limits > --- > > Key: CLOUDSTACK-6023 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6023 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: XenServer >Affects Versions: Future, 4.2.1, 4.3.0 >Reporter: Joris van Lieshout >Priority: Blocker > Attachments: xentop.png > > > CitrixResourceBase.java contains a hardcoded value for vcpusmax for non > windows instances: > if (guestOsTypeName.toLowerCase().contains("windows")) { > vmr.VCPUsMax = (long) vmSpec.getCpus(); > } else { > vmr.VCPUsMax = 32L; > } > For all currently available versions of XenServer the limit is 16vcpus: > http://support.citrix.com/servlet/KbServlet/download/28909-102-664115/XenServer-6.0-Configuration-Limits.pdf > http://support.citrix.com/servlet/KbServlet/download/32312-102-704653/CTX134789%20-%20XenServer%206.1.0_Configuration%20Limits.pdf > http://support.citrix.com/servlet/KbServlet/download/34966-102-706122/CTX137837_XenServer%206_2_0_Configuration%20Limits.pdf > In addition there seems to be a limit to the total amount of assigned vpcus > on a XenServer. > The impact of this bug is that xapi becomes unstable and keeps losing it's > master_connection because the POST to the /remote_db_access is bigger then > it's limit of 200K. This basically renders a pool slave unmanageable. 
> If you would look at the running instances using xentop you will see hosts > reporting with 32 vcpus > Below the relevant portion of the xensource.log that shows the effect of the > bug: > [20140204T13:52:17.264Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: Using commandline: > /usr/sbin/stunnel -fd f3b8bb12-4e03-b47a-0dc5-85ad5aef79e6 > [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: stunnel has pidty: (FEFork > (43,30540)) > [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: stunnel start > [20140204T13:52:17.269Z| info|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel connected pid=30540 fd=40 > [20140204T13:52:17.346Z|error|xenserve
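The fix proposed in the comments above, deriving vcpus-max from the service/compute offering with a default of vcpusmax=vcpus, could be sketched roughly as follows. This is an illustrative assumption, not actual CloudStack code: the class name, the maxVcpusFromOffering parameter, and the hard 16-vCPU cap are hypothetical, and the real cap would depend on the XenServer version in use.

```java
// Hedged sketch of the reporter's suggestion: take VCPUsMax from the
// offering instead of the hardcoded 32L, default it to the requested
// vCPU count, and clamp it to what the hypervisor supports.
public class VcpuMaxPolicy {
    // Illustrative cap taken from the XenServer 6.x configuration-limits docs
    static final long XENSERVER_GUEST_VCPU_LIMIT = 16L;

    /**
     * @param cpus                 vCPUs requested for the instance
     * @param maxVcpusFromOffering optional vcpusmax from the offering (null = unset)
     */
    static long vcpusMax(int cpus, Long maxVcpusFromOffering) {
        // Default to the requested vCPU count when the offering sets nothing
        long max = (maxVcpusFromOffering != null) ? maxVcpusFromOffering : cpus;
        // Never below the requested count, never above the hypervisor limit
        return Math.min(Math.max(max, cpus), XENSERVER_GUEST_VCPU_LIMIT);
    }

    public static void main(String[] args) {
        System.out.println(vcpusMax(1, null)); // router VM: 1 instead of 32
        System.out.println(vcpusMax(4, 32L)); // offering asks for 32, clamped to 16
    }
}
```

With this policy the 36 router VMs from the report would carry vcpumax=1 instead of 32, shrinking the per-VM stats payload that overflows the xapi POST limit.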
[jira] [Commented] (CLOUDSTACK-6023) Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits
[ https://issues.apache.org/jira/browse/CLOUDSTACK-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891934#comment-13891934 ] Joris van Lieshout commented on CLOUDSTACK-6023: Hi Hrikrishna, We came to this conclusion by using tcpdump to capture the POST that got returned with an HTTP 500 error from the pool master. This POST, which exceeded the 300K limit of the xapi rpc, contained for each vm the stats for each of the 32 vcpus (even though the instances were just using 1 vcpu), thus making the POST exceed the 300K limit. We are encountering this issue on a host running just 59 instances (incl. 36 router vms that use just 1 vcpu but have a vcpumax of 32). My suggestion to resolve this issue would be to make the vcpu-max a configurable variable of a service/compute offering, with a default of vcpusmax=vcpus unless otherwise configured in the offering. In addition, I do wonder why there is a discrepancy between the XenServer Configuration Limits documentation and the documents you are referring to. In the end, we are actively experiencing this issue. I've attached a screen print of xentop on one of our XenServer 6.0.2 hosts with this issue. > Non windows instances are created on XenServer with a vcpu-max above > supported xenserver limits > --- > > Key: CLOUDSTACK-6023 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6023 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) 
> Components: XenServer >Affects Versions: Future, 4.2.1, 4.3.0 >Reporter: Joris van Lieshout >Priority: Blocker > > CitrixResourceBase.java contains a hardcoded value for vcpusmax for non > windows instances: > if (guestOsTypeName.toLowerCase().contains("windows")) { > vmr.VCPUsMax = (long) vmSpec.getCpus(); > } else { > vmr.VCPUsMax = 32L; > } > For all currently available versions of XenServer the limit is 16 vcpus: > http://support.citrix.com/servlet/KbServlet/download/28909-102-664115/XenServer-6.0-Configuration-Limits.pdf > http://support.citrix.com/servlet/KbServlet/download/32312-102-704653/CTX134789%20-%20XenServer%206.1.0_Configuration%20Limits.pdf > http://support.citrix.com/servlet/KbServlet/download/34966-102-706122/CTX137837_XenServer%206_2_0_Configuration%20Limits.pdf > In addition there seems to be a limit to the total amount of assigned vcpus > on a XenServer. > The impact of this bug is that xapi becomes unstable and keeps losing its > master_connection because the POST to /remote_db_access is bigger than > its limit of 200K. This basically renders a pool slave unmanageable. 
> If you would look at the running instances using xentop you will see hosts > reporting with 32 vcpus > Below the relevant portion of the xensource.log that shows the effect of the > bug: > [20140204T13:52:17.264Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: Using commandline: > /usr/sbin/stunnel -fd f3b8bb12-4e03-b47a-0dc5-85ad5aef79e6 > [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: stunnel has pidty: (FEFork > (43,30540)) > [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel: stunnel start > [20140204T13:52:17.269Z| info|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] stunnel connected pid=30540 fd=40 > [20140204T13:52:17.346Z|error|xenserverhost1|144 inet-RPC|host.call_plugin > R:e58e985539ab|master_connection] Received HTTP error 500 ({ method = POST; > uri = /remote_db_access; query = [ ]; content_length = [ 315932 ]; transfer > encoding = ; version = 1.1; cookie = [ > pool_secret=386bbf39-8710-4d2d-f452-9725d79c2393/aa7bcda9-8ebb-0cef-bb77-c6b496c5d859/1f928d82-7a20-9117-dd30-f96c7349b16e > ]; task = ; subtask_of = ; content-type = ; user_agent = xapi/1.9 }) from > master. This suggests our master address is wrong. Sleeping for 60s and then > restarting. > [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Caught Master_connection.Goto_handler > [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Connection to master died. I will continue > to retry indefinitely (supressing future logging of this message). > [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update > D:5c5376f0da6c|master_connection] Connection to master died. I will continue > to retry indefinitely (supressing future logging of this message). 
> [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0
[jira] [Created] (CLOUDSTACK-6024) template copy to primary storage uses a random source secstorage from any zone
Joris van Lieshout created CLOUDSTACK-6024: -- Summary: template copy to primary storage uses a random source secstorage from any zone Key: CLOUDSTACK-6024 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6024 Project: CloudStack Issue Type: Bug Security Level: Public (Anyone can view this level - this is the default.) Affects Versions: 4.2.1, 4.3.0 Environment: Multiple zones where the secstorage of a zone is not accessible to hosts from the other zone. Reporter: Joris van Lieshout Priority: Critical 2014-02-04 15:19:07,674 DEBUG [cloud.storage.VolumeManagerImpl] (Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) Checking if we need to prepare 1 volumes for VM[User|xx-app01] 2014-02-04 15:19:07,693 DEBUG [storage.image.TemplateDataFactoryImpl] (Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) template 467 is already in store:117, type:Image // store 117 is not accessible from the zone where this hypervisor lives 2014-02-04 15:19:07,705 DEBUG [storage.datastore.PrimaryDataStoreImpl] (Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) Not found (templateId:467poolId:208) in template_spool_ref, persisting it 2014-02-04 15:19:07,718 DEBUG [storage.image.TemplateDataFactoryImpl] (Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) template 467 is already in store:208, type:Primary 2014-02-04 15:19:07,722 DEBUG [storage.volume.VolumeServiceImpl] (Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) Found template 467-2-6c05b599-95ed-34c3-b8f0-fd9c30bac938 in storage pool 208 with VMTemplateStoragePool id: 36433 2014-02-04 15:19:07,732 DEBUG [storage.volume.VolumeServiceImpl] (Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) Acquire lock on VMTemplateStoragePool 36433 with timeout 3600 seconds 2014-02-04 15:19:07,737 INFO [storage.volume.VolumeServiceImpl] (Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) lock is acquired 
for VMTemplateStoragePool 36433 2014-02-04 15:19:07,748 DEBUG [storage.motion.AncientDataMotionStrategy] (Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) copyAsync inspecting src type TEMPLATE copyAsync inspecting dest type TEMPLATE 2014-02-04 15:19:07,775 DEBUG [agent.manager.ClusteredAgentAttache] (Job-Executor-92:job-221857 = [ 6f2d5dbb-575e-49b9-89dd-d7567869849e ]) Seq 93-1862347354: Forwarding Seq 93-1862347354: { Cmd , MgmtId: 345052370018, via: 93, Ver: v1, Flags: 100111, [{"org.apache.cloudstack.storage.command.CopyCommand":{"srcTO":{"org.apache.cloudstack.storage.to.TemplateObjectTO":{"path":"template/tmpl/2/467/c263eb76-3d72-3732-8cc6-42b0dad55c4d.vhd","origUrl":"http://x.x.com/image/centos64x64-daily-v1b104.vhd","uuid":"ca5e3f26-e9b6-41c8-a85b-df900be5673c","id":467,"format":"VHD","accountId":2,"checksum":"604a8327bd83850ed621ace2ea84402a","hvm":true,"displayText":"centos template created by hans.pl from machine name centos-daily-b104","imageDataStore":{"com.cloud.agent.api.to.NfsTO":{"_url":"nfs://.storage..xx.xxx/volumes/pool0/--1-1","_role":"Image"}},"name":"467-2-6c05b599-95ed-34c3-b8f0-fd9c30bac938","hypervisorType":"XenServer"}},"destTO":{"org.apache.cloudstack.storage.to.TemplateObjectTO":{"origUrl":"http://xx.xx.com/image/centos64x64-daily-v1b104.vhd","uuid":"ca5e3f26-e9b6-41c8-a85b-df900be5673c","id":467,"format":"VHD","accountId":2,"checksum":"604a8327bd83850ed621ace2ea84402a","hvm":true,"displayText":"centos template created by hans.pl from machine name centos-daily-b104","imageDataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"b290385b-466d-3243-a939-3d242164e034","id":208,"poolType":"NetworkFilesystem","host":"..x.net","path":"/volumes/pool0/xx-XEN-1","port":2049}},"name":"467-2-6c05b599-95ed-34c3-b8f0-fd9c30bac938","hypervisorType":"XenServer"}},"executeInSequence":true,"wait":10800}}] } to 345052370017 ===FILE: server/src/com/cloud/storage/VolumeManagerImpl.java public void 
prepare(VirtualMachineProfile vm, DeployDestination dest) throws StorageUnavailableException, InsufficientStorageCapacityException, ConcurrentOperationException { if (dest == null) { if (s_logger.isDebugEnabled()) { s_logger.debug("DeployDestination cannot be null, cannot prepare Volumes for the vm: " + vm); } throw new CloudRuntimeException( "Unable to prepare Volume for vm because DeployDestination is null, vm:" + vm); } List vols = _volsDao.findUsableVolumesForInstance(vm.getId()); if (s_logger.isDebugEnabled()) { s_logger.debug("C
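The gist of CLOUDSTACK-6024 is that template 467 is sourced from store 117 even though that store is unreachable from the requesting zone. A hedged sketch of what zone-aware source selection might look like is below; the ImageStore type and storesForZone helper are hypothetical stand-ins, not actual CloudStack classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: restrict template source selection to image stores
// in the caller's zone, so a host never gets a secstorage URL it cannot reach.
public class ZoneScopedStoreSelector {
    static class ImageStore {
        final long id;
        final long zoneId;
        ImageStore(long id, long zoneId) { this.id = id; this.zoneId = zoneId; }
    }

    /** Return only the stores holding the template that live in the given zone. */
    static List<ImageStore> storesForZone(List<ImageStore> holdingTemplate, long zoneId) {
        List<ImageStore> result = new ArrayList<>();
        for (ImageStore s : holdingTemplate) {
            if (s.zoneId == zoneId) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<ImageStore> stores = new ArrayList<>();
        stores.add(new ImageStore(117, 2)); // store 117 lives in another zone
        stores.add(new ImageStore(118, 1));
        // Only store 118 is a valid copy source for a host in zone 1.
        System.out.println(storesForZone(stores, 1).get(0).id); // prints 118
    }
}
```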
[jira] [Created] (CLOUDSTACK-6023) Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits
Joris van Lieshout created CLOUDSTACK-6023: -- Summary: Non windows instances are created on XenServer with a vcpu-max above supported xenserver limits Key: CLOUDSTACK-6023 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6023 Project: CloudStack Issue Type: Bug Security Level: Public (Anyone can view this level - this is the default.) Components: XenServer Affects Versions: 4.2.1 Reporter: Joris van Lieshout Priority: Blocker CitrixResourceBase.java contains a hardcoded value for vcpusmax for non windows instances: if (guestOsTypeName.toLowerCase().contains("windows")) { vmr.VCPUsMax = (long) vmSpec.getCpus(); } else { vmr.VCPUsMax = 32L; } For all currently available versions of XenServer the limit is 16 vcpus: http://support.citrix.com/servlet/KbServlet/download/28909-102-664115/XenServer-6.0-Configuration-Limits.pdf http://support.citrix.com/servlet/KbServlet/download/32312-102-704653/CTX134789%20-%20XenServer%206.1.0_Configuration%20Limits.pdf http://support.citrix.com/servlet/KbServlet/download/34966-102-706122/CTX137837_XenServer%206_2_0_Configuration%20Limits.pdf In addition there seems to be a limit to the total amount of assigned vcpus on a XenServer. The impact of this bug is that xapi becomes unstable and keeps losing its master_connection because the POST to /remote_db_access is bigger than its limit of 200K. This basically renders a pool slave unmanageable. 
If you would look at the running instances using xentop you will see hosts reporting with 32 vcpus Below the relevant portion of the xensource.log that shows the effect of the bug: [20140204T13:52:17.264Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin R:e58e985539ab|master_connection] stunnel: Using commandline: /usr/sbin/stunnel -fd f3b8bb12-4e03-b47a-0dc5-85ad5aef79e6 [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin R:e58e985539ab|master_connection] stunnel: stunnel has pidty: (FEFork (43,30540)) [20140204T13:52:17.269Z|debug|xenserverhost1|144 inet-RPC|host.call_plugin R:e58e985539ab|master_connection] stunnel: stunnel start [20140204T13:52:17.269Z| info|xenserverhost1|144 inet-RPC|host.call_plugin R:e58e985539ab|master_connection] stunnel connected pid=30540 fd=40 [20140204T13:52:17.346Z|error|xenserverhost1|144 inet-RPC|host.call_plugin R:e58e985539ab|master_connection] Received HTTP error 500 ({ method = POST; uri = /remote_db_access; query = [ ]; content_length = [ 315932 ]; transfer encoding = ; version = 1.1; cookie = [ pool_secret=386bbf39-8710-4d2d-f452-9725d79c2393/aa7bcda9-8ebb-0cef-bb77-c6b496c5d859/1f928d82-7a20-9117-dd30-f96c7349b16e ]; task = ; subtask_of = ; content-type = ; user_agent = xapi/1.9 }) from master. This suggests our master address is wrong. Sleeping for 60s and then restarting. [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update D:5c5376f0da6c|master_connection] Caught Master_connection.Goto_handler [20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update D:5c5376f0da6c|master_connection] Connection to master died. I will continue to retry indefinitely (supressing future logging of this message). [20140204T13:53:18.620Z|error|xenserverhost1|10|dom0 networking update D:5c5376f0da6c|master_connection] Connection to master died. I will continue to retry indefinitely (supressing future logging of this message). 
[20140204T13:53:18.620Z|debug|xenserverhost1|10|dom0 networking update D:5c5376f0da6c|master_connection] Sleeping 2.00 seconds before retrying master connection... [20140204T13:53:20.627Z|debug|xenserverhost1|10|dom0 networking update D:5c5376f0da6c|master_connection] stunnel: Using commandline: /usr/sbin/stunnel -fd 3c8aed8e-1fce-be7c-09f8-b45cdc40a1f5 [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update D:5c5376f0da6c|master_connection] stunnel: stunnel has pidty: (FEFork (23,31207)) [20140204T13:53:20.632Z|debug|xenserverhost1|10|dom0 networking update D:5c5376f0da6c|master_connection] stunnel: stunnel start [20140204T13:53:20.632Z| info|xenserverhost1|10|dom0 networking update D:5c5376f0da6c|master_connection] stunnel connected pid=31207 fd=20 [20140204T13:53:28.874Z|error|xenserverhost1|4 unix-RPC|session.login_with_password D:2e7664ad69ed|master_connection] Caught Master_connection.Goto_handler [20140204T13:53:28.874Z|debug|xenserverhost1|4 unix-RPC|session.login_with_password D:2e7664ad69ed|master_connection] Connection to master died. I will continue to retry indefinitely (supressing future logging of this message). [20140204T13:53:28.874Z|error|xenserverhost1|4 unix-RPC|session.login_with_password D:2e7664ad69ed|master_connection] Connection to master died. I will continue to retry indefinitely (supressing future logging of this message). [20140204T13:53:28.8
[jira] [Created] (CLOUDSTACK-6020) createPortForwardingRule failes for vmguestip above 127.255.255.255
Joris van Lieshout created CLOUDSTACK-6020: -- Summary: createPortForwardingRule failes for vmguestip above 127.255.255.255 Key: CLOUDSTACK-6020 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6020 Project: CloudStack Issue Type: Bug Security Level: Public (Anyone can view this level - this is the default.) Components: API Affects Versions: 4.2.0, 4.1.0, 4.0.2, 4.0.1, 4.0.0, pre-4.0.0, 4.1.1, Future, 4.2.1, 4.1.2, 4.3.0, 4.4.0 Reporter: Joris van Lieshout command=createPortForwardingRule&response=json&sessionkey=FmHQb9oGmgKlM4ihB%2Fb2ik7p35E%3D&ipaddressid=d29bebfe-edc1-406f-b4ed-7a49c6e7ee1f&privateport=80&privateendport=80&publicport=80&publicendport=80&protocol=tcp&virtualmachineid=cc5c9dc4-3eeb-4533-994a-0e2636a48a60&openfirewall=false&vmguestip=192.168.1.30&networkid=5e56227c-83c0-4b85-8a27-53343e806d12&_=1391510423905 vmguestip=192.168.1.30 api/src/org/apache/cloudstack/api/command/user/firewall/CreatePortForwardingRuleCmd.java @Parameter(name = ApiConstants.VM_GUEST_IP, type = CommandType.STRING, required = false, description = "VM guest nic Secondary ip address for the port forwarding rule") private String vmSecondaryIp; @Override public void create() { // cidr list parameter is deprecated if (cidrlist != null) { throw new InvalidParameterValueException("Parameter cidrList is deprecated; if you need to open firewall rule for the specific cidr, please refer to createFirewallRule command"); } Ip privateIp = getVmSecondaryIp(); if (privateIp != null) { if ( !privateIp.isIp4()) { throw new InvalidParameterValueException("Invalid vm ip address"); } } try { PortForwardingRule result = _rulesService.createPortForwardingRule(this, virtualMachineId, privateIp, getOpenFirewall()); setEntityId(result.getId()); setEntityUuid(result.getUuid()); } catch (NetworkRuleConflictException ex) { s_logger.info("Network rule conflict: " , ex); s_logger.trace("Network Rule Conflict: ", ex); throw new ServerApiException(ApiErrorCode.NETWORK_RULE_CONFLICT_ERROR, 
ex.getMessage()); } } utils/src/com/cloud/utils/net/Ip.java public boolean isIp4() { return ip < Integer.MAX_VALUE; } public Ip(String ip) { this.ip = NetUtils.ip2Long(ip); } === ip2Long for 192.168.1.30 => 3232235806 === Integer.MAX_VALUE => 2^31 - 1 = 2147483647 3232235806 (192.168.1.30) is therefore bigger than Integer.MAX_VALUE, making isIp4() return FALSE and throwing an InvalidParameterValueException. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
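A minimal, self-contained reproduction of the failing check described above, together with a corrected range test. ip2Long here is an illustrative re-implementation standing in for NetUtils.ip2Long; the fixed isIp4 treats the address as an unsigned 32-bit value instead of comparing against the signed Integer.MAX_VALUE.

```java
public class IpCheck {
    // Illustrative stand-in for NetUtils.ip2Long: dotted quad -> long
    static long ip2Long(String ip) {
        long r = 0;
        for (String part : ip.split("\\.")) {
            r = (r << 8) | Long.parseLong(part);
        }
        return r;
    }

    // Buggy check from utils/src/com/cloud/utils/net/Ip.java:
    // rejects every address above 127.255.255.255
    static boolean isIp4Buggy(long ip) {
        return ip < Integer.MAX_VALUE;
    }

    // Corrected: any value in [0, 2^32 - 1] is a valid IPv4 address
    static boolean isIp4Fixed(long ip) {
        return ip >= 0 && ip <= 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        long ip = ip2Long("192.168.1.30");
        System.out.println(ip);                // prints 3232235806
        System.out.println(isIp4Buggy(ip));    // false -> InvalidParameterValueException
        System.out.println(isIp4Fixed(ip));    // true
    }
}
```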
[jira] [Updated] (CLOUDSTACK-692) The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in the process of being created.
[ https://issues.apache.org/jira/browse/CLOUDSTACK-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris van Lieshout updated CLOUDSTACK-692: -- Summary: The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in the process of being created. (was: The StorageManager-Scavenger deletes snapshots that are still in the process of being created.) > The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in > the process of being created. > --- > > Key: CLOUDSTACK-692 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-692 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Snapshot >Reporter: Joris van Lieshout >Priority: Minor > > Hi there, > I think we ran into a bug due to a concurrence of circumstances regarding > snapshotting and the cleanup of snapshots. > The CleanupSnapshotBackup process on the SSVM deletes vhd files that are not > known in the database, but when snapshots, especially long-running ones, are being > copied to secondary storage there is a gap between the start and finish of > the VDI-copy, where the uuid of the destination vhd is not registered in the > database. If the CleanupSnapshotBackup deletes the destination vhd during this > window it results in a hanging sparse_dd process on the XenServer hypervisor > pointing to a tapdisk2 process with no file behind it. > ===Secondary storage vm (2 hour time difference due to time zone). In the > second-to-last line you see the vhd being deleted. 
> 2013-09-04 03:14:45,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) > Request:Seq 261-1870805144: { Cmd , MgmtId: 345052370018, via: 261, Ver: v1, > Flags: 100011, > [{"CleanupSnapshotBackupCommand":{"secondaryStoragePoolURL":"nfs://mccpnas7.storage.mccp.mcinfra.net/volumes/pool0/MCCP-SHARED-1-1","dcId":1,"accountId":45,"volumeId":5863,"validBackupUUIDs":["1a56760b-d1c0-4620-8cf7-271951500d70","b6157bc9-085b-4ed6-95c2-4341f31c64bf","1ff967e3-3606-4112-9155-b1145b2ef576","12fbe4e3-1fdd-4c35-a961-0fce07cff584","278e9915-4f94-40c8-bef4-9c6bc82d4653","6fba1dd7-4736-47b3-9eed-148304c0e192","b9d8c9d8-6445-463b-b4e1-ab3b3f3a67a2","40ba5d72-c69a-46c2-973b-0570c1cabeac","774f2b0e-cdaf-4594-a9f9-4f872dcaad6e","8269f50b-6bec-427c-8186-540df6a75dbf","7b0c6e75-40cf-4dd7-826a-09b39f3da7b5","df7eac9c-137a-4655-9d21-d781916351f1","11ec2db1-a2fc-4221-ae1a-c1ab2bd59509","dfc348e1-af50-4d77-b4a0-6e86fc954e1c","98f64c0f-7498-4c70-8b70-beaefd723b45","c42f9dd5-079d-4b77-86dc-c19b7fbed817"],"wait":0}}] > } > 2013-09-04 03:14:45,722 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) > Processing command: com.cloud.agent.api.CleanupSnapshotBackupCommand > 2013-09-04 03:14:45,723 DEBUG [storage.resource.NfsSecondaryStorageResource] > (agentRequest-Handler-2:) Executing: mount > 2013-09-04 03:14:45,732 DEBUG [storage.resource.NfsSecondaryStorageResource] > (agentRequest-Handler-2:) Execution is successful. > 2013-09-04 03:14:45,772 WARN [storage.resource.NfsSecondaryStorageResource] > (agentRequest-Handler-2:) snapshot 8ca9fea4-8a98-4cc3-bba7-cc1dcf32bb24.vhd > is not recorded in DB, remove it > 2013-09-04 03:14:45,772 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) > Seq 261-1870805144: { Ans: , MgmtId: 345052370018, via: 261, Ver: v1, Flags: > 10, [{"Answer":{"result":true,"wait":0}}] } > management-server.log. 
here you see the snapshot being created, the > copyToSecStorage process starting, eventually timing out due to the hanging > vdi-copy, failing on retry because the vdi is in use (although no longer existing, > the vdi is still known on xen), retrying some more on another HV and > eventually giving up because it tries to create a duplicate SR. > 2013-09-04 04:27:10,931 DEBUG [cloud.async.AsyncJobManagerImpl] > (Job-Executor-69:job-95137) Executing > com.cloud.api.commands.CreateSnapshotCmd for job-95137 > 2013-09-04 04:27:10,971 DEBUG [agent.transport.Request] > (Job-Executor-69:job-95137) Seq 91-780303147: Sending { Cmd , MgmtId: > 345052370017, via: 91, Ver: v1, Flags: 100011, > [{"ManageSnapshotCommand":{"_commandSwitch":"-c","_volumePath":"9cb7af90-ca88-4b34-aa6f-bc21c3d4a3aa","_pool":{"id":208,"uuid":"b290385b-466d-3243-a939-3d242164e034","host":"mccpnas3-4-vip1.mccp.mcinfra.net","path":"/volumes/pool0/MCCP-S-SBP1-1_MCCP-XEN-1","port":2049,"type":"NetworkFilesystem"},"_snapshotName":"vlstws3_ROOT-2736_20130904022710","_snapshotId":71889,"_vmName":"i-45-2736-VM","wait":0}}] > } > 2013-09-04 04:27:10,971 DEBUG [agent.transport.Request] > (Job-Executor-69:job-95137) Seq 91-780303147: Executing: { Cmd , MgmtId: > 345052370017, via: 91, Ver: v1, Flags: 100011, > [{"ManageSnapshotComma
[jira] [Updated] (CLOUDSTACK-692) The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in the process of being copied to secondary storage.
[ https://issues.apache.org/jira/browse/CLOUDSTACK-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris van Lieshout updated CLOUDSTACK-692: -- Summary: The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in the process of being copied to secondary storage. (was: The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in the process of being created.) > The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in > the process of being copied to secondary storage. > --- > > Key: CLOUDSTACK-692 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-692 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Snapshot >Reporter: Joris van Lieshout >Priority: Minor > > Hi there, > I think we ran into a bug due to a concurrence of circumstances regarding > snapshotting and the cleanup of snapshots. > The CleanupSnapshotBackup process on the SSVM deletes vhd files that are not > known in the database, but when snapshots, especially long-running ones, are being > copied to secondary storage there is a gap between the start and finish of > the VDI-copy, where the uuid of the destination vhd is not registered in the > database. If the CleanupSnapshotBackup deletes the destination vhd during this > window it results in a hanging sparse_dd process on the XenServer hypervisor > pointing to a tapdisk2 process with no file behind it. > ===Secondary storage vm (2 hour time difference due to time zone). In the > second-to-last line you see the vhd being deleted. 
> 2013-09-04 03:14:45,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) > Request:Seq 261-1870805144: { Cmd , MgmtId: 345052370018, via: 261, Ver: v1, > Flags: 100011, > [{"CleanupSnapshotBackupCommand":{"secondaryStoragePoolURL":"nfs://mccpnas7.storage.mccp.mcinfra.net/volumes/pool0/MCCP-SHARED-1-1","dcId":1,"accountId":45,"volumeId":5863,"validBackupUUIDs":["1a56760b-d1c0-4620-8cf7-271951500d70","b6157bc9-085b-4ed6-95c2-4341f31c64bf","1ff967e3-3606-4112-9155-b1145b2ef576","12fbe4e3-1fdd-4c35-a961-0fce07cff584","278e9915-4f94-40c8-bef4-9c6bc82d4653","6fba1dd7-4736-47b3-9eed-148304c0e192","b9d8c9d8-6445-463b-b4e1-ab3b3f3a67a2","40ba5d72-c69a-46c2-973b-0570c1cabeac","774f2b0e-cdaf-4594-a9f9-4f872dcaad6e","8269f50b-6bec-427c-8186-540df6a75dbf","7b0c6e75-40cf-4dd7-826a-09b39f3da7b5","df7eac9c-137a-4655-9d21-d781916351f1","11ec2db1-a2fc-4221-ae1a-c1ab2bd59509","dfc348e1-af50-4d77-b4a0-6e86fc954e1c","98f64c0f-7498-4c70-8b70-beaefd723b45","c42f9dd5-079d-4b77-86dc-c19b7fbed817"],"wait":0}}] > } > 2013-09-04 03:14:45,722 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) > Processing command: com.cloud.agent.api.CleanupSnapshotBackupCommand > 2013-09-04 03:14:45,723 DEBUG [storage.resource.NfsSecondaryStorageResource] > (agentRequest-Handler-2:) Executing: mount > 2013-09-04 03:14:45,732 DEBUG [storage.resource.NfsSecondaryStorageResource] > (agentRequest-Handler-2:) Execution is successful. > 2013-09-04 03:14:45,772 WARN [storage.resource.NfsSecondaryStorageResource] > (agentRequest-Handler-2:) snapshot 8ca9fea4-8a98-4cc3-bba7-cc1dcf32bb24.vhd > is not recorded in DB, remove it > 2013-09-04 03:14:45,772 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) > Seq 261-1870805144: { Ans: , MgmtId: 345052370018, via: 261, Ver: v1, Flags: > 10, [{"Answer":{"result":true,"wait":0}}] } > management-server.log. 
here you see the snapshot being created, the > copyToSecStorage process starting, eventually timing out due to the hanging > vdi-copy, failing on retry because the vdi is in use (although no longer existing, > the vdi is still known on xen), retrying some more on another HV and > eventually giving up because it tries to create a duplicate SR. > 2013-09-04 04:27:10,931 DEBUG [cloud.async.AsyncJobManagerImpl] > (Job-Executor-69:job-95137) Executing > com.cloud.api.commands.CreateSnapshotCmd for job-95137 > 2013-09-04 04:27:10,971 DEBUG [agent.transport.Request] > (Job-Executor-69:job-95137) Seq 91-780303147: Sending { Cmd , MgmtId: > 345052370017, via: 91, Ver: v1, Flags: 100011, > [{"ManageSnapshotCommand":{"_commandSwitch":"-c","_volumePath":"9cb7af90-ca88-4b34-aa6f-bc21c3d4a3aa","_pool":{"id":208,"uuid":"b290385b-466d-3243-a939-3d242164e034","host":"mccpnas3-4-vip1.mccp.mcinfra.net","path":"/volumes/pool0/MCCP-S-SBP1-1_MCCP-XEN-1","port":2049,"type":"NetworkFilesystem"},"_snapshotName":"vlstws3_ROOT-2736_20130904022710","_snapshotId":71889,"_vmName":"i-45-2736-VM","wait":0}}] > } > 2013-09-04 04:27:10,971 DEBUG [agent.transport.Request] > (Job-Executor-69:job-95137) Seq 91-780303147: Executing: { Cmd , MgmtId:
[jira] [Commented] (CLOUDSTACK-692) The StorageManager-Scavenger deletes snapshots that are still in the process of being created.
[ https://issues.apache.org/jira/browse/CLOUDSTACK-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769529#comment-13769529 ] Joris van Lieshout commented on CLOUDSTACK-692: --- How to clean up on XenServer after you have hit this bug:
1. Find the sparse_dd process:
   ps -ef | grep sparse_dd
2. Find the vbd of the destination sparse_dd device:
   xe vbd-list device=${dest device f.i. xvbd} vm-uuid=${UUID of Dom0}
3. Find the tapdisk2 process for this vbd:
   xe vbd-param-get uuid=${UUID VBD step2} param-name=vdi-uuid
   tap-ctl list | grep ${uuid of VDI}
   ls ${path of vhd from tap-ctl list}
4. Also get the uuid and name of the SR for later use:
   xe vdi-param-get uuid=${uuid of VDI} param-name=sr-uuid
   xe vdi-param-get uuid=${uuid of VDI} param-name=sr-name-label
5. ONLY continue if the vhd does not exist.
6. Create a dummy file to make the cleanup process go smoothly:
   touch ${path of vhd from tap-ctl list but with .raw instead of .vhd}
7. Kill the sparse_dd process:
   kill -9 ${PID of sparse_dd process step 1}
8. !!! It can take up to 10 minutes for this process to be killed. Only continue when the process is gone !!!
   ps -ef | grep ${PID of sparse_dd process step 1}
9. Close, detach and free the tapdisk2 process. Get your info from the previous tap-ctl list:
   tap-ctl close -m ${TAPMINOR} -p ${TAPPID}
   tap-ctl detach -m ${TAPMINOR} -p ${TAPPID}
   tap-ctl free -m ${TAPMINOR}
10. Now unplug the vbd, but put it in the background because the process sometimes hangs:
   xe vbd-unplug uuid=${uuid of VBD} &
11. If the vbd unplug hangs, check /var/log/xensource.log to see if it hangs on "watching xenstore paths: [ /local/domain/0/backend/vbd/0/51712/shutdown-done; /local/domain/0/error/device/vbd/51712/error ] with timeout 1200.00 seconds" by searching for the last line containing VBD.unplug. If so, AND ONLY IF SO, execute:
   xenstore-write /local/domain/0/backend/vbd/0/${get this from the xensource.log}/shutdown-done Ok
12. 
It's now safe to forget all the VDIs, unplug the pbd and forget the sr. The script below will also do it for stuff on other HVs in the cluster if CS has tried snapshotting there.
DESTSRs=`xe sr-list name-label=${name-label of sr (looks like uuid) from step 4, not the uuid of the sr.} --minimal | tr "," "\n"`
for SRloop in $DESTSRs
do
  PBD=`xe sr-param-get uuid=$SRloop param-name=PBDs`
  VDIs=`xe sr-param-get uuid=$SRloop param-name=VDIs | sed 's/;\ */\n/g'`
  for VDIloop in $VDIs
  do
    echo " Forgetting VDI $VDIloop"
    xe vdi-forget uuid=$VDIloop
  done
  echo " Unplugging PBD $PBD"
  xe pbd-unplug uuid=$PBD
  echo " Forgetting SR $SRloop"
  xe sr-forget uuid=$SRloop
done
13. And now everything is ready for another snapshot attempt. Let's hope the Storage Cleanup process keeps its cool. ;)
> The StorageManager-Scavenger deletes snapshots that are still in the process > of being created. > -- > > Key: CLOUDSTACK-692 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-692 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Snapshot >Reporter: Joris van Lieshout >Priority: Minor > > Hi there, > I think we ran into a bug due to a concurrence of circumstances regarding > snapshotting and the cleanup of snapshots. > The CleanupSnapshotBackup process on the SSVM deletes vhd files that are not > known in the database, but when snapshots, especially long-running ones, are being > copied to secondary storage there is a gap between the start and finish of > the VDI-copy, where the uuid of the destination vhd is not registered in the > database. If the CleanupSnapshotBackup deletes the destination vhd during this > window it results in a hanging sparse_dd process on the XenServer hypervisor > pointing to a tapdisk2 process with no file behind it. > ===Secondary storage vm (2 hour time difference due to time zone). In the > second-to-last line you see the vhd being deleted. 
> 2013-09-04 03:14:45,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) > Request:Seq 261-1870805144: { Cmd , MgmtId: 345052370018, via: 261, Ver: v1, > Flags: 100011, > [{"CleanupSnapshotBackupCommand":{"secondaryStoragePoolURL":"nfs://mccpnas7.storage.mccp.mcinfra.net/volumes/pool0/MCCP-SHARED-1-1","dcId":1,"accountId":45,"volumeId":5863,"validBackupUUIDs":["1a56760b-d1c0-4620-8cf7-271951500d70","b6157bc9-085b-4ed6-95c2-4341f31c64bf","1ff967e3-3606-4112-9155-b1145b2ef576","12fbe4e3-1fdd-4c35-a961-0fce07cff584","278e9915-4f94-40c8-bef4-9c6bc82d4653","6fba1dd7-4736-47b3-9eed-148304c0e192","b9d8c9d8-6445-463b-b4e1-ab3b3f3a67a2","40ba5d
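Steps 3 and 9 of the cleanup comment above both read the tapdisk PID and minor number out of `tap-ctl list`. As a minimal sketch (the `pid=… minor=… state=… args=…` output layout and the sample line below are illustrative assumptions, not taken from this incident), a shell helper to pull out those two fields for a given VDI UUID:

```shell
# Pull "pid" and "minor" out of the `tap-ctl list` line matching a VDI uuid.
# Assumes the common "pid=N minor=M state=S args=vhd:/path" output layout.
tap_info() {
    grep "$1" | tr ' ' '\n' | awk -F= '
        $1 == "pid"   { pid = $2 }
        $1 == "minor" { minor = $2 }
        END { print pid, minor }'
}

# Illustrative sample; on a real host this would be: tap-ctl list | tap_info $VDI_UUID
sample='pid=26553 minor=77 state=0 args=vhd:/var/run/sr-mount/1f7530d8-4615-c220-7f37-05862ddbfe3b/073893a6-e9cb-4cf6-8070-c6cf771db5d7.vhd'
echo "$sample" | tap_info 073893a6    # prints: 26553 77
```

The two values printed feed straight into the `tap-ctl close/detach/free` calls of step 9 as ${TAPPID} and ${TAPMINOR}.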
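The SR-forget script in the comment above leans on two xe output conventions: `--minimal` returns comma-separated UUIDs, and list-valued params (PBDs, VDIs) come back semicolon-separated. A small sketch of just that parsing, run against made-up sample strings rather than live `xe` output (GNU sed assumed for the `\n` replacement):

```shell
# xe --minimal output: comma-separated; list params (PBDs, VDIs): "; "-separated.
split_minimal()    { tr ',' '\n'; }
split_param_list() { sed 's/; */\n/g'; }

# Made-up UUID-ish samples standing in for `xe sr-list --minimal` and
# `xe sr-param-get ... param-name=VDIs` output:
srs='aaaa-1111,bbbb-2222'
vdis='cccc-3333; dddd-4444'
echo "$srs"  | split_minimal      # one SR uuid per line
echo "$vdis" | split_param_list   # one VDI uuid per line
```

Each resulting line can then be fed to `xe vdi-forget uuid=...` or `xe sr-forget uuid=...` in a for-loop, exactly as the script does.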
[jira] [Updated] (CLOUDSTACK-692) The StorageManager-Scavenger deletes snapshots that are still in the process of being created.
[ https://issues.apache.org/jira/browse/CLOUDSTACK-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris van Lieshout updated CLOUDSTACK-692:
--
Description:
Hi there,
I think we ran into a bug due to a concurrence of circumstances regarding snapshotting and the cleanup of snapshots. The CleanupSnapshotBackup process on the SSVM deletes vhd files that are not known in the database, but when snapshots (especially long-running ones) are being copied to secondary storage there is a gap between the start and finish of the VDI-copy during which the UUID of the destination vhd is not registered in the database. If CleanupSnapshotBackup deletes the destination vhd during this window it results in a hanging sparse_dd process on the XenServer hypervisor pointing to a tapdisk2 process with no file behind it.
===Secondary storage vm (2 hour time difference due to time zone). On the second to last line you see the vhd being deleted.
2013-09-04 03:14:45,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) Request:Seq 261-1870805144: { Cmd , MgmtId: 345052370018, via: 261, Ver: v1, Flags: 100011, [{"CleanupSnapshotBackupCommand":{"secondaryStoragePoolURL":"nfs://mccpnas7.storage.mccp.mcinfra.net/volumes/pool0/MCCP-SHARED-1-1","dcId":1,"accountId":45,"volumeId":5863,"validBackupUUIDs":["1a56760b-d1c0-4620-8cf7-271951500d70","b6157bc9-085b-4ed6-95c2-4341f31c64bf","1ff967e3-3606-4112-9155-b1145b2ef576","12fbe4e3-1fdd-4c35-a961-0fce07cff584","278e9915-4f94-40c8-bef4-9c6bc82d4653","6fba1dd7-4736-47b3-9eed-148304c0e192","b9d8c9d8-6445-463b-b4e1-ab3b3f3a67a2","40ba5d72-c69a-46c2-973b-0570c1cabeac","774f2b0e-cdaf-4594-a9f9-4f872dcaad6e","8269f50b-6bec-427c-8186-540df6a75dbf","7b0c6e75-40cf-4dd7-826a-09b39f3da7b5","df7eac9c-137a-4655-9d21-d781916351f1","11ec2db1-a2fc-4221-ae1a-c1ab2bd59509","dfc348e1-af50-4d77-b4a0-6e86fc954e1c","98f64c0f-7498-4c70-8b70-beaefd723b45","c42f9dd5-079d-4b77-86dc-c19b7fbed817"],"wait":0}}] }
2013-09-04 03:14:45,722 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-2:) Processing command: com.cloud.agent.api.CleanupSnapshotBackupCommand
2013-09-04 03:14:45,723 DEBUG [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-2:) Executing: mount
2013-09-04 03:14:45,732 DEBUG [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-2:) Execution is successful.
2013-09-04 03:14:45,772 WARN [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-2:) snapshot 8ca9fea4-8a98-4cc3-bba7-cc1dcf32bb24.vhd is not recorded in DB, remove it
2013-09-04 03:14:45,772 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) Seq 261-1870805144: { Ans: , MgmtId: 345052370018, via: 261, Ver: v1, Flags: 10, [{"Answer":{"result":true,"wait":0}}] }
===management-server.log: Here you see the snapshot being created, the copyToSecStorage process starting, eventually timing out due to the hanging VDI-copy, failing on retry because the VDI is in use (although it no longer exists, the VDI is still known to Xen), retrying some more on another hypervisor, and eventually giving up because it tries to create a duplicate SR.
2013-09-04 04:27:10,931 DEBUG [cloud.async.AsyncJobManagerImpl] (Job-Executor-69:job-95137) Executing com.cloud.api.commands.CreateSnapshotCmd for job-95137 2013-09-04 04:27:10,971 DEBUG [agent.transport.Request] (Job-Executor-69:job-95137) Seq 91-780303147: Sending { Cmd , MgmtId: 345052370017, via: 91, Ver: v1, Flags: 100011, [{"ManageSnapshotCommand":{"_commandSwitch":"-c","_volumePath":"9cb7af90-ca88-4b34-aa6f-bc21c3d4a3aa","_pool":{"id":208,"uuid":"b290385b-466d-3243-a939-3d242164e034","host":"mccpnas3-4-vip1.mccp.mcinfra.net","path":"/volumes/pool0/MCCP-S-SBP1-1_MCCP-XEN-1","port":2049,"type":"NetworkFilesystem"},"_snapshotName":"vlstws3_ROOT-2736_20130904022710","_snapshotId":71889,"_vmName":"i-45-2736-VM","wait":0}}] } 2013-09-04 04:27:10,971 DEBUG [agent.transport.Request] (Job-Executor-69:job-95137) Seq 91-780303147: Executing: { Cmd , MgmtId: 345052370017, via: 91, Ver: v1, Flags: 100011, [{"ManageSnapshotCommand":{"_commandSwitch":"-c","_volumePath":"9cb7af90-ca88-4b34-aa6f-bc21c3d4a3aa","_pool":{"id":208,"uuid":"b290385b-466d-3243-a939-3d242164e034","host":"mccpnas3-4-vip1.mccp.mcinfra.net","path":"/volumes/pool0/MCCP-S-SBP1-1_MCCP-XEN-1","port":2049,"type":"NetworkFilesystem"},"_snapshotName":"vlstws3_ROOT-2736_20130904022710","_snapshotId":71889,"_vmName":"i-45-2736-VM","wait":0}}] } 2013-09-04 04:27:12,949 DEBUG [agent.transport.Request] (Job-Executor-69:job-95137) Seq 91-780303147: Received: { Ans: , MgmtId: 345052370017, via: 91, Ver: v1, Flags: 10, { ManageSnapshotAnswer } } 2013-09-04 04:27:12,991 DEBUG [agent.transport.Request] (Job-Executor-69:job-95137) Seq 91-780303148: Sending { Cmd , MgmtId: 345052370017, via: 91, Ver: v1, Flags: 100011, [{"BackupSnapshotCommand":{"isVolumeInactive":false,"vmName":"i-45-2736-VM","snapshotId":71889,"pool":{"id":208,"uuid":"b2903
[jira] [Updated] (CLOUDSTACK-692) The StorageManager-Scavenger deletes snapshots that are still in the process of being created.
[ https://issues.apache.org/jira/browse/CLOUDSTACK-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris van Lieshout updated CLOUDSTACK-692:
--
Summary: The StorageManager-Scavenger deletes snapshots that are still in the process of being created. (was: The StorageManager-Scavenger deletes snapshots that are still in the process of being created at that time when the volume has older snapshots that do need scavenging)
> The StorageManager-Scavenger deletes snapshots that are still in the process
> of being created.
> --
>
> Key: CLOUDSTACK-692
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-692
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: Snapshot
>Reporter: Joris van Lieshout
>Priority: Minor
>
> Hi there,
> I think we ran into a bug due to a concurrence of circumstances regarding
> snapshotting and the cleanup of snapshots.
> The StorageManager-Scavenger instructs the StorageVM to delete a snapshot
> that is still in the process of being created on a hypervisor, when the
> volume has older snapshots that do need scavenging.
> The SR gets mounted for the snapshot to be created on.
> 2012-12-16 08:02:53,831 DEBUG [xen.resource.CitrixResourceBase]
> (DirectAgent-293:null) Host 192.168.###.42
> OpaqueRef:fae7f8be-8cf1-7b84-3d30-7202e172b530: Created a SR; UUID is
> 1f7530d8-4615-c220-7f37-05862ddbfe3b device config is
> {serverpath=/pool0/-###-dc-1-sec1/snapshots/163/1161,
> server=192.168.###.14}
> The SMlog on the XenServer shows that at this time the snapshot is still
> being created.
> 2012-12-16 08:37:08,768 DEBUG [agent.transport.Request] > (StorageManager-Scavenger-1:null) Seq 159-1958616345: Sending { Cmd , > MgmtId: 345052433504, via: 159, Ver: v1, Flags: 100011, [{"CleanupSnapshot > BackupCommand":{"secondaryStoragePoolURL":"nfs://192.168.###.14/pool0/-###-dc-1-sec1","dcId":2,"accountId":163,"volumeId":1161,"validBackupUUIDs":["b714a0ee-406e-4100-a75d-bc594391dca9","209bc1dd-f6 > 1a-486c-aecf-335590a907eb"],"wait":0}}] } > At this time we start seeing tapdisk errors on the XenServer indicating > that the vhd file is gone. > Dec 16 08:37:08 vm8 tapdisk[26553]: ERROR: errno -116 at vhd_complete: > /var/run/sr-mount/1f7530d8-4615-c220-7f37-05862ddbfe3b/073893a6-e9cb-4cf6-8070-c6cf771db5d7.vhd: > op: 2, lsec: 448131408, secs: > 88, nbytes: 45056, blk: 109407, blk_offset: 330368935 > Dec 16 08:37:08 vm8 tapdisk[26553]: ERROR: errno -116 at vhd_complete: > /var/run/sr-mount/1f7530d8-4615-c220-7f37-05862ddbfe3b/073893a6-e9cb-4cf6-8070-c6cf771db5d7.vhd: > op: 2, lsec: 448131496, secs: 40, nbytes: 20480, blk: 109407, blk_offset: > 330368935 > Dec 16 08:37:08 vm8 tapdisk[26553]: ERROR: errno -116 at vhd_complete: > /var/run/sr-mount/1f7530d8-4615-c220-7f37-05862ddbfe3b/073893a6-e9cb-4cf6-8070-c6cf771db5d7.vhd: > op: 4, lsec: 448131072, secs: 1, nbytes: 512, blk: 109407, blk_offset: > 330368935 > Dec 16 08:37:08 vm8 tapdisk[26553]: ERROR: errno -116 at > __tapdisk_vbd_complete_td_request: req tap-77.0: write 0x0058 secs @ > 0x1ab5f150 - Stale NFS file handle > Dec 16 08:37:08 vm8 tapdisk[26553]: ERROR: errno -116 at > __tapdisk_vbd_complete_td_request: req tap-77.1: write 0x0028 secs @ > 0x1ab5f1a8 - Stale NFS file handle -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira