[ https://issues.apache.org/jira/browse/CLOUDSTACK-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sowmya Krishnan updated CLOUDSTACK-4371:
----------------------------------------

    Attachment: agenttaskpool_334.log

> [Performance Testing] Basic zone with 20K Hosts, management server restart leaves the hosts in disconnected state for very long time
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-4371
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4371
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.)
>          Components: Management Server
>    Affects Versions: 4.2.0
>         Environment: Basic zone, with over 20K simulator hosts
>            Reporter: Sowmya Krishnan
>              Labels: performance
>             Fix For: 4.2.0
>
>         Attachments: agenttaskpool_334.log, ms1_restartfail.log.gz, ms2_restartfail.log.gz, ms3_restartfail.log.gz
>
>
> Basic zone performance test bed:
> 20K simulator hosts
> 3 Management servers
> 1 host/cluster
> Local storage
> Java heap size: 12GB
> db.cloud.maxActive=2000
> direct.agent.load.size=1000
> agent.lb.enabled=true
> Deploy around 20K simulator hosts with 3 Management servers clustered (no VMs deployed yet).
> After all hosts are deployed, stop all 3 Management servers and then start all 3 one after another.
> Result
> =====
> Hosts do not reach the Connected state even after 10 minutes. Around 2K of them go into the Alert state, while the rest remain in the Disconnected state.
> mysql> select count(*), status, resource_state, type, mgmt_server_id from host group by mgmt_server_id, status, type, resource_state;
> +----------+--------------+----------------+--------------------+----------------+
> | count(*) | status       | resource_state | type               | mgmt_server_id |
> +----------+--------------+----------------+--------------------+----------------+
> |     1946 | Alert        | Enabled        | Routing            |           NULL |
> |    18054 | Disconnected | Enabled        | Routing            |           NULL |
> |        1 | Disconnected | Enabled        | SecondaryStorageVM |           NULL |
> +----------+--------------+----------------+--------------------+----------------+
> 3 rows in set (0.11 sec)
> MS logs show a lot of storage pool exceptions while hosts try to connect:
> 2013-08-16 05:49:25,592 DEBUG [agent.transport.Request] (AgentTaskPool-12:null) Seq 13-32440322: Sending { Cmd , MgmtId: 206915885094132, via: 13, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CleanupNetworkRulesCmd":{"interval":2028,"wait":0}}] }
> 2013-08-16 05:49:25,592 DEBUG [agent.transport.Request] (AgentTaskPool-12:null) Seq 13-32440322: Executing: { Cmd , MgmtId: 206915885094132, via: 13, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CleanupNetworkRulesCmd":{"interval":2028,"wait":0}}] }
> 2013-08-16 05:49:25,592 DEBUG [xen.discoverer.XcpServerDiscoverer] (AgentTaskPool-14:null) Not XenServer so moving on.
> 2013-08-16 05:49:25,592 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-14:null) Sending Connect to listener: DeploymentPlanningManagerImpl_EnhancerByCloudStack_76f3d8e4
> 2013-08-16 05:49:25,591 DEBUG [cloud.resource.AgentResourceBase] (ClusteredAgentManager Timer:null) Deserializing simulated agent on reconnect
> 2013-08-16 05:49:25,594 INFO [network.security.SecurityGroupListener] (AgentTaskPool-12:null) Scheduled network rules cleanup, interval=2028
> 2013-08-16 05:49:25,594 INFO [network.security.SecurityGroupListener] (AgentTaskPool-12:null) Received a host startup notification
> 2013-08-16 05:49:25,595 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null) Sending Connect to listener: StoragePoolMonitor
> ...
> ...
> 2013-08-16 05:49:25,761 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null) Sending Connect to listener: ClusteredVirtualMachineManagerImpl_EnhancerByCloudStack_b5459b7b
> 2013-08-16 05:49:25,764 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-12:null) Found 0 VMs for host 13
> 2013-08-16 05:49:25,765 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null) Sending Connect to listener: LocalStoragePoolListener
> 2013-08-16 05:49:25,768 DEBUG [datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl] (AgentTaskPool-12:null) createPool Params @ scheme - Filesystem storageHost - 172.1.3.131 hostPath - /mnt/2a2463b4-4fd2-4ac7-ad3f-040a3046e478 port - -1
> 2013-08-16 05:49:25,771 DEBUG [datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl] (AgentTaskPool-12:null) Another active pool with the same uuid already exists
> 2013-08-16 05:49:25,772 WARN [cloud.storage.StorageManagerImpl] (AgentTaskPool-12:null) Unable to setup the local storage pool for Host[-13-Routing]
> com.cloud.utils.exception.CloudRuntimeException: Another active pool with the same uuid already exists
>     at org.apache.cloudstack.storage.datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl.initialize(CloudStackPrimaryDataStoreLifeCycleImpl.java:319)
>     at com.cloud.storage.StorageManagerImpl.createLocalStorage(StorageManagerImpl.java:647)
>     at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>     at com.cloud.storage.LocalStoragePoolListener.processConnect(LocalStoragePoolListener.java:86)
>     at com.cloud.agent.manager.AgentManagerImpl.notifyMonitorsOfConnection(AgentManagerImpl.java:587)
>     at com.cloud.agent.manager.AgentManagerImpl.handleDirectConnectAgent(AgentManagerImpl.java:1479)
>     at com.cloud.resource.ResourceManagerImpl.createHostAndAgent(ResourceManagerImpl.java:1739)
>     at com.cloud.resource.ResourceManagerImpl.createHostAndAgent(ResourceManagerImpl.java:1901)
>     at com.cloud.agent.manager.AgentManagerImpl$SimulateStartTask.run(AgentManagerImpl.java:1130)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:679)
> 2013-08-16 05:49:25,773 INFO [utils.exception.CSExceptionErrorCode] (AgentTaskPool-12:null) Could not find exception: com.cloud.exception.ConnectionException in error code list for exceptions
> 2013-08-16 05:49:25,773 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null) Monitor LocalStoragePoolListener says there is an error in the connect process for 13 due to Unable to setup the local storage pool for Host[-13-Routing]
> 2013-08-16 05:49:25,773 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null) Host 13 is disconnecting with event AgentDisconnected
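
To gauge how many hosts hit the local storage pool failure shown in the log excerpt, the attached management server logs could be tallied with a short script. The following is only a sketch: the regular expression is derived from the single WARN line quoted in this report and may not match other CloudStack versions, and the function name `count_pool_failures` and the sample lines are illustrative, not part of CloudStack.

```python
import re
from collections import Counter

# Matches the StorageManagerImpl WARN line quoted in the report.
# Assumption: the message wording is the same throughout the log.
FAILED_POOL = re.compile(
    r"WARN\s+\[cloud\.storage\.StorageManagerImpl\].*"
    r"Unable to setup the local storage pool for\s+Host\[-(\d+)-Routing\]"
)


def count_pool_failures(lines):
    """Return a Counter mapping host id to its number of pool setup failures."""
    failures = Counter()
    for line in lines:
        match = FAILED_POOL.search(line)
        if match:
            failures[int(match.group(1))] += 1
    return failures


# Two lines copied from the excerpt; only the first is a pool setup failure.
sample = [
    "2013-08-16 05:49:25,772 WARN [cloud.storage.StorageManagerImpl] "
    "(AgentTaskPool-12:null) Unable to setup the local storage pool for "
    "Host[-13-Routing]",
    "2013-08-16 05:49:25,773 INFO [agent.manager.AgentManagerImpl] "
    "(AgentTaskPool-12:null) Host 13 is disconnecting with event "
    "AgentDisconnected",
]

failures = count_pool_failures(sample)
print(len(failures), "host(s) with pool setup failures:", dict(failures))
```

The function accepts any iterable of lines, so an open log file can be passed directly. The follow-up WARN from AgentManagerImpl repeats the same message text, which is why the pattern is restricted to the StorageManagerImpl logger to avoid counting each failure twice.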