Sowmya Krishnan created CLOUDSTACK-2918: -------------------------------------------
Summary: In a scaled up environment, hosts fail to come up after Management server restart in clustered set up Key: CLOUDSTACK-2918 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-2918 Project: CloudStack Issue Type: Bug Security Level: Public (Anyone can view this level - this is the default.) Components: Management Server Affects Versions: 4.2.0 Environment: Load Test environment with simulator Reporter: Sowmya Krishnan Fix For: 4.2.0 Scaled setup with 20K simulated hosts and as many VMs, Basic Zone, Clustered setup with 3 management servers. After restarting one or all the Management servers, re balancing hosts takes very long time, and the rebalanced hosts don't come to Up state either. Following is the error message from logs when the MS was starting up: 2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Joining node, IP: 10.223.48.2, msid: 206915885094132 2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Joining node, IP: 10.223.48.66, msid: 206915885097283 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Receive cluster alert, EventArgs: com.cloud.cluster.ClusterNodeJoinEventArgs 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 10.223.48.2, msidL: 206915885094132 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 10.223.48.66, msidL: 206915885097283 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Management server node 10.223.48.66 is up, send alert 2013-05-31 00:00:22,539 WARN [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Notifying management server join event took 178 ms 2013-05-31 00:00:25,825 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 5583ms, profilerH eartbeatUpdate: Done. Duration: 187ms, profilerPeerScan: Done. Duration: 560ms, profilerAgentLB: Done. Duration: 4836ms 2013-05-31 00:00:25,963 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Detected management node joined, id:3, nodeIP:10.223.48.130 2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Notify management server node join to listeners. 2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Joining node, IP: 10.223.48.130, msid: 206915885093830 2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Receive cluster alert, EventArgs: com.cloud.cluster.ClusterNodeJoinEventArgs 2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 10.223.48.130, msidL: 206915885093830 2013-05-31 00:00:25,964 WARN [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Notifying management server join event took 0 ms 2013-05-31 00:00:27,501 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1675ms, profilerH eartbeatUpdate: Done. Duration: 135ms, profilerPeerScan: Done. Duration: 179ms, profilerAgentLB: Done. Duration: 1361ms 2013-05-31 00:00:29,389 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1888ms, profilerH eartbeatUpdate: Done. Duration: 462ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1424ms 2013-05-31 00:00:33,912 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1643ms, profilerH eartbeatUpdate: Done. Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration: 1501ms 2013-05-31 00:00:41,328 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1675ms, profilerH eartbeatUpdate: Done. Duration: 178ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1495ms 2013-05-31 00:00:48,799 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Running Capacity Checker ... 2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) recalculating system capacity 2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing cpu/ram capacity update 2013-05-31 00:00:48,883 DEBUG [network.router.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:null) Found 0 routers to update status. 2013-05-31 00:00:48,901 DEBUG [network.router.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:null) Found 0 networks to update RvR status. 2013-05-31 00:00:50,031 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1596ms, profilerHeartbeatUpdate: Done. Duration: 179ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration: 1416ms 2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:2 usedCpu: 500 reservedCpu: 0 2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:2 usedMem: 268435456 reservedMem: 0 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:3 usedCpu: 500 reservedCpu: 0 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:3 usedMem: 1073741824 reservedMem: 0 2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:5 usedCpu: 0 reservedCpu: 0 ... ... ... 2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:2 usedMem: 268435456 reservedMem: 0 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:3 usedCpu: 500 reservedCpu: 0 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:3 usedMem: 1073741824 reservedMem: 0 2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:5 usedCpu: 0 reservedCpu: 0 2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:5 usedMem: 0 reservedMem: 0 2013-05-31 00:00:50,414 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:6 usedCpu: 0 reservedCpu: 0 ... ... ... 2013-05-31 00:03:05,249 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:20002 usedMem: 0 reservedMem: 0 2013-05-31 00:03:05,249 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done executing cpu/ram capacity update 2013-05-31 00:03:05,250 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing storage capacity update 2013-05-31 00:03:05,890 DEBUG [cloud.storage.StorageManagerImpl] (CapacityChecker:null) Successfully set Capacity - 1099511627776 for capacity type - 9 , DataCenterId - 1, HostOrPoolId - 20 0, PodId 1 2013-05-31 00:03:06,080 DEBUG [cloud.storage.StorageManagerImpl] (CapacityChecker:null) Successfully set Capacity - 1099511627776 for capacity type - 9 , DataCenterId - 1, HostOrPoolId - 20 1, PodId 1 2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done executing storage capacity update 2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing capacity updates for public ip and Vlans 2013-05-31 00:04:33,488 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1565ms, profilerHeartbeatUpdate: Done. Duration: 122ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1441ms 2013-05-31 00:04:35,070 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1582ms, profilerHeartbeatUpdate: Done. Duration: 172ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1408ms 2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done capacity updates for public ip and Vlans 2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing capacity updates for private ip 2013-05-31 00:04:37,155 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 2085ms, profilerHeartbeatUpdate: Done. Duration: 158ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1925ms ... ... ... 2013-05-31 01:56:10,384 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 2105ms, profilerHeartbeatUpdate: Done. Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration: 1963ms 2013-05-31 01:56:12,043 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1659ms, profilerHeartbeatUpdate: Done. Duration: 120ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1537ms 2013-05-31 01:56:14,172 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 2129ms, profilerHeartbeatUpdate: Done. Duration: 96ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 2031ms All 3 Management servers have the same above message and the hosts continue to be in disconnected state mysql> select count(*), status, type, mgmt_server_id from host group by mgmt_server_id, status, type; +----------+--------------+--------------------+-----------------+ | count(*) | status | type | mgmt_server_id | +----------+--------------+--------------------+-----------------+ | 1 | Alert | SecondaryStorage | NULL | | 17854 | Disconnected | Routing | NULL | | 1 | Disconnected | SecondaryStorageVM | NULL | | 53 | Disconnected | Routing | 206915885093830 | | 1485 | Disconnected | Routing | 206915885094132 | | 608 | Disconnected | Routing | 206915885097283 | +----------+--------------+--------------------+-----------------+ 6 rows in set (0.07 sec) Observed this issue even with lesser hosts ~5K -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira