Hi,

In response to sdague reporting that citycloud jobs were timing out, I investigated the mirror, suspecting it was not providing data fast enough.

There were some 170 htcacheclean jobs running, and the host had a load of over 100. I killed them all, but performance was still unacceptable. I suspected networking, but since the host was in such a bad state I decided to reboot it. Unfortunately, after the reboot it would get an address from DHCP but seemed to have DNS issues ... eventually it would ping, but nothing else was working.

I placed nodepool.o.o in the emergency file and removed lon1 to avoid jobs going there.

I used the citycloud live chat, and Kim helpfully investigated and ended up migrating mirror.lon1.citycloud.openstack.org to a new compute node. This appeared to fix things, for us at least.

nodepool.o.o has now been removed from the emergency file and the original config restored.

With hindsight, the excessive htcacheclean processes were clearly a symptom of a feedback loop: the network/DNS issues slowed each run down, so the runs started to bunch up over time. However, I still think we could minimise further issues by running it under a lock [1]. Other than that, I'm not sure there is much else we can do; I think this was largely an upstream issue.

Cheers,

-i

[1] https://review.openstack.org/#/c/492481/

_______________________________________________
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra
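
For illustration only, the "run it under a lock" idea mentioned above could look roughly like the sketch below. This is not the change proposed in [1]; the helper name run_locked and the lock path are hypothetical. It takes a non-blocking exclusive lock before starting the command, so a new run exits straight away if a previous one is still going, rather than piling up behind it.

```python
import fcntl
import subprocess


def run_locked(cmd, lock_path):
    """Run cmd only if nobody else currently holds the lock on lock_path.

    Returns the command's exit code, or None if the lock was busy.
    run_locked and lock_path are illustrative names, not part of any tool.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: give up at once instead of queueing.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock is dropped when lock_file is closed at the end of the block.
        return subprocess.call(cmd)
```

A periodic job would then call something like run_locked(<the existing htcacheclean command line>, "/var/lock/htcacheclean.lock") and treat a None result as "previous run still in progress, skip this one".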