Hello,

This morning, all the sfstack jobs were stuck in the zuul queues. Here
is how the repair went:

* First, check the jenkins master interface: there was no sfstack slave
  attached
* Then look at "nova list": there were plenty of sfstack instances running
* But "nodepool list" showed them as "deleted"

So nodepool lost track of its slaves, and it was unable to spawn new
instances because the tenant quota was exhausted.
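
As a rough sketch of the "nova list" check above, a small helper can
count the sfstack rows so the number can be compared with what
"nodepool list" reports. The name count_sfstack is hypothetical, and
the helper assumes the usual table output with one instance per row:

```shell
#!/bin/sh
# Hypothetical helper (not part of nova or nodepool): count the rows of
# "nova list"-style output, read from stdin, that mention "sfstack".
count_sfstack() {
    # "n + 0" forces a numeric 0 when no row matched.
    awk '/sfstack/ { n++ } END { print n + 0 }'
}

# Usage, assuming nova credentials are loaded in the environment:
#   nova list | count_sfstack
```

A count well above the number of slaves attached to jenkins would
point to the same leak as described here.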

To fix this situation, here are the commands we ran:

# Remove all sfstack instances from nova:
nova delete $(nova list | awk '/sfstack/ { print $2 }')
# Remove all unused floating IPs from neutron:
INSTANCE_LOCAL_IP="10.0.0"
for i in $(nova floating-ip-list | grep -v "$INSTANCE_LOCAL_IP" |
        awk '{ print $4 }'); do
        nova floating-ip-delete "$i"
done
# Flush the sfstack instances from the nodepool db:
mysql nodepool -e "delete from node where label_name like '%sfstack%'"
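
Before deleting anything, the floating IP selection can be previewed
with a dry run. This is only a sketch: unused_floating_ips is a
hypothetical helper, and it assumes, like the awk call above, that the
floating IP is the 4th whitespace-separated field of each table row.

```shell
#!/bin/sh
# Hypothetical helper: read "nova floating-ip-list"-style table output
# from stdin and print the floating IPs of rows that do not contain the
# given fixed-IP prefix (i.e. rows not bound to an instance).
unused_floating_ips() {
    prefix="$1"
    # -F treats the prefix as a literal string, so "." matches only a dot.
    grep -vF "$prefix" |
        # Keep only values that look like IPv4 addresses; this skips the
        # header and border rows of the table.
        awk '$4 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { print $4 }'
}

# Usage, assuming nova credentials are loaded in the environment:
#   nova floating-ip-list | unused_floating_ips "10.0.0"
```

The printed list can be reviewed first and then passed to
"nova floating-ip-delete" one address at a time.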


And that was it: the slaves got respawned and jobs are running as expected.

Now, it is not normal for nodepool to lose track of its slaves, and indeed
there are quite a few errors in nodepool.log:

ERROR nodepool.NodeDeleter: Exception deleting node 5563:
Traceback (most recent call last):
  File "nodepool/nodepool.py", line 368, in run
    node = session.getNode(self.node_id)
  File "nodepool/nodepool.py", line 1828, in _deleteNode
    # Don't write to the session if not needed.
  File "nodepool/nodepool.py", line 2254, in updateStats
    key = 'nodepool.label.%s.nodes.%s' % (
KeyError: 'nodepool.label.sfstack-selenium-centos-7.nodes.used'


And that remains to be investigated...

Cheers,
-Tristan


_______________________________________________
Softwarefactory-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/softwarefactory-dev
