Toolforge has just suffered a partial grid-engine outage. All grid
services should be back to normal as of this email; some k8s services
may misbehave for the next hour or two.
NFS misbehavior resulted in grid control mechanisms timing out, which
meant that no new jobs could be scheduled for the last 90 minutes or so.
We've rebooted the NFS server, which has resolved the primary issues;
however, rebooting NFS is itself disruptive and may have caused other
jobs (both on the grid and in k8s) to fail.
We're currently rebooting all k8s worker nodes, which will take a couple
of hours to complete. During those reboots, some jobs may fail or
be rescheduled unexpectedly.
Sorry for the outage! If your grid job was disrupted by this outage,
please take this as a sign to migrate your service off the grid! Details
about the grid shutdown can be found here:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#Timeline
-Andrew (+ Taavi who did most of the actual recovery work)
_______________________________________________
Cloud-announce mailing list -- cloud-announce@lists.wikimedia.org
List information:
https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/