Re: [Labs-l] [Labs-announce] Mild but long-running Tools outage in process, about to get worse!
After various failed measures, we're now trying to revert to the older kernel and switch back between NFS servers yet again. So Tools NFS (and various associated services) will probably break, at least for a few minutes.

With luck this will get us into a stable place, but I'll update again regardless.

-Andrew

On 6/29/17 3:27 PM, Andrew Bogott wrote:
> The tools cluster is suffering from several maladies right now. Existing
> services seem to be mostly fine, but any Kubernetes services that tried to
> restart in the last few hours probably failed to start, and new things are
> still failing to start. Similarly, web services and other tools are failing
> to restart in several cases.
>
> There are various theories as to what's going on -- most likely it's a
> kernel-version incompatibility with the newly upgraded NFS server. There
> was an earlier LDAP outage which is better understood and should be
> resolved by now.
>
> We apologize for the inconvenience, and are working frantically to restore
> stability. There will be a follow-up email when things are resolved.
>
> -Andrew

___
Labs-announce mailing list
labs-annou...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/labs-announce
___
Labs-l mailing list
Labs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/labs-l
[Labs-l] [Labs-announce] Mild but long-running Tools outage in process
The tools cluster is suffering from several maladies right now. Existing services seem to be mostly fine, but any Kubernetes services that tried to restart in the last few hours probably failed to start, and new things are still failing to start. Similarly, web services and other tools are failing to restart in several cases.

There are various theories as to what's going on -- most likely it's a kernel-version incompatibility with the newly upgraded NFS server. There was an earlier LDAP outage which is better understood and should be resolved by now.

We apologize for the inconvenience, and are working frantically to restore stability. There will be a follow-up email when things are resolved.

-Andrew
Re: [Labs-l] [Labs-announce] [Labs][Tools] Security reboots for NFS Servers
The first set of reboots (for the servers that power /home and /data/project across NFS-enabled Labs instances, including Tools) will begin now.

On Tue, Jun 27, 2017 at 9:33 PM, Madhumitha Viswanathan <mviswanat...@wikimedia.org> wrote:

> Hi all,
>
> We need to reboot the NFS Servers - starting with the servers that power
> /home and /data/project across NFS enabled Labs instances, including Tools.
> This is scheduled to happen at 15:00 UTC on Thursday, 29 June 2017. The
> outage should be a short window of a few minutes, and there may be some
> intermittent failures in accessing these shares, and job scheduling on the
> grid, but service should ideally get restored quickly.
>
> We'll then reboot the server that powers /data/scratch, /public/dumps and
> maps (on all Labs instances including tools). This is scheduled to happen
> at 17:00 UTC on Thursday, 29 June 2017. These shares will be unavailable
> for a short time while the server is being rebooted.
>
> Apologies for any inconvenience caused, I will update the list when the
> reboots are done. Feel free to reach out to us on #wikimedia-cloud for any
> questions or concerns.
>
> --
> Madhumitha Viswanathan
> Operations Engineer, Cloud Services
>

--
--Madhu :)