Re: [Labs-l] [Labs-announce] Mild but long-running Tools outage in process, about to get worse!

2017-06-29 Thread Andrew Bogott
After various failed measures, we're now trying to revert to the
older kernel and switch back between NFS servers yet again.  So Tools
NFS (and various associated services) will probably break, at least for
a few minutes.


With luck this will get us into a stable place, but I'll update again 
regardless.


-Andrew


On 6/29/17 3:27 PM, Andrew Bogott wrote:
> The tools cluster is suffering from several maladies right now.
> Existing services seem to be mostly fine, but any Kubernetes services
> that tried to restart in the last few hours probably failed to start,
> and new things are still failing to start.  Similarly, web services
> and other tools are failing to restart in several cases.
>
> There are various theories as to what's going on -- most likely
> it's a kernel-version incompatibility with the newly upgraded NFS
> server.  There was an earlier LDAP outage which is better understood
> and should be resolved by now.
>
> We apologize for the inconvenience, and are working frantically to
> restore stability.  There will be a follow-up email when things are
> resolved.
>
> -Andrew





___
Labs-announce mailing list
labs-annou...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/labs-announce
___
Labs-l mailing list
Labs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/labs-l


[Labs-l] [Labs-announce] Mild but long-running Tools outage in process

2017-06-29 Thread Andrew Bogott
The tools cluster is suffering from several maladies right now.
Existing services seem to be mostly fine, but any Kubernetes services
that tried to restart in the last few hours probably failed to start,
and new things are still failing to start.  Similarly, web services and
other tools are failing to restart in several cases.


There are various theories as to what's going on -- most likely
it's a kernel-version incompatibility with the newly upgraded NFS
server.  There was an earlier LDAP outage which is better understood and
should be resolved by now.


We apologize for the inconvenience, and are working frantically to 
restore stability.  There will be a follow-up email when things are 
resolved.


-Andrew





Re: [Labs-l] [Labs-announce] [Labs][Tools] Security reboots for NFS Servers

2017-06-29 Thread Madhumitha Viswanathan
The first set of reboots (for servers that power /home and /data/project
across NFS-enabled Labs instances, including Tools) will begin now.

On Tue, Jun 27, 2017 at 9:33 PM, Madhumitha Viswanathan <
mviswanat...@wikimedia.org> wrote:

> Hi all,
>
> We need to reboot the NFS Servers - starting with the servers that power
> /home and /data/project across NFS-enabled Labs instances, including Tools.
> This is scheduled to happen at 15:00 UTC on Thursday, 29 June 2017. The
> outage should be a short window of a few minutes, and there may be some
> intermittent failures in accessing these shares, and job scheduling on the
> grid, but service should ideally get restored quickly.
>
> We'll then reboot the server that powers /data/scratch, /public/dumps and
> maps (on all Labs instances including tools). This is scheduled to happen
> at 17:00 UTC on Thursday, 29 June 2017. These shares will be unavailable
> for a short time while the server is being rebooted.
>
> Apologies for any inconvenience caused. I will update the list when the
> reboots are done. Feel free to reach out to us on #wikimedia-cloud for any
> questions or concerns.
>
> --
> Madhumitha Viswanathan
> Operations Engineer, Cloud Services
>



-- 
--Madhu :)