[Cloud-announce] [Toolforge][GRID SHUTDOWN] Toolforge Grid Engine has been shut down

2024-03-14 Thread Bryan Davis
As of 2024-03-14T11:02 UTC the Toolforge Grid Engine service has been
shut down.[0][1]

This shutdown is the culmination of a final migration process from
Grid Engine to Kubernetes that started in late 2022.[2] Arturo
wrote a blog post in 2022 that gives a detailed explanation of why we
chose to take on the final shutdown project at that time.[3] The roots
of this change go back much further, however, to at least August 2015,
when Yuvi Panda posted to the labs-l list about looking for more
modern alternatives to the Grid Engine platform.[4]

Some tools have been lost, and a few technical volunteers have been
upset, as many of us have striven to meet a vision of a more secure,
performant, and maintainable platform for running the many critical
tools hosted by the Toolforge project. I am deeply sorry to each of
you who have been frustrated by this change, but today I stand to
celebrate the collective work and accomplishment of the many humans
who have helped imagine, design, implement, test, document, maintain,
and use the Kubernetes deployment and support systems in Toolforge.

Thank you to the past and present members of the Wikimedia Cloud
Services team. Thank you to the past and present technical volunteers
acting as Toolforge admins. Thank you to the many, many Toolforge tool
maintainers who use the platform, ask for new capabilities, and help
each other make ever better software for the Wikimedia movement. Thank
you to the folks who will keep moving the Toolforge project and
other technical spaces in the Wikimedia movement forward for many,
many years to come.


[0]: https://sal.toolforge.org/log/DrOgPI4BGiVuUzOd9I1b
[1]: https://wikitech.wikimedia.org/wiki/Obsolete:Toolforge/Grid
[2]: 
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#Timeline
[3]: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/
[4]: https://lists.wikimedia.org/pipermail/labs-l/2015-August/003955.html

Bryan, on behalf of the Toolforge administrators
-- 
Bryan Davis  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808


[Cloud-announce] ORES API endpoints deprecated; Lift Wing API endpoints available as replacement

2023-08-03 Thread Bryan Davis
The Foundation's Machine Learning team has announced [0] that users of
ORES models should begin migrating their code to the newer Lift Wing
system. They have prepared a page on Wikitech to help bot and tool
maintainers understand what steps will be needed and where they can go
to ask for help with the process [1].

[0]: 
https://lists.wikimedia.org/hyperkitty/list/wikitec...@lists.wikimedia.org/thread/EK65B7QCQHEG37C2ERPIUSP64OX3ZEUJ/
[1]: https://wikitech.wikimedia.org/wiki/ORES

Bryan
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808


[Cloud-announce] [Toolforge] Self-service tool deletion finally arrives!

2022-10-31 Thread Bryan Davis
TL;DR:
* https://toolsadmin.wikimedia.org now allows marking a tool as "disabled".
* Disabling a tool will immediately stop any running jobs including
webservices and prevent maintainers from logging in as the tool.
* Disabled tools are archived and deleted after 40 days.
* Disabled tools can be re-enabled at any time prior to being archived
and deleted.

"How can I delete a tool that I no longer want?" is a question that
folks have been asking for a very long time. I know of Phabricator
tasks going back to at least April 2016 [0] tracking such requests. A
bit over 5 years ago I created a Phabricator task to track figuring
out how to delete an unused tool [1]. Nearly 18 months ago Andrew
Bogott started to look into how we could automate the checklist of
cleanup steps that had been developed. By January 2022 Andrew had
implemented all of the pieces needed to complete the checklist. This came
with a command line tool that Toolforge admins have been able to use
to delete a tool. Today we have released updates to Striker
(<https://toolsadmin.wikimedia.org>) which finally expose a "disable
tool" button to a tool's maintainers [2].

When a tool is marked as disabled any running jobs it has on the Grid
Engine or Kubernetes backends are stopped. Changes are also made so
that new jobs cannot be started, any crontab file is archived, and
maintainers are prevented from using `become <toolname>`. Normally things
stay in this state for 40 days to give everyone a chance to change
their minds and re-enable the tool. Once the 40-day timer expires, the
system will proceed with cleanup tasks that are more difficult to
reverse including archiving and deleting the tool's $HOME and ToolsDB
databases. Ultimately the tool's group and user are deleted from the
LDAP directory which functionally completes the process.

A lot of system administration tasks are kind of boring, but this work
turned out to be actually pretty interesting. A Toolforge tool can
include quite a number of different parts. There can be jobs running
on the Grid Engine and/or Kubernetes, a crontab to start jobs
periodically, a database in ToolsDB, credentials for accessing the
Wiki Replicas, credentials for accessing the Toolforge Elasticsearch
cluster, a $HOME directory on the Toolforge NFS server, and account
information in the LDAP directory that powers Developer accounts and
Cloud VPS credentials. All of these things would ideally be removed
when a tool was successfully deleted. Some of them are things that we
would like to create historical archives of in case someone wanted to
recreate the tool's functionality. And in a perfect world we would
also be able to change our minds and start the tool back up if things
had not progressed to fully deleting the tool.

Andrew came up with a fairly elegant system to deal with this
complexity. He designed a series of processes which are each
responsible for a slice of the overall complexity. A process running
on the Grid controller is responsible for stopping running Grid Engine
jobs and changing the tool's quota so that no new jobs can be started.
A process running on the Crontab server archives the tool's crontab
configuration. A process running on the Kubernetes controller deletes
the tool's credentials for accessing the Kubernetes cluster, the
tool's namespace, and by extension removes all processes running in
the namespace. A process running on the NFS controller archives the
tool's $HOME directory contents and deletes the directory. It also
removes the tool from other LDAP membership lists (a tool can be a
co-maintainer of another tool) and deletes the tool's user and group
from the LDAP directory. A process archives ToolsDB tables. Another
process removes the tool's database credentials across the ToolsDB and
Wiki Replicas server pools. Many of these processes are implemented in
cloud/toolforge/disable-tool on Gerrit [3]. Others were added to
existing management controllers for creating Kubernetes and database
credentials. The processes all take cues from the LDAP directory and
tracking files in the tool's $HOME to create an eventually consistent,
decoupled collection of cleanup actions.
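
To make the shape of these decoupled processes concrete, here is a
minimal sketch of what one of them, a crontab archiver, could look
like. This is an illustration only, not code from disable-tool: the
LDAP "disabled" marker, the paths, and the helper are hypothetical
stand-ins for whatever cues the real processes use.

  #!/bin/bash
  # Hypothetical sketch of one decoupled cleanup process, run as root.
  # Each real process follows the same shape: look for a cue (LDAP
  # state or a tracking file), act idempotently, and leave a marker so
  # the work is not repeated on the next run.
  set -euo pipefail

  # Hypothetical cue check: does the tool's LDAP entry carry a
  # "disabled" marker? (The attribute used here is a stand-in.)
  is_disabled() {
      ldapsearch -x -LLL "(&(cn=tools.$1)(description=disabled))" dn \
          | grep -q '^dn:'
  }

  for home in /data/project/*; do
      tool=$(basename "$home")
      is_disabled "$tool" || continue
      # Tracking-file cue: skip tools an earlier run already handled.
      [ -e "$home/.crontab.archived" ] && continue
      # Archive and then remove the tool's crontab.
      crontab -u "tools.$tool" -l > "$home/crontab.archived" 2>/dev/null || true
      crontab -u "tools.$tool" -r 2>/dev/null || true
      touch "$home/.crontab.archived"
  done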

We still have some work to do to update documentation on wikitech and
Phabricator so that folks know where to find the new buttons. If you
find documentation that needs to be updated before someone else gets
to it, please feel empowered to be [[WP:BOLD]] and update it.

[0]: https://phabricator.wikimedia.org/T133777
[1]: https://phabricator.wikimedia.org/T170355
[2]: https://phabricator.wikimedia.org/T285403
[3]: 
https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/toolforge/disable-tool/
[[WP:BOLD]]: https://en.wikipedia.org/wiki/Wikipedia:Be_bold

Bryan, on behalf of the Toolforge administration team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]

[Cloud-announce] Re: [Toolforge] Moving git repos from Diffusion to GitLab starting 2022-09-06

2022-09-07 Thread Bryan Davis
On Tue, Sep 6, 2022 at 4:38 PM Bryan Davis  wrote:
>
> On Thu, Sep 1, 2022 at 4:26 PM Bryan Davis  wrote:
> >
> > What: Diffusion git hosting moving to GitLab [0][1]
> > When: Tuesday 2022-09-06 between 15:00 - 23:00 UTC
> > Why: Unblocking sunsetting work for Differential/Diffusion [2]
> >
> > What you can do: If you have not logged into
> > https://gitlab.wikimedia.org/ yet to attach your Developer account and
> > look around, now would be a great time!
> >
> >
> > Toolforge tool maintainers can use Striker
> > (<https://toolsadmin.wikimedia.org/>) to create git repositories for
> > each of their tools. Today these git repositories are hosted by
> > <https://phabricator.wikimedia.org/diffusion/>. Starting on 2022-09-06
> > new repositories will be hosted under
> > <https://gitlab.wikimedia.org/toolforge-repos> instead.
> >
> > We will also be migrating the 474 existing Striker created
> > repositories from Diffusion to GitLab starting on Tuesday 2022-09-06.
> > This process will involve making each existing Diffusion repository
> > read-only, copying it to GitLab, and finally configuring the Diffusion
> > repo to be a read-only mirror of the GitLab repo. We hope that this
> > set of operations will be the least disruptive way to migrate the
> > repositories to the new hosting platform.
> >
> > For tool maintainers with git repos that are migrating who *do not*
> > yet have their Developer account attached at
> > https://gitlab.wikimedia.org/, GitLab will send an email invitation to
> > join the new repo. Because of some quirks of the login process that we
> > are using for our GitLab service, the link in this email needs to be
> > used *after* you have attached your account in order to grant you
> > access [3].
> >
> >
> > [0]: https://phabricator.wikimedia.org/T296893
> > [1]: https://phabricator.wikimedia.org/T315706
> > [2]: https://phabricator.wikimedia.org/T191182
> > [3]: https://phabricator.wikimedia.org/T313366#8203450
>
> The application has been updated to create new git repositories at
> https://gitlab.wikimedia.org instead of in Phabricator. A few bugs
> have been found in the migration process for the now legacy Diffusion
> repos. I will continue to work through these issues during my work day
> tomorrow.
>
> NOTE: the "Git repositories" section of a tool's information in
> Striker now only displays GitLab hosted repositories. Do not panic if
> you look at a detail screen like
> <https://toolsadmin.wikimedia.org/tools/id/replag> and a Diffusion
> repo is no longer listed. We still have a record of the Diffusion repos
> in the database. The GitLab replacement will show up as soon as it has
> been migrated.

Migration of the existing Diffusion repositories to GitLab is now
complete. In the end 347 of the original 474 repositories were
migrated. The other 127 repositories were in some way invalid for
migration. See comments in https://phabricator.wikimedia.org/T315706
if you are interested in the details.
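
For maintainers with local clones of a migrated repository, the main
follow-up step is pointing the clone at the new host. A sketch; the
repository name under toolforge-repos is hypothetical and depends on
your tool:

  # Replace the old Diffusion remote with the new GitLab one, then
  # confirm the change.
  $ git remote set-url origin https://gitlab.wikimedia.org/toolforge-repos/mytool.git
  $ git remote -v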

Bryan
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808


[Cloud-announce] Re: [Toolforge] Moving git repos from Diffusion to GitLab starting 2022-09-06

2022-09-06 Thread Bryan Davis
On Thu, Sep 1, 2022 at 4:26 PM Bryan Davis  wrote:
>
> What: Diffusion git hosting moving to GitLab [0][1]
> When: Tuesday 2022-09-06 between 15:00 - 23:00 UTC
> Why: Unblocking sunsetting work for Differential/Diffusion [2]
>
> What you can do: If you have not logged into
> https://gitlab.wikimedia.org/ yet to attach your Developer account and
> look around, now would be a great time!
>
>
> Toolforge tool maintainers can use Striker
> (<https://toolsadmin.wikimedia.org/>) to create git repositories for
> each of their tools. Today these git repositories are hosted by
> <https://phabricator.wikimedia.org/diffusion/>. Starting on 2022-09-06
> new repositories will be hosted under
> <https://gitlab.wikimedia.org/toolforge-repos> instead.
>
> We will also be migrating the 474 existing Striker created
> repositories from Diffusion to GitLab starting on Tuesday 2022-09-06.
> This process will involve making each existing Diffusion repository
> read-only, copying it to GitLab, and finally configuring the Diffusion
> repo to be a read-only mirror of the GitLab repo. We hope that this
> set of operations will be the least disruptive way to migrate the
> repositories to the new hosting platform.
>
> For tool maintainers with git repos that are migrating who *do not*
> yet have their Developer account attached at
> https://gitlab.wikimedia.org/, GitLab will send an email invitation to
> join the new repo. Because of some quirks of the login process that we
> are using for our GitLab service, the link in this email needs to be
> used *after* you have attached your account in order to grant you
> access [3].
>
>
> [0]: https://phabricator.wikimedia.org/T296893
> [1]: https://phabricator.wikimedia.org/T315706
> [2]: https://phabricator.wikimedia.org/T191182
> [3]: https://phabricator.wikimedia.org/T313366#8203450

The application has been updated to create new git repositories at
https://gitlab.wikimedia.org instead of in Phabricator. A few bugs
have been found in the migration process for the now legacy Diffusion
repos. I will continue to work through these issues during my work day
tomorrow.

NOTE: the "Git repositories" section of a tool's information in
Striker now only displays GitLab hosted repositories. Do not panic if
you look at a detail screen like
<https://toolsadmin.wikimedia.org/tools/id/replag> and a Diffusion
repo is no longer listed. We still have a record of the Diffusion repos
in the database. The GitLab replacement will show up as soon as it has
been migrated.


Bryan
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808


[Cloud-announce] [Toolforge] Moving git repos from Diffusion to GitLab starting 2022-09-06

2022-09-01 Thread Bryan Davis
What: Diffusion git hosting moving to GitLab [0][1]
When: Tuesday 2022-09-06 between 15:00 - 23:00 UTC
Why: Unblocking sunsetting work for Differential/Diffusion [2]

What you can do: If you have not logged into
https://gitlab.wikimedia.org/ yet to attach your Developer account and
look around, now would be a great time!


Toolforge tool maintainers can use Striker
(<https://toolsadmin.wikimedia.org/>) to create git repositories for
each of their tools. Today these git repositories are hosted by
<https://phabricator.wikimedia.org/diffusion/>. Starting on 2022-09-06
new repositories will be hosted under
<https://gitlab.wikimedia.org/toolforge-repos> instead.

We will also be migrating the 474 existing Striker created
repositories from Diffusion to GitLab starting on Tuesday 2022-09-06.
This process will involve making each existing Diffusion repository
read-only, copying it to GitLab, and finally configuring the Diffusion
repo to be a read-only mirror of the GitLab repo. We hope that this
set of operations will be the least disruptive way to migrate the
repositories to the new hosting platform.

For tool maintainers with git repos that are migrating who *do not*
yet have their Developer account attached at
https://gitlab.wikimedia.org/, GitLab will send an email invitation to
join the new repo. Because of some quirks of the login process that we
are using for our GitLab service, the link in this email needs to be
used *after* you have attached your account in order to grant you
access [3].


[0]: https://phabricator.wikimedia.org/T296893
[1]: https://phabricator.wikimedia.org/T315706
[2]: https://phabricator.wikimedia.org/T191182
[3]: https://phabricator.wikimedia.org/T313366#8203450


Bryan, on behalf of the Toolforge administrators
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808


[Cloud-announce] [Toolforge] Toolforge tool maintainers are eligible to vote in current Board election

2022-08-23 Thread Bryan Davis
Voting in the 2022 Wikimedia Foundation Board of Trustees election is
now open: 
<https://meta.wikimedia.org/wiki/Wikimedia_Foundation_elections/2022/Community_Voting>

Per the eligibility rules [0], many technical contributors to the
Wikimedia movement are eligible to vote. In this election round
proactive work has been done by Taavi and others to ensure that
Toolforge maintainers with known SUL accounts are already on the
eligible voter list [1].

[0]: 
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_elections/2022/Voter_eligibility_guidelines#Developers
[1]: https://phabricator.wikimedia.org/T309754

Bryan
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808


[Cloud-announce] [Toolforge] Some tools broken by upstream Let's Encrypt certificate changes

2021-09-30 Thread Bryan Davis
TL;DR:
* Let's Encrypt [0] TLS certificates are "signed" by "root"
certificates to create a chain of trust
* The oldest "root" signing certificate for LE certs (DST Root CA X3)
expired on 2021-09-30 [1]
* Deprecated Toolforge Kubernetes containers only knew this root
certificate and not the newer root certificate (ISRG Root X1)
* Update your tool to a newer container to fix

We are starting to hear reports of tools that suddenly stopped working
on 2021-09-30. The common issue is accessing the APIs for Wikimedia
wikis.

The Wikimedia wikis use multiple TLS certificates issued by different
providers for redundancy and protection against a problem with a
single certificate provider. One of the certificate providers that we
use is Let's Encrypt (LE) [0]. LE certificates are themselves signed
by multiple "root" certificates to create a chain of trust that your
web browser or other TLS verifying software can trust. The oldest root
certificate (named "DST Root CA X3") used to sign the LE certificates
expired on 2021-09-30 [1]. Very old operating systems and some
compiled software do not have the newer root certificate (named "ISRG
Root X1") in their trusted certificate collection. These systems are
now rejecting LE certificates.
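
One way to check whether a particular environment trusts the chain a
server presents is openssl's s_client. This is a generic diagnostic
sketch, not a Toolforge-specific tool; any site using an LE
certificate works as the target. "Verify return code: 0 (ok)" means
the local trust store accepted the chain, while the expired-root
problem shows up here as a verification error instead.

  # Show the issuers in the presented chain and the local verification
  # result.
  $ echo | openssl s_client -connect en.wikipedia.org:443 \
      -servername en.wikipedia.org 2>/dev/null \
      | grep -E 'Verify return code| i:'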

In Toolforge, we think that this mainly affects tools running on the
Kubernetes cluster inside Debian Jessie based containers. Specifically
the "php5.6", "python", "python2", and "ruby2" containers are expected
to have issues with the LE certificate expiration based on what we
have found so far. Recommended replacement containers are "php7.4",
"python3.9", and "ruby25".

We also have reports of `mono` on the bastions + grid engine failing.
We do not yet have a fix for this. It will require us to compile and
install a newer version of mono for everyone who is using it.

Interested folks can follow progress of our infrastructure updates in
response to this issue at T291387 [2].

[0]: https://letsencrypt.org/
[1]: https://letsencrypt.org/docs/dst-root-ca-x3-expiration-september-2021/
[2]: https://phabricator.wikimedia.org/T291387

Bryan
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808


[Cloud-announce] Re: #wikimedia-cloud IRC channel available on libera.chat

2021-05-26 Thread Bryan Davis
On Thu, May 20, 2021 at 2:59 PM Bryan Davis  wrote:
>
> TL;DR:
> * The #wikimedia-cloud IRC channel is moving from Freenode to Libera.Chat.
> * Register an account on Libera.Chat and join us there!
>
> [...snip...]
>
> A new #wikimedia-cloud channel has been created on irc.libera.chat for
> this Wikimedia sub-community to use. The old channel on Freenode still
> exists and will be maintained at least until we can get all the bots
> moved, our documentation updated on wikitech, and we see more folks on
> the Libera channel than the Freenode one.

#wikimedia-cloud and related channels (-feed, -admin) are now closed
on Freenode.

If you are an IRC user and you haven't joined us in the
#wikimedia-cloud channel on Libera Chat yet, come on over and join us!
See <https://meta.wikimedia.org/wiki/IRC/Migrating_to_Libera_Chat> for
instructions on getting your Libera Chat account created.


Bryan, on behalf of the WMCS team and the Cloud VPS and Toolforge admins
--
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808


[Cloud-announce] [Toolforge] Outbound emails were stuck from 2021-03-31T14:56Z to 2021-04-20T21:52Z

2021-04-20 Thread Bryan Davis
TL;DR:
* We messed up when replacing the mail server in Toolforge
* We didn't notice that we had messed up for nearly 3 weeks
* Toolforge servers should be able to send outbound email again now

We have been working to replace some of the Cloud VPS instances in the
Toolforge project with new instances running Debian Buster
(<https://phabricator.wikimedia.org/T275864>). One step in this
process was to replace the mail server instance that handles all
outbound mail.

We set up a new mail server on 2021-03-31, but missed an important
configuration step of telling the rest of the instances in the
Toolforge project to use the new server when sending outgoing mail. A
Toolforge user reported on irc at 2021-04-20T21:11Z that they had not
received expected emails from their tool recently. Investigation
revealed the broken configuration and work started to correct the
problem. Around 2021-04-20T21:52Z we deployed the correct mail relay
host configuration. Over the next 30 minutes or so this configuration
update rolled out across the Toolforge instances, re-enabling outbound
mail sending. Around 2021-04-20T22:20Z we ran commands to instruct all
Toolforge instances to "unfreeze" emails which were queued for sending
but marked as "frozen" due to the prior invalid configuration.

Emails are now being sent out as expected. We apologize for the
interruption in service. We will also be looking into some active
monitoring system for outbound email delivery to catch problems
similar to this more quickly in the future.

Bryan, on behalf of the Toolforge admin team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] [Cloud VPS] TLS encryption fully enforced for *.wmflabs.org & *.wmcloud.org

2021-02-02 Thread Bryan Davis
On Tue, Aug 18, 2020 at 9:03 AM Bryan Davis  wrote:
>
> TL;DR:
> * HTTP -> HTTPS redirection is live (finally!)
> * Currently allowing a "POST loophole"
> * "POST loophole" will be closed on 2021-02-01
>
> Today we merged a small change [0] to the front proxy used by Cloud
> VPS projects [1]. This change brings automatic HTTP -> HTTPS
> redirection to the "domain proxy" service and a
> Strict-Transport-Security header with a 1 day duration.
>
> The current configuration is conservative. We will only redirect GET
> and HEAD requests to HTTPS to avoid triggering bugs in the handling of
> redirects during POST requests. This "POST loophole" is the same
> process that we followed when converting the production wiki farm and
> Toolforge to HTTPS.
>
> When we announced similar changes for Toolforge in 2019 [2] we forgot
> to set a timeline for closing the POST loophole. This time we are
> wiser! We will close the POST loophole and make all HTTP requests,
> regardless of the verb used, redirect to HTTPS on 2021-02-01. This 6
> month transition period should give us all a chance to find and update
> URLs to use https and to fix any dependent software that might break
> if a redirect was sent for a POST request.
>
> If you find issues in your projects resulting from this change, please
> do let us know. The tracking task for this change is T120486 [3]. We
> also provide support in the #wikimedia-cloud channel on Freenode and
> via the cl...@lists.wikimedia.org mailing list [4].
>
>
> [0]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/620122/
> [1]: 
> https://wikitech.wikimedia.org/wiki/Help:Using_a_web_proxy_to_reach_Cloud_VPS_servers_from_the_internet
> [2]: 
> https://phabricator.wikimedia.org/phame/post/view/132/migrating_tools.wmflabs.org_to_https/
> [3]: https://phabricator.wikimedia.org/T120486
> [4]: https://lists.wikimedia.org/mailman/listinfo/cloud

TL;DR:
* "POST loophole" closed per prior announcement on 2020-08-18
* 366 day Strict-Transport-Security header sent with all HTTPS responses

I am very happy to announce that today we have closed the "POST
loophole" for our *.wmflabs.org & *.wmcloud.org proxy layer [5]. This
is a follow up to the announcement of partial TLS enforcement by the
Cloud VPS front proxies on 2020-08-18.

There is a possibility that closing the POST loophole will break some
clients accessing services running in Cloud VPS behind the front
proxies. Specifically, POST actions sent to HTTP (not HTTPS) URLs will
now return a 301 Moved Permanently response to the same URL with the
scheme changed to https. The HTTP specifications are ambiguous about
how this response should be handled which means that implementations
in various browsers and libraries may or may not re-POST the original
payload to the new URL. The best fix we can suggest for this is
updating links and forms to always use HTTPS URLs.
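
Illustrated with curl below; the hostname and path are hypothetical,
and the output shown is the shape of the proxy's new response rather
than a captured transcript:

  # A POST to an http:// URL now receives a 301 pointing at the
  # https:// form. Whether the client re-POSTs the payload to the new
  # URL is client-dependent.
  $ curl -si -X POST http://example.wmcloud.org/api | grep -iE '^(HTTP|location)'
  HTTP/1.1 301 Moved Permanently
  Location: https://example.wmcloud.org/api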

If you find issues in your projects resulting from this change, please
do let us know. The tracking task for this change is T120486 [6]. We
also provide support in the #wikimedia-cloud channel on Freenode and
via the cl...@lists.wikimedia.org mailing list [7].

[5]: https://gerrit.wikimedia.org/r/661140
[6]: https://phabricator.wikimedia.org/T120486
[7]: https://lists.wikimedia.org/mailman/listinfo/cloud

Bryan, on behalf of the Cloud VPS admin team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] Clarification: database service names still use eqiad.wmflabs, not eqiad1.wikimedia.cloud

2020-09-11 Thread Bryan Davis
Several people have asked on IRC and Phabricator if the deprecation of
*.wmflabs names for Cloud VPS instances means that the service names
used to connect to ToolsDB and the Wiki Replicas are changing. The
answer is no, these names are not changing yet. We will be replacing
these service names eventually, but for now they are staying in the
wmflabs pseudo domain.


In 2017 [0] we established new canonical service names for accessing
the shared database servers for the Cloud Services environment. Those
names are still the same today.

The naming convention for connecting to the Wiki Replica servers is:
".{analytics,web}.db.svc.eqiad.wmflabs". The
*.web.db.svc.eqiad.wmflabs service names are intended for queries that
need a real-time response. The *.analytics.db.svc.eqiad.wmflabs
service names are intended for batch jobs and long running queries.
See the announcement from 2017 for more details [0].

The preferred service name for ToolsDB is tools.db.svc.eqiad.wmflabs.
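
Putting the convention together, a connection from a Toolforge bastion
or job might look like the following sketch. It assumes the usual
replica.my.cnf credentials file in the tool's $HOME and the _p suffix
used by replica database names:

  # Real-time query session against the English Wikipedia replica.
  $ mysql --defaults-file=$HOME/replica.my.cnf \
      -h enwiki.web.db.svc.eqiad.wmflabs enwiki_p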

[0]: 
https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/

Bryan, on behalf of the Cloud Services team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] [Toolforge] New features in toolsadmin.wikimedia.org

2020-09-02 Thread Bryan Davis
https://toolsadmin.wikimedia.org has been updated to a new version of
Striker [0]. This is a feature release that carries some quality of
life improvements for tool maintainers and updates for recent changes
to Toolforge webservice URLs [1]:

* Allow self-service creation of Phabricator projects for Tools
* Allow tool maintainers to delete toolinfo records
* Improved explanation of toolinfo fields
* Use *.toolforge.org URLs when generating toolinfo data

The killer feature in this list is self-service Phabricator project
creation! This action is available from the details page for any tool
right under the prior option for creating Diffusion repositories.

Many, many thanks to Taavi Väänänen (User:Majavah) for writing the
code for Phabricator project creation. Even more thanks for their
patience in waiting for code review and deployment.

[0]: https://wikitech.wikimedia.org/wiki/Striker
[1]: 
https://wikitech.wikimedia.org/wiki/Toolsadmin.wikimedia.org/Deployments#2020-09-02

Bryan, on behalf of the Toolforge admin team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] [Toolforge] Default /favicon.ico and /robots.txt for toolforge.org webservices

2020-08-26 Thread Bryan Davis
A change was just now made to the shared proxy system for Toolforge
which makes the proxy respond with default content for /favicon.ico
and /robots.txt when a tool's webservice returns a 404 Not Found
response for these files.

The default /favicon.ico is the same as
<https://tools-static.wmflabs.org/toolforge/favicons/favicon.ico>.

The default robots.txt denies access to all compliant web crawlers. We
decided that this "fail closed" approach would be safer than a "fail
open" telling all crawlers to crawl all tools. Any tool that does wish
to be indexed by search engines and other crawlers can serve their own
/robots.txt content. Please see <https://www.robotstxt.org/> for more
information on /robots.txt in general.
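
You can see the fail-closed default by fetching /robots.txt from any
tool that does not serve its own; the tool name here is hypothetical
and the body shown is the standard deny-all form described above:

  $ curl -s https://mytool.toolforge.org/robots.txt
  User-agent: *
  Disallow: /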

These changes fix a regression [0] in functionality caused by the
toolforge.org migration and the introduction of the 2020 Kubernetes
ingress layer. Previously the /robots.txt and /favicon.ico from the
"admin" tool were served for all tools due to the use of a shared
hostname.

[0]: https://phabricator.wikimedia.org/T251628

Bryan, on behalf of the Toolforge admin team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] [Cloud VPS] TLS encryption partially enforced for *.wmflabs.org & *.wmcloud.org

2020-08-18 Thread Bryan Davis
TL;DR:
* HTTP -> HTTPS redirection is live (finally!)
* Currently allowing a "POST loophole"
* "POST loophole" will be closed on 2021-02-01

Today we merged a small change [0] to the front proxy used by Cloud
VPS projects [1]. This change brings automatic HTTP -> HTTPS
redirection to the "domain proxy" service and a
Strict-Transport-Security header with a 1 day duration.

The current configuration is conservative. We will only redirect GET
and HEAD requests to HTTPS to avoid triggering bugs in the handling of
redirects during POST requests. This "POST loophole" is the same
process that we followed when converting the production wiki farm and
Toolforge to HTTPS.

When we announced similar changes for Toolforge in 2019 [2] we forgot
to set a timeline for closing the POST loophole. This time we are
wiser! We will close the POST loophole and make all HTTP requests,
regardless of the verb used, redirect to HTTPS on 2021-02-01. This 6
month transition period should give us all a chance to find and update
URLs to use https and to fix any dependent software that might break
if a redirect was sent for a POST request.

If you find issues in your projects resulting from this change, please
do let us know. The tracking task for this change is T120486 [3]. We
also provide support in the #wikimedia-cloud channel on Freenode and
via the cl...@lists.wikimedia.org mailing list [4].


[0]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/620122/
[1]: 
https://wikitech.wikimedia.org/wiki/Help:Using_a_web_proxy_to_reach_Cloud_VPS_servers_from_the_internet
[2]: 
https://phabricator.wikimedia.org/phame/post/view/132/migrating_tools.wmflabs.org_to_https/
[3]: https://phabricator.wikimedia.org/T120486
[4]: https://lists.wikimedia.org/mailman/listinfo/cloud

Bryan, on behalf of the Cloud VPS admin team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] [Cloud VPS] Puppet labs/private.git data loss incident affecting some projects

2020-06-04 Thread Bryan Davis
At 2020-06-04T11:12 UTC a change was merged to the
operations/puppet.git repository which resulted in data loss for Cloud
VPS projects using a local Puppetmaster
(role::puppetmaster::standalone). The specific data loss is removal of
any local to the Puppetmaster instance commits overlaid on the
upstream labs/private.git repository. These patches would have
contained passwords, ssh keys, TLS certificates, and similar
authentication information for Puppet managed configuration.

The majority of Cloud VPS projects are not affected by this
configuration data loss. Several highly used and visible projects,
including Toolforge (tools) and Beta Cluster (deployment-prep), have
some impact. We have disabled Puppet across all Cloud VPS instances
that were reachable by our central command and control service (cumin)
and are currently evaluating impact and recovering data from
/var/log/puppet.log change logs where available.

More information will be collected at
<https://phabricator.wikimedia.org/T254491> and an incident report
will also be prepared once the initial response is complete.

Bryan
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] Disk failure on cloudvirt1004 OpenStack host

2020-04-21 Thread Bryan Davis
cloudvirt1004 is one of our oldest generation of hypervisor servers.
The hypervisor servers are the machines which actually run the virtual
machine instances for Cloud VPS projects. This physical host is
experiencing an active hard disk and/or RAID controller failure. The
Cloud Services team is actively attempting to fix the server and
evacuate all instances running on it to other hypervisors.

See <https://phabricator.wikimedia.org/T250869> for more information
and progress updates.


The following projects and instances are affected:

* cloudvirt-canary
** canary1004-01.cloudvirt-canary.eqiad.wmflabs

* commonsarchive
** commonsarchive-mwtest.commonsarchive.eqiad.wmflabs

* deployment-prep
** deployment-echostore01.deployment-prep.eqiad.wmflabs
** deployment-schema-2.deployment-prep.eqiad.wmflabs

* incubator
** incubator-mw.incubator.eqiad.wmflabs

* machine-vision
** visionoid.machine-vision.eqiad.wmflabs

* ogvjs-integration
** media-streaming.ogvjs-integration.eqiad.wmflabs

* services
** Esther-outreachy-intern.services.eqiad.wmflabs

* shiny-r
** discovery-testing-02.shiny-r.eqiad.wmflabs

* tools
** tools-k8s-worker-38.tools.eqiad.wmflabs
** tools-k8s-worker-52.tools.eqiad.wmflabs
** tools-sgeexec-0901.tools.eqiad.wmflabs
** tools-sgewebgrid-lighttpd-0918.tools.eqiad.wmflabs
** tools-sgewebgrid-lighttpd-0919.tools.eqiad.wmflabs

* toolsbeta
** toolsbeta-sgewebgrid-generic-0901.toolsbeta.eqiad.wmflabs

* wikidata-autodesc
** wikidata-autodesc.wikidata-autodesc.eqiad.wmflabs

* wikilink
** wikilink-prod.wikilink.eqiad.wmflabs


Bryan, on behalf of the Cloud VPS admins and Cloud Services team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] [Toolforge] 2020 Kubernetes cluster migration complete

2020-03-06 Thread Bryan Davis
On 2020-03-03, Brooke completed the automatic migration phase of the
2020 Kubernetes migration by moving the last workloads from the legacy
Kubernetes cluster to the 2020 Kubernetes cluster [0].

All Toolforge tools using `webservice --backend=kubernetes ...` and/or
manually maintained Kubernetes objects are now running on the 2020
Kubernetes cluster. The Toolforge admin team is in the process of
tearing down the legacy cluster and cleaning up various documentation
and tooling related to it [1].

This project involved a lot of hard work that most of the Toolforge
community did not see. Brooke and Arturo started planning things over
a year ago [2] to ensure that the Toolforge admin team would be able
to complete this migration with a minimum amount of disruption to
tools and their maintainers. Along the journey they researched
Kubernetes best practices and recommendations, read and re-read
numerous tutorial and how-to docs, and designed a completely new
process to automate the deployment of Kubernetes in Toolforge. They
also sought and received help from other Toolforge admins, Wikimedia
Foundation staff, and technical volunteers. This was a truly
collaborative effort.

I am very happy to say that in my opinion we have a well automated and
monitored Kubernetes cluster in Toolforge today. There are many more
features that we will continue to work on as we try to make Kubernetes
use in Toolforge easier for everyone, but we can only do that work
because we now have this solid base to build on. I look forward to
announcements of many more features in the coming months.

Thank you to our alpha and beta testers who found more edge cases and
made good suggestions for simplifying things. Thank you all for your
patience and understanding when things did not go quite as planned
during this process. And finally thank you in advance for the edits
that will be made to help pages on Wikitech and elsewhere as we all
work on bug #1 (improving documentation).


[0]: https://phabricator.wikimedia.org/T246519
[1]: https://phabricator.wikimedia.org/T246689
[2]: https://phabricator.wikimedia.org/T214513

Bryan, on behalf of the Toolforge admin team and the Wikimedia Cloud
Services team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] [Toolforge] 2020 Kubernetes cluster automatic migration phase beginning

2020-02-20 Thread Bryan Davis
Following a beta testing period [0] and a general use self-migration
period [1], the Toolforge administration team is ready to begin the
final phase of automatic migration of tools currently running on the
legacy Kubernetes cluster to the 2020 Kubernetes cluster.

The migration process will involve Toolforge administrators running
`webservice migrate` for each tool in the same way that self-migration
happens [2]. A small number of tools are using the legacy Kubernetes
cluster outside of the `webservice` system. These tools will be moved
using a more manual process after all webservices have been moved. We are
currently planning on doing these migrations in several batches so
that we can monitor the load and capacity of the 2020 Kubernetes
cluster as we move ~640 more tools over from the legacy cluster.

Once the tools have all been moved to the 2020 cluster, we will
continue with additional clean up and default configuration changes
which will allow us to fully decommission the legacy cluster. We will
also be updating various documentation on Wikitech during this final
phase. We hope to complete this entire process by 2020-03-06 at the
latest.


[0]: 
https://lists.wikimedia.org/pipermail/cloud-announce/2020-January/000247.html
[1]: 
https://lists.wikimedia.org/pipermail/cloud-announce/2020-January/000252.html
[2]: 
https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration#Manually_migrate_a_webservice_to_the_new_cluster

Bryan (on behalf of the Toolforge admins and the Cloud Services team)
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] [Toolforge] 2020 Kubernetes cluster open for general use

2020-01-24 Thread Bryan Davis
The Toolforge admins would like to invite all Toolforge Kubernetes
users to begin migration of their tools to the 2020 Kubernetes
cluster. Instructions for migration and other details are on Wikitech
[0].

Timeline:
* 2020-01-09: 2020 Kubernetes cluster available for beta testers on an
opt-in basis
* 2020-01-24: 2020 Kubernetes cluster general availability for
migration on an opt-in basis
* 2020-02-10: Automatic migration of remaining workloads from 2016
cluster to 2020 cluster by Toolforge admins

We announced beta testing for this new cluster on 2020-01-09 [1].
Since then more than 70 tools have migrated, with approximately 110
tools now using it [2]. The Toolforge admins have also fixed a few
small issues that our early testers noticed. We are now ready and
excited to have many more tools move their workloads from the legacy
Kubernetes cluster over to the new 2020 Kubernetes cluster.

Thanks to Legoktm, Magnus, and others who helped during the beta
testing phase by trying things out and reporting issues that they
found.

For most tools the migration requires a small number of manual steps [0]:
* webservice stop
* kubectl config use-context toolforge
* alias kubectl=/usr/bin/kubectl; echo "alias kubectl=/usr/bin/kubectl" >> $HOME/.profile
* webservice --backend=kubernetes [TYPE] start

This could also be a good opportunity for tools to upgrade to newer
language runtimes such as php7.3 and python3.7. See the list on
Wikitech [3] for currently available types. When upgrading to a new
runtime, do not forget to rebuild Python virtual environments, NPM
packages, or Composer packages if you are using them as well.
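
For example, a Python tool moving to the python3.7 type might rebuild
its virtual environment inside the new container so that compiled
packages match the new runtime. The venv and requirements paths below
are conventional examples, not requirements of the platform:

  # Open a shell inside the new runtime's container, then rebuild.
  $ webservice --backend=kubernetes python3.7 shell
  $ python3 -m venv --clear $HOME/www/python/venv
  $ $HOME/www/python/venv/bin/pip install -r $HOME/www/python/src/requirements.txt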

[0]: https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration
[1]: 
https://lists.wikimedia.org/pipermail/cloud-announce/2020-January/000247.html
[2]: https://tools.wmflabs.org/k8s-status/
[3]: 
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes#Available_container_types

Bryan (on behalf of the Toolforge admins and the Cloud Services team)
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] [Toolforge] New Kubernetes cluster open for beta testers

2020-01-09 Thread Bryan Davis
I am happy to announce that a new and improved Kubernetes cluster is
now available for use by beta testers on an opt-in basis. A page has
been created on Wikitech [0] outlining the self-service migration
process.

Timeline:
* 2020-01-09: 2020 Kubernetes cluster available for beta testers on an
opt-in basis
* 2020-01-23: 2020 Kubernetes cluster general availability for
migration on an opt-in basis
* 2020-02-10: Automatic migration of remaining workloads from 2016
cluster to 2020 cluster by Toolforge admins

This new cluster has been a work in progress for more than a year
within the Wikimedia Cloud Services team, and a top priority project
for the past six months. About 35 tools, including
https://tools.wmflabs.org/admin/, are currently running on what we are
calling the "2020 Kubernetes cluster". This new cluster is running
Kubernetes v1.15.6 and Docker 19.03.4. It is also using a newer
authentication and authorization method (RBAC), a new ingress routing
service, and a different method of integrating with the Developer
account LDAP service. We have built a new tool [1] which makes the
state of the Kubernetes cluster more transparent and on par with the
information that we already expose for the grid engine cluster [2].

With a significant number of tools managed by Toolforge administrators
already migrated to the new cluster, we are fairly confident that the
basic features used by most Kubernetes tools are covered. It is likely
that a few outlying issues remain to be found as more tools move, but
we have confidence that we can address them quickly. This has led us
to propose a fairly short period of voluntary beta testing, followed
by a short general availability opt-in migration period, and finally a
complete migration of all remaining tools which will be done by the
Toolforge administration team for anyone who has not migrated
themselves.

Please help with beta testing if you have some time, and be willing to
ask for help on IRC, Phabricator, and the cl...@lists.wikimedia.org
mailing list for any early adopter issues you encounter.

I want to publicly praise Brooke Storm and Arturo Borrero González for
the hours that they have put into reading docs, building proof of
concept clusters, and improving automation and processes to make the
2020 Kubernetes cluster possible. The Toolforge community can look
forward to more frequent and less disruptive software upgrades in this
cluster as a direct result of this work. We have some other feature
improvements in planning now that I think you will all be excited to
see and use later this year!

[0]: https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration
[1]: https://tools.wmflabs.org/k8s-status/
[2]: https://tools.wmflabs.org/sge-status/

Bryan (on behalf of the Toolforge admins and the Cloud Services team)
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] [Wiki replicas] query killer time limit reduced to 1 hour for *.analytics.db.svc.eqiad.wmflabs

2019-10-08 Thread Bryan Davis
The *.analytics.db.svc.eqiad.wmflabs database servers have been
experiencing some stability issues in the last two to three weeks that
we have reason to believe are related to query volume. The DBA team at
the Wikimedia Foundation is looking into various changes that may help
with these problems including software upgrades for our MariaDB
deployments.

Today we took an initial step of reducing the maximum time allowed for
a query to complete on the *.analytics.db.svc.eqiad.wmflabs
hosts to 1 hour. We were using an upper limit of 4 hours previously.
Our hope is that this change will relieve some stress on the shared
servers and allow us more time to look into other changes to restore
stability. Ideally we will be able to increase the limit again after
making other changes to these systems.

Bryan, on behalf of the Cloud Services team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] [Toolforge] Lots and lots of cronspam, probably safe to ignore

2019-10-07 Thread Bryan Davis
We had issues with NFS mounts in Toolforge as an unintended side
effect of OpenStack upgrades earlier today. While NFS mounts were
broken, many (all?) cronjobs failed. The emails telling maintainers
about these failures did not go out in real time because the mail
system also relies on NFS. When we fixed the mail server thousands of
queued messages started going out. There are still a few emails
sending as I write this, but the queue is almost empty.

As far as we can tell, cronjobs are starting as expected now and the
emails are just left over noise.

Bryan
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] Debian Jessie deprecation plans

2019-09-11 Thread Bryan Davis
On June 30, 2020 the Debian project will stop providing security patch
support for the Debian 8 "Jessie" release. The Cloud Services and SRE
teams at the Wikimedia Foundation would like to have all usage of
Debian Jessie in our managed networks replaced with newer versions of
Debian's operating system on or ideally well before that date.

A page has been created on Wikitech [0] with an initial timeline for
the removal of all Debian Jessie instances from Cloud VPS projects.
This timeline follows roughly the same schedule as we used in 2018
when deprecating Ubuntu Trusty in Cloud VPS projects:

* September 2019: Announce the initiative via this email and the Wikitech page
* October 2019: Start actively contacting instance maintainers who
need to migrate to a new OS
* November & December 2019: Continue to work with instance maintainers
to migrate to a new OS
* January 2020: Shut down remaining Debian Jessie instances

If you know that your Cloud VPS project is using Debian Jessie, you
can get a head start on migrating your instances to Debian Buster
(preferred) or Stretch by visiting the Wikitech page and reading the
instructions there.

If you are a concerned Toolforge user, stay tuned for future
announcements about changes that will be made as the Toolforge admin
team works to remove Debian Jessie from that environment. For now
there is nothing an individual Tool maintainer needs to do.

[0]: https://wikitech.wikimedia.org/wiki/News/Jessie_deprecation

Bryan - on behalf of the Cloud VPS admin team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] Toolforge Kubernetes disrupted from 2019-09-10T18:54 to 2019-09-11T01:30

2019-09-10 Thread Bryan Davis
We need to do a proper incident report, but I wanted to send out a
(late) notice that the Toolforge Kubernetes cluster was at best
degraded and at worst completely broken from 2019-09-10T18:54 to
2019-09-11T01:30.

The TL;DR is that some change, likely part of T171188 ("Move the main
WMCS puppetmaster into the Labs realm"), tricked Puppet into installing
an old version of the x509 signing cert used to secure communication
between the etcd cluster and kube-apiserver. This manifested in an
alert from our monitoring system of the Kubernetes api being broken.
When investigating that alert we found that the kube-apiserver was
unable to connect to its paired etcd cluster. The etcd cluster seemed
to be flapping internally (status showing good, then failed, then good
again). Diagnosing the cause of this flapping resulted in a complete
failure of the etcd cluster. Restoring the etcd cluster was a long and
difficult task. Once etcd was recovered, it took about 1.5 more hours
to find the cause and fix for the initial communication errors (the
wrong x509 signing certificate). It is currently unclear if the x509
misconfiguration also caused the etcd cluster failure, or if that was
an unrelated and unfortunate coincidence.

See https://phabricator.wikimedia.org/T232536 for follow up
documentation (when we write it during the coming US business day).

Bryan - on behalf of the Toolforge admin team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] DNS errors early 2019-09-09 UTC may have affected jobs and servers

2019-09-08 Thread Bryan Davis
The DNS recursor servers which are used from inside Cloud VPS and
Toolforge to resolve both internal and external hostnames to IP
address were not functional from approximately 2019-09-09T00:51 UTC to
2019-09-09T01:35 UTC. During this time, most (if not all) DNS lookups
would have returned a "SERVFAIL" response. The issue appears to be
resolved now.

We will share more information about what happened and how the problem
was corrected when we are sure that doing so will not cause additional
issues.

Bryan, on behalf of the Cloud VPS admin team
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] Docker images used by Toolforge's Kubernetes cluster updated

2019-08-29 Thread Bryan Davis
Today I rebuilt the Docker images that are used by the `webservice
--backend=kubernetes` command. This is actually a normal thing that we
do periodically in Toolforge to ensure that security patches are
applied in the containers. This round of updates was a bit different
however in that it is the first time the Debian Jessie based images
have been rebuilt since the upstream Debian project removed the
'jessie-backports' apt repo.

Everything should be fine, but if you see weirdness when restarting a
webservice or other Kubernetes pod that looks like it could be related
to software in the Docker image please let myself or one of the
Toolforge admins know by either filing a Phabricator bug report or for
faster response joining the #wikimedia-cloud IRC channel on Freenode
and sending a "!help " message to the channel explaining your
issue.

Bryan - on behalf of the Toolforge admins
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808



[Cloud-announce] [Mediawiki-api-announce] BREAKING CHANGE: Improved timestamp support

2019-06-21 Thread Bryan Davis
Cross-posting from the mediawiki-api-annou...@lists.wikimedia.org list.

-- Forwarded message -
From: Brad Jorsch (Anomie) 
Date: Fri, Jun 21, 2019 at 8:30 AM
Subject: [Mediawiki-api-announce] BREAKING CHANGE: Improved timestamp support
To: 


An upgrade to the timestamp library used by MediaWiki is resulting in
two changes to the handling of timestamp inputs to the action API.
There will be no change to timestamps output by the API.

All of these changes should be deployed to Wikimedia wikis with 1.34.0-wmf.10.

Historically MediaWiki has ignored timezones in supported timestamp
formats that include them, treating the values as if the timezone
specified were UTC. In the future, specified timezones will be honored
(and converted to UTC).
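
As an illustration (not part of the original announcement), a revisions
query that passes a timestamp with a +02:00 offset will now have that
offset honored, so the rvstart value below is treated as 10:00 UTC
rather than 12:00 UTC. Note that the '+' must be URL-encoded as %2B:

  $ curl 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&rvstart=2019-05-22T12:00:00%2B02:00&format=json'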

Historically some invalid formats were accepted, such as
"2019-05-22T12:00:00.1257" or "Wed, 22 May 2019 12:00:00 A
potato". Due to improved validation, these will no longer be accepted.

Support for ISO 8601 and other formats has also been improved. See
https://www.mediawiki.org/wiki/Timestamp for details on the formats
that will be supported.
___
Mediawiki-api-announce mailing list
mediawiki-api-annou...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


[Cloud-announce] [Toolforge] Node.js 10 and Python 3.5 types now available for use by Kubernetes webservices

2019-05-19 Thread Bryan Davis
Good news from the Wikimedia Hackathon in Prague! We now have some
newer language runtimes for Node.js and Python3 available for
Kubernetes webservices. These newer versions match the versions that
were added for grid engine webservices when we upgraded to Debian
Stretch.

These new versions are available in parallel with the older Node.js
6.11 and Python 3.4 versions. This will be the pattern used in the
future when we add newer language runtime versions, so that
migrations are a bit easier for all existing users. The new type names
are:

* node10
* python3.5

== Node.js 10 ==

  $ webservice --backend=kubernetes node10 shell
  Defaulting container name to interactive.
  Use 'kubectl describe pod/interactive -n bd808-test' to see all of
the containers in this pod.
  If you don't see a command prompt, try pressing enter.
  $ nodejs --version
  v10.4.0
  $ npm --version
  6.5.0
  $ logout
  Session ended, resume using 'kubectl attach interactive -c
interactive -i -t' command when the pod is running
  Pod stopped. Session cannot be resumed.

== Python 3.5 ==

  $ webservice --backend=kubernetes python3.5 shell
  Defaulting container name to interactive.
  Use 'kubectl describe pod/interactive -n bd808-test' to see all of
the containers in this pod.
  If you don't see a command prompt, try pressing enter.
  $ python3 --version
  Python 3.5.3
  $ logout
  Session ended, resume using 'kubectl attach interactive -c
interactive -i -t' command when the pod is running
  Pod stopped. Session cannot be resumed.


Bryan, on behalf of the Toolforge admin team
-- 
Bryan Davis  Wikimedia Foundation
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement   Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


[Cloud-announce] [Breaking change] Important for Wikidata tools maintainers: wb_terms table to be dropped at the end of May

2019-04-24 Thread Bryan Davis
The 'wb_terms' table is being removed from the Wiki Replica databases.
Please see Léa Lacroix's post on the wikidata mailing list [0] for
additional details.

TL;DR summary:
* May-June 2019, the Wikidata development team will drop the wb_terms
table from the database in favor of a new optimized schema
* Migration will start on 2019-05-29
* A test system will be available starting 2019-05-15
* Details are available in Phabricator [1]
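
As a rough illustration of what the change will mean for queries,
fetching the English label of an item would move from a single
wb_terms lookup to a join across the new normalized tables. This is a
sketch based on the wbt_* schema described in the Phabricator task;
the task is the authoritative source and the final table and column
names may differ:

  $ mysql --defaults-file=$HOME/replica.my.cnf \
      -h wikidatawiki.analytics.db.svc.eqiad.wmflabs wikidatawiki_p -e "
      SELECT wbx_text
      FROM wbt_item_terms
      JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id
      JOIN wbt_type ON wbtl_type_id = wby_id
      JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
      JOIN wbt_text ON wbxl_text_id = wbx_id
      WHERE wbit_item_id = 42
        AND wby_name = 'label'
        AND wbxl_language = 'en'"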

[0]: https://lists.wikimedia.org/pipermail/wikidata/2019-April/012987.html
[1]: https://phabricator.wikimedia.org/T221764

Bryan
-- 
Bryan Davis  Wikimedia Foundation
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement   Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


[Cloud-announce] Change to Wikitech logins: Username now case-sensitive

2019-04-15 Thread Bryan Davis
A change was deployed to the Wikitech config 2019-04-15T23:16 UTC
which prevents users from logging into the wiki with a username that
differs in case from the 'cn' value for their developer account.

This change is not expected to cause problems for most users, but
there may be some people who have historically entered a username with
mismatched case (for example "bryandavis" instead of "BryanDavis") and
relied on MediaWiki and the LdapAuthentication plugin figuring things
out. This will no longer happen automatically. These users will need
to update their password managers (or brains if they are not using a
password manager) to supply the username with correct casing.

The "wrongpassword" error message on Wikitech has been updated with a
local override to help people discover this problem. See
<https://phabricator.wikimedia.org/T165795> for more details.

Bryan, on behalf of the Cloud Services team
-- 
Bryan Davis  Wikimedia Foundation
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement   Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


[Cloud-announce] [Toolforge] Trusty job grid has been shutdown

2019-03-25 Thread Bryan Davis
The legacy Ubuntu Trusty grid engine job grid has been shut down!
Thanks to everyone who was involved in migrating existing tools from
the old grid to the Kubernetes cluster or the new Debian Stretch job
grid.

There were still 385 tools that may have been running jobs or
webservices on the Trusty grid at the time of shutdown. A static list
of these tools is preserved at
<https://tools.wmflabs.org/trusty-tools/>.

Instructions are still available at
<https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation>
for migrating tools that are currently down.

Bryan, on behalf of the Toolforge admin team
-- 
Bryan Davis  Wikimedia Foundation
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement   Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


Re: [Cloud-announce] [Toolforge] Ubuntu Trusty job grid will be shutdown Monday 2019-03-25

2019-03-25 Thread Bryan Davis
On Fri, Mar 22, 2019 at 2:38 PM Bryan Davis  wrote:
>
> As previously announced on this list [0][1] we are in the process of
> replacing the old Ubuntu Trusty instances in Toolforge with fancy new
> Debian Stretch instances.
>
> This process is reaching its next major milestone on Monday
> 2019-03-25. During the general US workday on that date (14:00-00:00
> UTC) the Toolforge admin team will be dismantling the legacy Ubuntu
> Trusty job grid. Any tools that have not migrated to either the
> Stretch grid or the Kubernetes cluster at that point will be forcibly
> shut down. Nothing will be deleted in the tools' $HOME directories, but
> any Trusty grid jobs will be stopped. Any crontab file remaining on
> the old grid's cron server will be archived as
> "$HOME/crontab.trusty.save". Maintainers who somehow missed all of the
> announcements will be able to log in and restart their tools on the
> Stretch grid or Kubernetes.
>
> See <https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation>
> for additional information and tips on common problems that have been
> found thus far.
>
> [0]: 
> https://lists.wikimedia.org/pipermail/cloud-announce/2019-January/000122.html
> [1]: 
> https://lists.wikimedia.org/pipermail/cloud-announce/2019-March/000142.html

The steps of shutting down the Trusty grid are starting now.

As documented above, the remaining crontab files will be archived to
each tool's $HOME directory to make it easier to restore function for
tools which have not yet migrated.

Bryan
-- 
Bryan Davis  Wikimedia Foundation
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement   Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


[Cloud-announce] [Toolforge] Ubuntu Trusty job grid will be shutdown Monday 2019-03-25

2019-03-22 Thread Bryan Davis
As previously announced on this list [0][1] we are in the process of
replacing the old Ubuntu Trusty instances in Toolforge with fancy new
Debian Stretch instances.

This process is reaching its next major milestone on Monday
2019-03-25. During the general US workday on that date (14:00-00:00
UTC) the Toolforge admin team will be dismantling the legacy Ubuntu
Trusty job grid. Any tools that have not migrated to either the
Stretch grid or the Kubernetes cluster at that point will be forcibly
shut down. Nothing will be deleted in the tools' $HOME directories, but
any Trusty grid jobs will be stopped. Any crontab file remaining on
the old grid's cron server will be archived as
"$HOME/crontab.trusty.save". Maintainers who somehow missed all of the
announcements will be able to log in and restart their tools on the
Stretch grid or Kubernetes.

See <https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation>
for additional information and tips on common problems that have been
found thus far.

[0]: 
https://lists.wikimedia.org/pipermail/cloud-announce/2019-January/000122.html
[1]: https://lists.wikimedia.org/pipermail/cloud-announce/2019-March/000142.html

Bryan, on behalf of the Toolforge admin team
-- 
Bryan Davis  Wikimedia Foundation
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement   Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


[Cloud-announce] Toolforge: Bastion changes and Trusty deprecation final steps

2019-03-07 Thread Bryan Davis
As announced previously on this list [0] we are in the process of
replacing the old Ubuntu Trusty instances in Toolforge with fancy new
Debian Stretch instances.

== Remaining timeline ==
* Week of 2019-03-04: Switch login.tools.wmflabs.org to point to Stretch bastion
* Week of 2019-03-25: Shut down Trusty grid

The DNS entry for "login.tools.wmflabs.org" will be updated to point
to a Debian Stretch bastion rather than the old Ubuntu Trusty bastion
soon (like right after I send this email). This change will cause many
ssh clients to alert about a change in the ssh host fingerprint.
Updated fingerprints will be posted on wikitech [1][2] once the switch
has been made.
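
If your ssh client refuses to connect because of the changed host key,
you can drop the stale entry and verify the new fingerprint against the
wikitech page [1] before accepting it. A minimal sketch, assuming your
known_hosts file is in the default location:

  $ ssh-keygen -R login.tools.wmflabs.org
  $ # on the next connection, compare the offered fingerprint to [1]
  $ # before answering 'yes':
  $ ssh login.tools.wmflabs.org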

The legacy Ubuntu Trusty bastion will still be reachable as
"login-trusty.tools.wmflabs.org" until that instance is deleted during
the week of 2019-03-25.

In just over 2 weeks we will be shutting down the Trusty grid for
good. Any tools that have not migrated to either the Stretch grid or
the Kubernetes cluster at that point will be forcibly shut down.
Nothing will be deleted in the tools' $HOME directories, but any
Trusty grid jobs will be stopped. Any crontab file remaining on the
old grid's cron server will be archived as
"$HOME/crontab.trusty.save". Maintainers who somehow missed all of the
announcements will be able to log in and restart their tools on the
Stretch grid or Kubernetes.
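
For example, a maintainer who discovers their tool stopped after the
shutdown could restore the archived crontab from a Stretch bastion
roughly like this (a sketch with a hypothetical tool name; review the
file first, since paths and job parameters may need updating for the
new grid):

  $ become mytool
  $ crontab $HOME/crontab.trusty.save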

See <https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation>
for additional information and tips on common problems that have been
found thus far.

[0]: 
https://lists.wikimedia.org/pipermail/cloud-announce/2019-January/000122.html
[1]: 
https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/login.tools.wmflabs.org
[2]: 
https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/tools-dev.wmflabs.org

Bryan, on behalf of the Toolforge admin team
-- 
Bryan Davis  Wikimedia Foundation
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement   Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


[Cloud-announce] DSA (ssh-dss) SSH keys deprecated for Cloud VPS and Toolforge users

2019-03-01 Thread Bryan Davis
OpenSSH 7.0, released 2015-08-11, deprecated the use of DSA (ssh-dss)
keys and RSA keys smaller than 1024 bits [0]. We have been applying
some backwards compatibility configuration changes to ssh bastion
servers in both Cloud VPS and Toolforge for some time to continue to
support old keys using these deprecated algorithms. I was supposed to
announce this to the community about 1.5 years ago, but apparently I
did not [1].

We have noticed with the introduction of Debian Stretch ssh bastion
servers running OpenSSH 7.4 that users with DSA keys (and possibly
short RSA keys) are being denied access by the newer software. The
easiest fix for this is for users to generate new keys and upload
their new public key using the form at
<https://toolsadmin.wikimedia.org/profile/settings/ssh-keys> or
<https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-openstack>.

We currently recommend using either ed25519 or 4096-bit RSA keys. See
<https://wikitech.wikimedia.org/wiki/Production_shell_access#Generating_your_SSH_key>
for more information.
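
Generating a replacement key only takes a moment. A minimal sketch (the
comment string and file paths are illustrative):

  $ # Preferred: an ed25519 key
  $ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -C "your-developer-account"
  $ # Alternative: a 4096-bit RSA key
  $ ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -C "your-developer-account"

Afterwards, upload the contents of the matching .pub file using one of
the forms linked above.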


[0]: https://www.openssh.com/txt/release-7.0
[1]: https://phabricator.wikimedia.org/T168433

Bryan, on behalf of the Wikimedia Cloud Services team
-- 
Bryan Davis  Wikimedia Foundation
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement   Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


[Cloud-announce] [Toolforge] max_user_connections = 20 set for ToolsDB

2019-02-25 Thread Bryan Davis
During the flurry of activity we had recently in diagnosing and fixing
problems with the shared ToolsDB MariaDB service [0], we made a
configuration change to place a hard limit on the maximum number of
simultaneous connections permitted for each user account [1][2].

The current limit is set at 20 concurrent connections. This should not
cause any problems for a typical webservice or single script using
ToolsDB, but tools making heavy use of ToolsDB may need to make some
adjustments.
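
You can confirm the limit, and see which connections your tool's
credential user currently holds, from a Toolforge bastion. A sketch,
assuming the standard credentials file in your tool's $HOME:

  $ mysql --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.eqiad.wmflabs \
      -e 'SELECT @@max_user_connections; SHOW PROCESSLIST'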

As always, tool maintainers can seek advice on dealing with this limit
or other issues in Toolforge from the Toolforge administration team
and others in the community via our Freenode IRC channel
(#wikimedia-cloud), Phabricator tasks, and the
cl...@lists.wikimedia.org mailing list.


[0]: https://phabricator.wikimedia.org/T216208
[1]: https://phabricator.wikimedia.org/T216170
[2]: 
https://mariadb.com/kb/en/library/server-system-variables/#max_user_connections

Bryan, on behalf of the Toolforge administration team
-- 
Bryan Davis  Wikimedia Foundation
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement   Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


[Cloud-announce] Toolforge: Trusty deprecation and grid engine migration

2019-01-11 Thread Bryan Davis
Ubuntu Trusty was released in April 2014, and support for it
(including security updates) will cease in April 2019. We need to shut
down all Trusty hosts before the end of support date to ensure that
Toolforge remains a secure platform. This migration will take several
months because many people still use the Trusty hosts and our users
are working on tools in their spare time.

== Initial timeline ==
Subject to change, see Wikitech[0] for living timeline.

* 2019-01-11: Availability of Debian Stretch grid announced to community
* Week of 2019-02-04: Weekly reminders via email to tool maintainers
for tools still running on Trusty
* Week of 2019-03-04:
** Daily reminders via email to tool maintainers for tools still
running on Trusty
** Switch login.tools.wmflabs.org to point to Stretch bastion
* Week of 2019-03-18: Evaluate migration status and formulate plan for
final shutdown of Trusty grid
* Week of 2019-03-25: Shut down Trusty grid

== What is changing? ==
* New job grid running Son of Grid Engine on Debian Stretch instances
* New limits on concurrent job execution and job submission by a single tool
* New bastion hosts running Debian Stretch with connectivity to the new job grid
* New versions of PHP, Python2, Python3, and other language runtimes
* New versions of various support libraries

== What should I do? ==
The Cloud Services team has created the Toolforge Trusty
deprecation[1] page on wikitech.wikimedia.org to document basic steps
needed to move webservices, cron jobs, and continuous jobs from the
old Trusty grid to the new Stretch grid. That page also provides more
details on the language runtime and library version changes and will
provide answers to common problems people encounter as we find them.
If the answer to your problem isn't on the wiki, ask for help in the
#wikimedia-cloud IRC channel or file a bug in Phabricator[2].
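
A quick way to see whether your tool still has jobs on the Trusty grid
is to log in to a Trusty bastion and list its grid jobs (a sketch, with
a hypothetical tool name):

  $ become mytool
  $ qstat
  $ # no output means no running or queued jobs for this tool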


[0]: 
https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#Timeline
[1]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation
[2]: 
https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?title=Stretch+grid+problem%3A+(your+description)==While+migrating+(my+tool)+to+the+Stretch+job+grid+I+ran+into+this+problem%3A=toolforge%2C+cloud-services-team=triage

Bryan
-- 
Bryan Davis  Wikimedia Foundation
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement   Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


[Cloud-announce] Addshore and Legoktm granted Toolforge admin rights

2018-07-29 Thread Bryan Davis
I am happy to announce that Addshore and Legoktm have been granted
admin (root) level privileges in the Toolforge project. Both are
long-time users of Toolforge and prolific contributors to Wikimedia's
technical spaces. One of the projects they are hoping to help us with
is bringing newer versions of PHP to the Toolforge Kubernetes cluster
[0].

[0]: https://phabricator.wikimedia.org/T195689

Bryan
-- 
Bryan Davis  Wikimedia Foundation
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement   Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


Re: [Cloud-announce] c1.labsdb (labsdb1001) hardware failure

2017-11-02 Thread Bryan Davis
On Wed, Nov 1, 2017 at 9:42 AM, Bryan Davis <bd...@wikimedia.org> wrote:
> TL;DR:
> * c1.labsdb (labsdb1001.eqiad.wmnet) is down due to hardware issues
> * *.labsdb are pointing to c3.labsdb (labsdb1003.eqiad.wmnet)

Manuel (one of our awesome DBAs) was able to get the MySQL server for
c1.labsdb back up and running in read-only mode.

We do not know how much longer the failing SSD drive will survive, but
until it fails again, users who have user-created databases on this
host can attempt to connect and archive them. If the data you have
stored there is something that you can recreate, you should probably
do that instead.

Please move non-reproducible data to tools.db.svc.eqiad.wmflabs.
c3.labsdb (labsdb1003.eqiad.wmnet) will be shut down on Wednesday 13
December 2017, so moving any data to c3 will gain you less than 6
weeks. You should also be working to update your tools that need to
write data to a database to use tools.db.svc.eqiad.wmflabs. See
<https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown>
for more details.
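
Archiving a user-created database off of c1.labsdb can be as simple as
a dump and reload. A rough sketch with a hypothetical database name
(user-created databases follow the credentialuser__dbname pattern):

  $ mysqldump --defaults-file=$HOME/replica.my.cnf -h c1.labsdb \
      s12345__mydb > mydb.sql
  $ mysql --defaults-file=$HOME/replica.my.cnf \
      -h tools.db.svc.eqiad.wmflabs -e 'CREATE DATABASE s12345__mydb'
  $ mysql --defaults-file=$HOME/replica.my.cnf \
      -h tools.db.svc.eqiad.wmflabs s12345__mydb < mydb.sql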

Due to the failure of labsdb1001.eqiad.wmnet, the planned reboot of
labsdb1003.eqiad.wmnet on Tuesday 07 November 2017 has been cancelled.

Bryan (on behalf of the Cloud Services and DBA teams)
-- 
Bryan Davis  Wikimedia Foundation   <bd...@wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services  Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


[Cloud-announce] c1.labsdb (labsdb1001) hardware failure

2017-11-01 Thread Bryan Davis
TL;DR:
* c1.labsdb (labsdb1001.eqiad.wmnet) is down due to hardware issues
* *.labsdb are pointing to c3.labsdb (labsdb1003.eqiad.wmnet)

The physical server behind c1.labsdb (labsdb1001.eqiad.wmnet)
experienced a hard drive failure around 2017-11-01T03:30 UTC. This
failure is preventing the MySQL service on that host from starting.
The *.labsdb service names that were pointed at that server have been
updated to point to c3.labsdb (labsdb1003.eqiad.wmnet) instead.

See <https://phabricator.wikimedia.org/T179464> for more information
and additional updates.

Expect slower than normal performance as all traffic is handled by a
single server. Now would be a great time to update the configuration
for your tools to use the new database cluster [0][1].

[0]: 
https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/
[1]: https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown

Bryan
-- 
Bryan Davis  Wikimedia Foundation   <bd...@wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services  Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce


[Cloud-announce] Wiki Replica c1.labsdb to be rebooted Monday 2017-10-30 14:30 UTC

2017-10-27 Thread Bryan Davis
labsdb1001.eqiad.wmnet (aka c1.labsdb) will be rebooted at 2017-10-30
14:30 UTC for kernel updates
(<https://phabricator.wikimedia.org/T168584>).

Normal usage of the *.labsdb databases should experience only limited
interruption as DNS is changed to point to labsdb1003.eqiad.wmnet
(aka c3.labsdb). The c1.labsdb service name will *not* be updated
however, so tools hardcoded to that service name will be interrupted
until the reboot is complete.
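
If you are not sure whether a tool hardcodes the c1.labsdb name, one
easy spot check from inside Cloud VPS or Toolforge is to compare what
the service names resolve to before and after the window:

  $ dig +short c1.labsdb
  $ dig +short enwiki.labsdb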

There is a possibility of catastrophic hardware failure in this
reboot. There will be no way to recover the server or the data it
currently hosts if that happens. Tools that are hosting self-created
data on c1.labsdb *will* lose that data if there is hardware failure.
If you are unsure whether your tool is hosting data on c1.labsdb, you can
check at <https://tools.wmflabs.org/tool-db-usage/>.

This reboot is an intermediate step before the complete shutdown of
the server on Wednesday 2017-12-13. See
<https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown>
for more information.

Bryan (on behalf of the Wikimedia Cloud Services and DBA teams)
-- 
Bryan Davis  Wikimedia Foundation   <bd...@wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services  Boise, ID USA
irc: bd808   v:415.839.6885 x6855

___
Wikimedia Cloud Services announce mailing list
Cloud-announce@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce