TL;DR:
* https://toolsadmin.wikimedia.org now allows marking a tool as "disabled".
* Disabling a tool will immediately stop any running jobs, including
webservices, and prevent maintainers from logging in as the tool.
* Disabled tools are archived and deleted after 40 days.
* Disabled tools can be re-enabled at any time prior to being archived
and deleted.

"How can I delete a tool that I no longer want?" is a question that
folks have been asking for a very long time. I know of Phabricator
tasks going back to at least April 2016 [0] tracking such requests. A
bit over 5 years ago I created a Phabricator task to track figuring
out how to delete an unused tool [1]. Nearly 18 months ago Andrew
Bogott started to look into how we could automate the checklist of
cleanup steps that had been developed. By January 2022 Andrew had
implemented all of the pieces needed to complete the checklist. This came
with a command line tool that Toolforge admins have been able to use
to delete a tool. Today we have released updates to Striker
(<https://toolsadmin.wikimedia.org>) which finally expose a "disable
tool" button to a tool's maintainers [2].

When a tool is marked as disabled, any running jobs it has on the Grid
Engine or Kubernetes backends are stopped. Changes are also made so
that new jobs cannot be started, any crontab file is archived, and
maintainers are prevented from using `become <tool>`. Normally things
stay in this state for 40 days to give everyone a chance to change
their minds and re-enable the tool. Once the 40 day timer expires, the
system will proceed with cleanup tasks that are more difficult to
reverse including archiving and deleting the tool's $HOME and ToolsDB
databases. Ultimately the tool's group and user are deleted from the
LDAP directory which functionally completes the process.
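
As a rough illustration of the timeline described above, here is a
minimal Python sketch of how the grace period could be evaluated. This
is not the actual Striker or disable-tool code. The function name
`tool_state`, the state labels, and the choice to treat exactly 40 days
as expired are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# 40 days is the grace period stated in this announcement.
GRACE_PERIOD = timedelta(days=40)


def tool_state(disabled_at: Optional[datetime], now: datetime) -> str:
    """Classify a tool by how long ago it was disabled.

    Hypothetical helper: the state names are illustrative only.
    """
    if disabled_at is None:
        # Never disabled, or re-enabled before the timer expired.
        return "enabled"
    if now - disabled_at < GRACE_PERIOD:
        # Jobs are stopped, but the tool can still be re-enabled.
        return "disabled"
    # Timer expired: archive and delete steps may proceed.
    return "pending-deletion"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(tool_state(now - timedelta(days=10), now))
```

The point of the sketch is that re-enabling only has to clear the
"disabled at" timestamp as long as the harder to reverse steps have not
started.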

A lot of system administration tasks are kind of boring, but this work
turned out to be actually pretty interesting. A Toolforge tool can
include quite a number of different parts. There can be jobs running
on the Grid Engine and/or Kubernetes, a crontab to start jobs
periodically, a database in ToolsDB, credentials for accessing the
Wiki Replicas, credentials for accessing the Toolforge Elasticsearch
cluster, a $HOME directory on the Toolforge NFS server, and account
information in the LDAP directory that powers Developer accounts and
Cloud VPS credentials. All of these things would ideally be removed
when a tool was successfully deleted. Some of them are things that we
would like to create historical archives of in case someone wanted to
recreate the tool's functionality. And in a perfect world we would
also be able to change our minds and start the tool back up if things
had not progressed to fully deleting the tool.

Andrew came up with a fairly elegant system to deal with this
complexity. He designed a series of processes which are each
responsible for a slice of the overall complexity. A process running
on the Grid controller is responsible for stopping running Grid Engine
jobs and changing the tool's quota so that no new jobs can be started.
A process running on the Crontab server archives the tool's crontab
configuration. A process running on the Kubernetes controller deletes
the tool's credentials for accessing the Kubernetes cluster, the
tool's namespace, and by extension removes all processes running in
the namespace. A process running on the NFS controller archives the
tool's $HOME directory contents and deletes the directory. It also
removes the tool from other LDAP membership lists (a tool can be a
co-maintainer of another tool) and deletes the tool's user and group
from the LDAP directory. A process archives ToolsDB tables. Another
process removes the tool's database credentials across the ToolsDB and
Wiki Replicas server pools. Many of these processes are implemented in
cloud/toolforge/disable-tool on Gerrit [3]. Others were added to
existing management controllers for creating Kubernetes and database
credentials. The processes all take cues from the LDAP directory and
tracking files in the tool's $HOME to create an eventually consistent,
decoupled collection of cleanup actions.
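
To give a feel for how tracking files can let independent processes
converge on the same end state, here is a minimal Python sketch of one
idempotent cleanup step. It is an illustration only, not the code in
cloud/toolforge/disable-tool. The `run_step` helper and the marker file
naming are hypothetical.

```python
from pathlib import Path
from typing import Callable


def run_step(tool_home: Path, step: str, action: Callable[[], None]) -> bool:
    """Run one cleanup action at most once per tool.

    A marker file in the tool's home directory records that the step
    finished. The file name pattern is an assumption for this sketch.
    Returns True if the action ran, False if it was already done.
    """
    marker = tool_home / f".disable-{step}.done"
    if marker.exists():
        # Another run already completed this step; nothing to do.
        return False
    action()
    # Only record success after the action returns without raising,
    # so a failed step is retried on the next pass.
    marker.write_text("done\n")
    return True
```

Because each process checks only its own marker, the processes can run
on different hosts and in any order, and rerunning any of them is
harmless.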

We still have some work to do to update documentation on wikitech and
Phabricator so that folks know where to find the new buttons. If you
find documentation that needs to be updated before someone else gets
to it, please feel empowered to be [[WP:BOLD]] and update it.

[0]: https://phabricator.wikimedia.org/T133777
[1]: https://phabricator.wikimedia.org/T170355
[2]: https://phabricator.wikimedia.org/T285403
[3]: https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/toolforge/disable-tool/
[[WP:BOLD]]: https://en.wikipedia.org/wiki/Wikipedia:Be_bold

Bryan, on behalf of the Toolforge administration team
-- 
Bryan Davis              Technical Engagement      Wikimedia Foundation
Principal Software Engineer                               Boise, ID USA
[[m:User:BDavis_(WMF)]]                                      irc: bd808
_______________________________________________
Cloud-announce mailing list -- cloud-announce@lists.wikimedia.org
List information: 
https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/
