[Wikitech-l] Re: Infrastructure diagrams

2022-10-24 Thread Selena Deckelmann
Thanks so much, Timo!  Having these snapshots of our systems and
infrastructure, and also capturing how it evolves over time, is invaluable.

-selena

On Mon, Oct 24, 2022 at 6:51 AM Krinkle wrote:

> I've done a major update to a number of diagrams on Wikitech.
>
> Usually, I don't mention an update here, but I'm highlighting it now as
> it's been a while since we mentioned them on-list and the community and
> foundation have grown a lot so some of these may be new to you.
>
> Given how much has changed in recent years, I also included a changelog
> and a link to where in the docs you'd normally discover this diagram
> on-wiki:
>
> *== 1. File:Wikipedia_webrequest_2022.png (Updated) ==*
>
> This is a highly simplified diagram, covering the general shape of our
> stack through the example of a typical Wikipedia webrequest.
>
> Previous:
> https://upload.wikimedia.org/wikipedia/commons/b/b3/Wikipedia_webrequest_flow_2020.png
> New:
> https://upload.wikimedia.org/wikipedia/commons/4/4d/Wikipedia_webrequest_2022.png
> Documentation: wikitech:MediaWiki_at_WMF and wikitech:Caching_overview.
> Notable changes:
> * Change edge TLS termination ("HTTPS") from ats-tls to HAProxy. I wrote a
> "Caching overview § History" section.
> * Change appserver TLS from Nginx to Envoy.
> * Add new MainStash DB.
> * Include storage ExternalStore DB, ParserCache DB, and Swift media.
> * Include services Shellbox, Mathoid, and Kask.
>
> *== 2. File:WMF_infrastructure_2022.png (Updated) ==*
>
> This is a continuous attempt at an overview of tier-1/user-facing
> infrastructure. It will likely never be complete from all POV, but it is
> more accurate and complete than it has been. Thanks to all that contributed
> by entertaining my many questions over the years.
>
> Previous (2016 by Elukey):
> https://upload.wikimedia.org/wikipedia/labs/4/4d/Infrastructure_overview.png
> New:
> https://upload.wikimedia.org/wikipedia/commons/4/48/WMF_infrastructure_2022.png
> Documentation: wikitech:Wikimedia_infrastructure and wikitech:Purged
> Notable changes:
> * Add new Drmrs data center in Marseille, France.
> * Add new services: purged.go, EventStreams, Thumbor, mcrouter, Envoy,
> etcd.
> * Add new distinction for Multi-DC between primary and secondary data
> center.
> * Change sessionstore from Redis to Kask/Cassandra.
> * Change jobqueue from Redis to EventGate/Kafka.
> * Include distinct MediaWiki server roles and clusters.
> * Include high-level MediaWiki platform components.
> * Include example flow for "JobQueue job" and "CDN purge".
>
> *== 3. File:MediaWiki_infrastructure_2022.png (New) ==*
>
> Similar to the WMF Infra diagram, but more abstract around DC and services,
> and more detailed within the platform. It includes more core services and
> recognises extensions as their own layer.
>
> New:
> https://upload.wikimedia.org/wikipedia/commons/e/ee/MediaWiki_infrastructure_2022.png
> Documentation: wikitech:MediaWiki_at_WMF
>
> *== 4. File:Wikipedia_Memcached_flow_2022.png (Updated) ==*
>
> Previous:
> https://upload.wikimedia.org/wikipedia/commons/d/db/Wikipedia_Memcached_flow_2020.png
> New:
> https://upload.wikimedia.org/wikipedia/commons/4/45/Wikipedia_Memcached_flow_2022.png
> Documentation: wikitech:Memcached_for_MediaWiki
> Notable changes:
> * Include the three tiers of ParserCache.
> * Add WANCache legend to explain different keytypes you may encounter on
> the network.
> * Add full name of the mcrouter-with-onhost-tier service for greppability.
> * Add new WRStats service (T310662). This was part of Multi-DC work to
> reduce primary DB writes and (not bi-di replicated) Redis use in
> AbuseFilter. This service also replaces the old "User ping limiter" in core
> and is now able to serve both use cases.
> * Remove "on-host: soon" labels. Adopting on-host memc for WANCache was
> considered not worth the added runtime complexity (T264604). Note that
> SRE's work on adding 10G network links for memcached hosts, and the
> addition of mcrouter-managed gutter pools, take care of the general use
> case that we were exploring on-host for. We kept it for ParserCach

[Wikitech-l] Reflecting on my listening tour

2023-04-13 Thread Selena Deckelmann
[also posted to wikimedia-l]


Hi everyone,

I joined the Wikimedia Foundation on August 1 of last year in a newly
created role as the Chief Product and Technology Officer (CPTO). (For the
first few weeks, some of the staff called me C3PO as they got used to the
new title :) The role was created to bring both the Product and Technology
departments back under a single accountable leader for the first time since
about 2015. Like Maryana,
I decided to spend the first few months of my time at Wikimedia listening
and learning. Although I come from the open source technology field, and
have worked with volunteers and communities in prior jobs, it felt
important to start here with curiosity and openness about what’s working
well and what needs to change.

Since then, I have met one on one and in small groups with more than 360
people, who spoke with me from 38 different countries. I also attended 22
large and small convenings and events which included about 3,150 people.
This includes members of the Foundation’s product and technology teams,
other Foundation staff, editors, functionaries, affiliates, movement
organizers and open internet partners. I tried to approach every
conversation with curiosity, openness, and eagerness, letting go of any
preconceptions I may have had (intentionally embracing beginner’s mind)
about the Foundation, the
Wikimedia projects, and communities worldwide that contribute to creating
and sharing free knowledge. I can confirm that I quickly found myself awash
in details, experiencing a firehose of information from all sides! My
husband and two young children have also learned a lot more about this
movement in the last six months than you might expect.

To provide myself with some structure, I asked everyone the same kind of
questions about: (1) the impact our product and technology organizations
have had on the movement and/or the world in the last five years, and what
people were most proud of; (2) the current vision and strategy and if they
will take us where we need to go; and (3) the most promising opportunities
that people see in our work, and what is needed to realize that potential.

I want to thank everyone who took the time to share with me, and I’ve
included some direct, anonymized quotes in this letter from the
conversations I had. And I want to confirm that the listening continues — I
will create more spaces in the year ahead for dedicated conversations about
some of the important topics I have highlighted below. I will also be
posting this letter to Meta.

Pulling in the same direction: More visible and shared metrics

On a page of the first notebook I had for my onboarding, I quoted a person
who said they just wanted "meaningful common goals." This was a theme
repeated over and over — a clear desire from everyone to do work together
that was linked by common purpose, and with all the volunteers that have
created all Wikimedia projects. I got to hear so many different voices, and
I heard the details from every side — what’s working, what hasn’t been
working for a long time — some of the problems we face are over ten years
old. People shared what’s missing, what’s extra, who’s fighting to be heard
and who’s feeling lost at sea.

"I think there are lots of promising opportunities to incentivise people to
pay off technical debt and make our existing stack more sustainable. Right
now there are no incentives for engineers in this regard."

"Are we really having impact?"

How can we unite behind meaningful common goals? And which metrics matter
the most? We have so much data, but we really need lodestar (or some refer
to this as north star) metrics across the whole Foundation, a system for
reviewing and
reflecting on what we learn from them, and then a way to connect those
metrics with the day to day work everyone is doing.

To get at that, we’re doing two main things — one is deepening our
understanding of volunteer activities and the health of the volunteer
communities. This will be through working closely with volunteers using
existing processes and sharing what we’re learning, as well as qualitative
and quantitative research workstreams, including reviewing existing
research of volunteer activities and typical work profiles. The other is
working to establish a set of Foundation-wide lodestar metrics. Shared
metrics help everyone understand how we’re measuring success across the
Foundation, and we’re sharing these publicly as part of our Annual Plan.
Over time, we plan to bring our measures of success for important
initiatives to communities for conversations and debate to help everyone
align what success might look like. Shared metrics and data will empower us
to make more effective and better decisions, along with collaboration with
those who are working on changes and those who may 

[Wikitech-l] Re: Reflecting on my listening tour

2023-04-18 Thread Selena Deckelmann
Hi Dan,

Thank you so much for sharing this story.

Similarly, I once was colleagues with a group of people working on process
isolation (https://en.m.wikipedia.org/wiki/Process_isolation) for Firefox.
They had sort of hit a wall where the memory usage was going to be far more
than we thought users could tolerate, and fixing the memory problems would
take quite a few more engineers than anyone thought we could spare. Then,
Spectre/Meltdown (https://meltdownattack.com/) happened. It so happened that
we were together
at an all company meeting, so a group of us got together in a room and
talked about what we needed. The group left the room understanding that
this work was critically important, that we needed a dedicated team, and we
ended up forming a larger team to ship the work with the support (although
not unanimous!) of managers and staff. A lot more to the story before and
after, but that was the beginning of the phase where Firefox ended up actually
shipping process isolation.

What I learned is that it’s possible for critically important change that
might otherwise be stuck to happen if we all have a very good reason to move
it forward (like a very scary security problem!).

The challenge in prioritization I see for WMF is that we need to find these
good reasons, prioritize and do work in small enough chunks that we are
able to evaluate progress and adjust course where needed. It’s common to
slip into analysis paralysis, or believe that it’s too hard to set short
term milestones that deliver significant value to someone.

Finding ways to move forward together this way is what I see as the path.
It has part of the urgency in your story or mine (Meltdown vulns
fortunately aren’t happening every quarter!), but balanced with some kind
of repeatable process.

-selena


On Mon, Apr 17, 2023 at 3:18 PM Dan Garry (Deskana) wrote:

> Despite agreeing wholeheartedly that technical debt, product debt,
> ownership, and maintenance are persistent problems, here's a story about
> when this *didn't* happen, which maybe we can learn from.
>
> Disclaimer: this is from my memory of 2014! Warning, potential inaccuracy
> and rose-tinted glasses!
>
> We had a global login system (single user login, or SUL) but it was in a
> bit of disarray. There were many local accounts disconnected from global
> ones because they were made before the global login system, many username
> conflicts that went unaddressed. Users were given some tools to resolve
> these conflicts, but not enough to actually finalise the whole thing. We
> all agreed it needed solving. We all knew the end state we wanted. But,
> there were multiple technical and product solutions to get there, and no
> actual concerted effort to do it. Many of the username conflicts were
> between long-time community members, so we were sure to get some dedicated
> volunteers angry no matter how we did it. So it sat in limbo, annoying
> everyone, and never happening. Sound familiar?
>
> Around then, WMF leadership introduced a new prioritisation framework:
> "top 5 priorities". This was a ranked list of projects that were considered
> to be more important than others for that quarter. It was intended as a
> first attempt to combat the "if everything's important, nothing's
> important" syndrome. You can't argue with a ranked list! And, number one on
> the list for the first quarter, not something new and shiny, but an old
> one: the SUL finalisation! Sort it all out, once and for all.
>
> Erik Möller (the then VP of Product and Engineering, de facto CPO and CTO
> really, reflecting on it) asked me to be the product manager. I was very
> inexperienced as a PM but had been an editor for eight years, so I
> understood the problem well. Still, I wondered how we were possibly going
> to achieve anything; the project had been "in progress" for years with
> almost no progress. Erik asked me what I needed to make it happen. I got
> some advice, and said I needed a systems designer, a systems architect, an
> engineer that knows the community well, and a community liaison. Erik went
> and had the hard conversations with the people that currently needed said
> people ("It's top priority this quarter, the other stuff has to wait.") and
> went and got those people. We figured it all out, and we did it, once and
> for all (timeline reduced, it did still take multiple quarters, but we knew
> that going in). Everyone now has a fully global account!
>
> Now, times were simpler back then. This exact technique wouldn't work now,
> for multiple reasons. But, I wonder what we can learn from this as an
> organisation. What would it take to repeat this achievement?
>
> Dan
>
> P.S. Some of that team I worked with are still on this list. Hello! Thank
> you for the growth as a PM that I got out of that project, and for beating
> my inexperienced head around a bit until it got more experienced.