Maintenance is done: all services have been redeployed, and Maps and Toolhub are back up and in their initial state.
Thanks for your patience,

On Wed, Oct 1, 2025 at 12:55 PM Clément Goubert <[email protected]> wrote:

> Starting maintenance.
>
> On Wed, Oct 1, 2025 at 11:54 AM Clément Goubert <[email protected]>
> wrote:
>
>> Hello everyone,
>>
>> An update on the status of Maps for the upcoming upgrade.
>>
>> *Short version:*
>>
>> *Maps will serve some stale map tiles for the next few hours.*
>>
>> *Rationale:*
>>
>> The OSM map tile cache is still being refreshed, there are a lot of
>> elements to fetch and we couldn't make that happen before the upgrade. This
>> refresh will keep happening during the migration, so the amount of stale
>> tiles served will go down as time passes. We decided this was the best of
>> the three options available to us, the other two being depooling the
>> service entirely and having maps be unavailable for the duration of the
>> maintenance, and pushing the date of the upgrade in the future, which would
>> snowball into pushing back the eqiad repool.
>>
>> ---
>>
>> Object: Kubernetes upgrade to 1.31
>>
>> Target: eqiad Wikikube cluster
>>
>> Maintenance window: 2025-10-01 10:00
>> <https://zonestamp.toolforge.org/1759312800>-15:00
>> <https://zonestamp.toolforge.org/1759330800> UTC
>>
>> Tracking task: Phabricator at ⚓T405703 Update wikikube eqiad to
>> kubernetes 1.31 <https://phabricator.wikimedia.org/T405703>
>>
>> Operational channel: IRC #wikimedia-sre
>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-sre>, announcements
>> will be made to IRC #wikimedia-operations
>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-operations>
>>
>> Operating team: SRE ServiceOps (contact IRC #wikimedia-serviceops
>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>)
>>
>> Impact:
>>
>> Users:
>>
>>    - Toolhub will be down for the duration of the window.
>>    - Maps may experience some perturbation during this maintenance,
>>      most probably serving stale map tiles while the cache is being
>>      refreshed.
>>    - No user impact for other services
>>
>> Deployers:
>>
>>    - Deployments to the target cluster will be unavailable. This includes
>>      MediaWiki backports and deployments. DO NOT DEPLOY.
>>    - The following deployment windows are cancelled:
>>       - Services: Citoid/Zotero 11:00 UTC
>>         <https://zonestamp.toolforge.org/1759316400>
>>       - UTC Afternoon Backport Window 13:00 UTC
>>         <https://zonestamp.toolforge.org/1759330800>
>>       - Wikifunctions Services UTC Afternoon 14:00 UTC
>>         <https://zonestamp.toolforge.org/1759327200>
>>
>> Process:
>>
>> All steps handled by SRE ServiceOps
>>
>>    - Maintenance start is announced on #wikimedia-operations and as reply
>>      to this email chain
>>    - All deployments are stopped
>>    - SRE ServiceOps ensures all current versions of deployments can be
>>      safely deployed
>>    - Maintenance begins and should take a couple of hours
>>    - Maps is switched over to codfw new stack, perturbations may start
>>    - Toolhub downtime starts
>>    - Possible Maps fallback to codfw old stack
>>    - Cluster is wiped and upgraded
>>    - Maps and Toolhub are redeployed first to minimize downtime
>>    - Maps is switched back to eqiad, perturbations end
>>    - Toolhub downtime stops
>>    - SRE ServiceOps redeploys all target cluster services
>>    - Maintenance end is announced on #wikimedia-operations and as reply to
>>      this email chain
>>    - Deployments resume
>>
>> Rationale:
>>
>> The date was chosen for convenience as due to the data center switchover
>> process <https://wikitech.wikimedia.org/wiki/Switch_Datacenter>, eqiad
>> is currently fully depooled, receiving almost no traffic. eqiad is
>> scheduled to be repooled on 2025-10-02
>> <https://zonestamp.toolforge.org/1759417200>, which would complicate the
>> upgrade. With eqiad already drained, we expect no visible user impact.
>>
>> SRE ServiceOps will be checking that all services can be safely deployed
>> before the maintenance, and will be redeploying all services before marking
>> the cluster as usable. Deployers are not required to re-deploy their
>> services, unless they have been informed to do so by SRE ServiceOps.
>>
>> During last week’s switchover <https://phabricator.wikimedia.org/T399891>,
>> Toolhub remained in eqiad. This means that there will be an expected
>> unavoidable small downtime of a few hours. To minimize Toolhub’s downtime,
>> we will prioritize its redeployment during the initialization phase.
>>
>> As part of the work to upgrade the Maps infrastructure
>> <https://phabricator.wikimedia.org/T381565> and bring the kartotherian
>> service to Wikikube, kartotherian is currently single-homed in eqiad
>> Wikikube, using the old buster-based stack as a backend. The new
>> bookworm-based stack in codfw is being brought up quickly, so we will use
>> this maintenance as an opportunity to shift traffic to it (Case 1). In
>> addition, we are also warming up the old buster-based stack in codfw so we
>> can fall back to it in case issues arise (Case 2). As of 15 minutes before
>> the maintenance, the OSM map tile cache is still being refreshed. There
>> are a lot of elements to fetch and we couldn't make that happen before the
>> upgrade. This refresh will keep happening during the migration, so the
>> amount of stale tiles served will go down as time passes. We decided this
>> was the best of the three options available to us, the other two being
>> depooling the service entirely and having maps be unavailable for the
>> duration of the maintenance, and pushing the date of the upgrade in the
>> future, which would snowball into pushing back the eqiad repool.
>>
>> Thank you for your understanding and support! If you have any questions
>> regarding this process, please respond to this email, comment on
>> Phabricator at ⚓T405703 Update wikikube eqiad to kubernetes 1.31
>> <https://phabricator.wikimedia.org/T405703>, or reach out directly to me
>> (IRC nickname claime on #wikimedia-serviceops
>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>).
>>
>> On behalf of SRE ServiceOps,
>>
>> On Tue, Sep 30, 2025 at 4:54 PM Clément Goubert <[email protected]>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> A quick update on additional impact for the upcoming maintenance. Fully
>>> updated maintenance description at the end of this email.
>>>
>>> Short version:
>>>
>>> The Maps infrastructure may experience some perturbation during this
>>> maintenance.
>>>
>>> Impact:
>>>
>>> Users:
>>>
>>>    - Case 1: The new bookworm-based codfw stack performs well and service
>>>      disruption should be minimal
>>>    - Case 2: If errors are experienced with the new codfw stack, the
>>>      fallback to the old codfw stack will come with some OSM-data lag, as yet
>>>      unmeasurable
>>>
>>> Mitigation:
>>>
>>>    - Maps will be redeployed with the same priority as Toolhub to
>>>      minimize downtime.
>>>
>>> Rationale:
>>>
>>> As part of the work to upgrade the Maps infrastructure
>>> <https://phabricator.wikimedia.org/T381565> and bring the kartotherian
>>> service to Wikikube, kartotherian is currently single-homed in eqiad
>>> Wikikube, using the old buster-based stack as a backend.
>>>
>>> The new bookworm-based stack in codfw is being brought up quickly, so we
>>> will use this maintenance as an opportunity to shift traffic to it (case
>>> 1). In addition, we are also warming up the old buster-based stack in codfw
>>> so we can fall back to it in case issues arise (case 2).
>>>
>>> ---
>>>
>>> Object: Kubernetes upgrade to 1.31
>>>
>>> Target: eqiad Wikikube cluster
>>>
>>> Maintenance window: 2025-10-01 10:00
>>> <https://zonestamp.toolforge.org/1759312800>-15:00
>>> <https://zonestamp.toolforge.org/1759330800> UTC
>>>
>>> Tracking task: Phabricator at ⚓T405703 Update wikikube eqiad to
>>> kubernetes 1.31 <https://phabricator.wikimedia.org/T405703>
>>>
>>> Operational channel: IRC #wikimedia-sre
>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-sre>,
>>> announcements will be made to IRC #wikimedia-operations
>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-operations>
>>>
>>> Operating team: SRE ServiceOps (contact IRC #wikimedia-serviceops
>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>)
>>>
>>> Impact:
>>>
>>> Users:
>>>
>>>    - Toolhub will be down for the duration of the window.
>>>    - Maps may experience some perturbation during this maintenance.
>>>    - No user impact for other services
>>>
>>> Deployers:
>>>
>>>    - Deployments to the target cluster will be unavailable. This includes
>>>      MediaWiki backports and deployments. DO NOT DEPLOY.
>>>    - The following deployment windows are cancelled:
>>>       - Services: Citoid/Zotero 11:00 UTC
>>>         <https://zonestamp.toolforge.org/1759316400>
>>>       - UTC Afternoon Backport Window 13:00 UTC
>>>         <https://zonestamp.toolforge.org/1759330800>
>>>       - Wikifunctions Services UTC Afternoon 14:00 UTC
>>>         <https://zonestamp.toolforge.org/1759327200>
>>>
>>> Process:
>>>
>>> All steps handled by SRE ServiceOps
>>>
>>>    - Maintenance start is announced on #wikimedia-operations and as reply
>>>      to this email chain
>>>    - All deployments are stopped
>>>    - SRE ServiceOps ensures all current versions of deployments can be
>>>      safely deployed
>>>    - Maintenance begins and should take a couple of hours
>>>    - Maps is switched over to codfw new stack, perturbations may start
>>>    - Toolhub downtime starts
>>>    - Possible Maps fallback to codfw old stack
>>>    - Cluster is wiped and upgraded
>>>    - Maps and Toolhub are redeployed first to minimize downtime
>>>    - Maps is switched back to eqiad, perturbations end
>>>    - Toolhub downtime stops
>>>    - SRE ServiceOps redeploys all target cluster services
>>>    - Maintenance end is announced on #wikimedia-operations and as reply
>>>      to this email chain
>>>    - Deployments resume
>>>
>>> Rationale:
>>>
>>> The date was chosen for convenience as due to the data center
>>> switchover process
>>> <https://wikitech.wikimedia.org/wiki/Switch_Datacenter>, eqiad is
>>> currently fully depooled, receiving almost no traffic. eqiad is scheduled
>>> to be repooled on 2025-10-02
>>> <https://zonestamp.toolforge.org/1759417200>, which would complicate
>>> the upgrade. With eqiad already drained, we expect no visible user impact.
>>>
>>> SRE ServiceOps will be checking that all services can be safely deployed
>>> before the maintenance, and will be redeploying all services before marking
>>> the cluster as usable. Deployers are not required to re-deploy their
>>> services, unless they have been informed to do so by SRE ServiceOps.
>>>
>>> During last week’s switchover
>>> <https://phabricator.wikimedia.org/T399891>, Toolhub remained in eqiad.
>>> This means that there will be an expected unavoidable small downtime of a
>>> few hours. To minimize Toolhub’s downtime, we will prioritize its
>>> redeployment during the initialization phase.
>>>
>>> As part of the work to upgrade the Maps infrastructure
>>> <https://phabricator.wikimedia.org/T381565> and bring the kartotherian
>>> service to Wikikube, kartotherian is currently single-homed in eqiad
>>> Wikikube, using the old buster-based stack as a backend. The new
>>> bookworm-based stack in codfw is being brought up quickly, so we will use
>>> this maintenance as an opportunity to shift traffic to it (Case 1). In
>>> addition, we are also warming up the old buster-based stack in codfw so we
>>> can fall back to it in case issues arise (Case 2).
>>>
>>> Thank you for your understanding and support! If you have any questions
>>> regarding this process, please respond to this email, comment on
>>> Phabricator at ⚓T405703 Update wikikube eqiad to kubernetes 1.31
>>> <https://phabricator.wikimedia.org/T405703>, or reach out directly to
>>> me (IRC nickname claime on #wikimedia-serviceops
>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>).
>>>
>>> On behalf of SRE ServiceOps,
>>>
>>> On Mon, Sep 29, 2025 at 5:37 PM Clément Goubert <[email protected]>
>>> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> Short version:
>>>>
>>>> We will be upgrading the eqiad Wikikube kubernetes
>>>> <https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#WikiKube>
>>>> cluster to 1.31 on Wednesday 2025-10-01 starting at 10:00 UTC
>>>> <https://zonestamp.toolforge.org/1759312800>, ending at 15:00 UTC
>>>> <https://zonestamp.toolforge.org/1759330800>.
>>>>
>>>> Toolhub will be down during this maintenance.
>>>>
>>>> If you are deploying services to the eqiad Wikikube kubernetes cluster:
>>>>
>>>>    - Deployments will be unavailable during the maintenance. DO NOT
>>>>      DEPLOY.
>>>>    - SRE will redeploy all services
>>>>    - SRE will announce the end of maintenance, at which point the
>>>>      cluster will be usable again
>>>>
>>>> ---
>>>>
>>>> Object: Kubernetes upgrade to 1.31
>>>>
>>>> Target: eqiad Wikikube cluster
>>>>
>>>> Maintenance window: 2025-10-01 10:00
>>>> <https://zonestamp.toolforge.org/1759312800>-15:00
>>>> <https://zonestamp.toolforge.org/1759330800> UTC
>>>>
>>>> Tracking task: Phabricator at ⚓T405703 Update wikikube eqiad to
>>>> kubernetes 1.31 <https://phabricator.wikimedia.org/T405703>
>>>>
>>>> Operational channel: IRC #wikimedia-sre
>>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-sre>,
>>>> announcements will be made to IRC #wikimedia-operations
>>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-operations>
>>>>
>>>> Operating team: SRE ServiceOps (contact IRC #wikimedia-serviceops
>>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>)
>>>>
>>>> Impact:
>>>>
>>>> Users:
>>>>
>>>>    - Toolhub will be down for the duration of the window.
>>>>    - No user impact for other services.
>>>>
>>>> Deployers:
>>>>
>>>>    - Deployments to the target cluster will be unavailable. This
>>>>      includes MediaWiki backports and deployments. DO NOT DEPLOY.
>>>>    - The following deployment windows are cancelled:
>>>>       - Services: Citoid/Zotero 11:00 UTC
>>>>         <https://zonestamp.toolforge.org/1759316400>
>>>>       - UTC Afternoon Backport Window 13:00 UTC
>>>>         <https://zonestamp.toolforge.org/1759330800>
>>>>       - Wikifunctions Services UTC Afternoon 14:00 UTC
>>>>         <https://zonestamp.toolforge.org/1759327200>
>>>>
>>>> Process:
>>>>
>>>> All steps handled by SRE ServiceOps
>>>>
>>>>    - Maintenance start is announced on #wikimedia-operations and as
>>>>      reply to this email chain
>>>>    - All deployments are stopped
>>>>    - SRE ServiceOps ensures all current versions of deployments can be
>>>>      safely deployed
>>>>    - Maintenance begins and should take a couple of hours
>>>>    - Toolhub downtime starts
>>>>    - Cluster is wiped and upgraded
>>>>    - Toolhub is redeployed first to minimize downtime
>>>>    - Toolhub downtime stops
>>>>    - SRE ServiceOps redeploys all target cluster services
>>>>    - Maintenance end is announced on #wikimedia-operations and as reply
>>>>      to this email chain
>>>>    - Deployments resume
>>>>
>>>> Rationale:
>>>>
>>>> The date was chosen for convenience as due to the data center
>>>> switchover process
>>>> <https://wikitech.wikimedia.org/wiki/Switch_Datacenter>, eqiad is
>>>> currently fully depooled, receiving almost no traffic. eqiad is scheduled
>>>> to be repooled on 2025-10-02
>>>> <https://zonestamp.toolforge.org/1759417200>, which would complicate
>>>> the upgrade. With eqiad already drained, we expect no visible user impact.
>>>>
>>>> SRE ServiceOps will be checking that all services can be safely
>>>> deployed before the maintenance, and will be redeploying all services
>>>> before marking the cluster as usable. Deployers are not required to
>>>> re-deploy their services, unless they have been informed to do so by SRE
>>>> ServiceOps.
>>>>
>>>> During last week’s switchover
>>>> <https://phabricator.wikimedia.org/T399891>, Toolhub remained in
>>>> eqiad. This means that there will be an expected unavoidable small downtime
>>>> of a few hours. To minimize Toolhub’s downtime, we will prioritize its
>>>> redeployment during the initialization phase.
>>>>
>>>> Thank you for your understanding and support! If you have any questions
>>>> regarding this process, please respond to this email, comment on
>>>> Phabricator at ⚓T405703 Update wikikube eqiad to kubernetes 1.31
>>>> <https://phabricator.wikimedia.org/T405703>, or reach out directly to
>>>> me (IRC nickname claime on #wikimedia-serviceops
>>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>).
>>>>
>>>> On behalf of SRE ServiceOps,
>>>>
>>>> --
>>>> Clément 'claime' Goubert (they/them)
>>>> Senior SRE
>>>> Wikimedia Foundation
>>>>
>>>
>>> --
>>> Clément 'claime' Goubert (they/them)
>>> Senior SRE
>>> Wikimedia Foundation
>>>
>>
>> --
>> Clément 'claime' Goubert (they/them)
>> Senior SRE
>> Wikimedia Foundation
>>
>
> --
> Clément 'claime' Goubert (they/them)
> Senior SRE
> Wikimedia Foundation
>

--
Clément 'claime' Goubert (they/them)
Senior SRE
Wikimedia Foundation
_______________________________________________
Wikitech-l mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
