Maintenance is done: all services have been redeployed, and Maps and Toolhub
are back up and in their initial state.
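
As a quick sanity check on the window quoted below: the two zonestamp links carry Unix epoch seconds. A minimal sketch, assuming only those two values copied from the links (the constant and function names are illustrative, not part of any Wikimedia tooling), decodes them to UTC:

```python
from datetime import datetime, timezone

# Epoch seconds copied from the two zonestamp links for the maintenance window.
WINDOW_START_EPOCH = 1759312800
WINDOW_END_EPOCH = 1759330800


def to_utc(epoch_seconds: int) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


start = to_utc(WINDOW_START_EPOCH)
end = to_utc(WINDOW_END_EPOCH)
duration_hours = (end - start).total_seconds() / 3600

print(start.isoformat())  # 2025-10-01T10:00:00+00:00
print(end.isoformat())    # 2025-10-01T15:00:00+00:00
print(duration_hours)     # 5.0
```

The decoded values match the announced 10:00 to 15:00 UTC window, a five-hour span.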

Thanks for your patience,

On Wed, Oct 1, 2025 at 12:55 PM Clément Goubert <[email protected]>
wrote:

> Starting maintenance.
>
> On Wed, Oct 1, 2025 at 11:54 AM Clément Goubert <[email protected]>
> wrote:
>
>> Hello everyone,
>>
>> An update on the status of Maps for the upcoming upgrade.
>>
>> *Short version:*
>>
>> *Maps will serve some stale map tiles for the next few hours.*
>>
>> *Rationale:*
>>
>> The OSM map tile cache is still being refreshed: there are a lot of
>> elements to fetch, and we couldn't finish before the upgrade. The refresh
>> will continue during the migration, so the number of stale tiles served
>> will go down as time passes. We decided this was the best of the three
>> options available to us, the other two being to depool the service
>> entirely, leaving Maps unavailable for the duration of the maintenance,
>> or to postpone the upgrade, which would snowball into pushing back the
>> eqiad repool.
>>
>> ---
>>
>> Object: Kubernetes upgrade to 1.31
>>
>> Target: eqiad Wikikube cluster
>>
>> Maintenance window: 2025-10-01 10:00
>> <https://zonestamp.toolforge.org/1759312800>-15:00
>> <https://zonestamp.toolforge.org/1759330800> UTC
>>
>> Tracking task: Phabricator at ⚓T405703 Update wikikube eqiad to
>> kubernetes 1.31 <https://phabricator.wikimedia.org/T405703>
>>
>> Operational channel: IRC #wikimedia-sre
>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-sre>, announcements
>> will be made to IRC #wikimedia-operations
>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-operations>
>>
>> Operating team: SRE ServiceOps (contact IRC #wikimedia-serviceops
>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>)
>>
>> Impact:
>>
>> Users:
>>
>>    - Toolhub will be down for the duration of the window.
>>    - Maps may experience some perturbation during this maintenance,
>>      most probably serving stale map tiles while the cache is being
>>      refreshed.
>>    - No user impact for other services
>>
>> Deployers:
>>
>>    - Deployments to the target cluster will be unavailable. This includes
>>      MediaWiki backports and deployments. DO NOT DEPLOY.
>>    - The following deployment windows are cancelled:
>>       - Services: Citoid/Zotero 11:00 UTC
>>         <https://zonestamp.toolforge.org/1759316400>
>>       - UTC Afternoon Backport Window 13:00 UTC
>>         <https://zonestamp.toolforge.org/1759330800>
>>       - Wikifunctions Services UTC Afternoon 14:00 UTC
>>         <https://zonestamp.toolforge.org/1759327200>
>>
>> Process:
>>
>> All steps handled by SRE ServiceOps
>>
>>    - Maintenance start is announced on #wikimedia-operations and as reply
>>      to this email chain
>>    - All deployments are stopped
>>    - SRE ServiceOps ensures all current versions of deployments can be
>>      safely deployed
>>    - Maintenance begins and should take a couple of hours
>>    - Maps is switched over to codfw new stack, perturbations may start
>>    - Toolhub downtime starts
>>    - Possible Maps fallback to codfw old stack
>>    - Cluster is wiped and upgraded
>>    - Maps and Toolhub are redeployed first to minimize downtime
>>    - Maps is switched back to eqiad, perturbations end
>>    - Toolhub downtime stops
>>    - SRE ServiceOps redeploys all target cluster services
>>    - Maintenance end is announced on #wikimedia-operations and as reply to
>>      this email chain
>>    - Deployments resume
>>
>> Rationale:
>>
>> The date was chosen for convenience: because of the data center switchover
>> process <https://wikitech.wikimedia.org/wiki/Switch_Datacenter>, eqiad
>> is currently fully depooled and receives almost no traffic. eqiad is
>> scheduled to be repooled on 2025-10-02
>> <https://zonestamp.toolforge.org/1759417200>, after which the upgrade
>> would be more complicated. With eqiad already drained, we expect no
>> visible user impact.
>>
>> SRE ServiceOps will be checking that all services can be safely deployed
>> before the maintenance, and will be redeploying all services before marking
>> the cluster as usable. Deployers are not required to redeploy their
>> services, unless they have been informed to do so by SRE ServiceOps.
>>
>> During last week’s switchover <https://phabricator.wikimedia.org/T399891>,
>> Toolhub remained in eqiad. This means that a short, unavoidable downtime of
>> a few hours is expected. To minimize Toolhub’s downtime, we will prioritize
>> its redeployment during the initialization phase.
>>
>> As part of the work to upgrade the Maps infrastructure
>> <https://phabricator.wikimedia.org/T381565> and bring the kartotherian
>> service to Wikikube, kartotherian is currently single-homed in eqiad
>> Wikikube, using the old buster-based stack as a backend. The new
>> bookworm-based stack in codfw is being brought up quickly, so we will use
>> this maintenance as an opportunity to shift traffic to it (Case 1). In
>> addition, we are also warming up the old buster-based stack in codfw so we
>> can fall back to it in case issues arise (Case 2). As of 15 minutes before
>> the maintenance, the OSM map tile cache is still being refreshed: there
>> are a lot of elements to fetch, and we couldn't finish before the upgrade.
>> The refresh will continue during the migration, so the number of stale
>> tiles served will go down as time passes. We decided this was the best of
>> the three options available to us, the other two being to depool the
>> service entirely, leaving Maps unavailable for the duration of the
>> maintenance, or to postpone the upgrade, which would snowball into pushing
>> back the eqiad repool.
>>
>> Thank you for your understanding and support! If you have any questions
>> regarding this process, please respond to this email, comment on
>> Phabricator at ⚓T405703 Update wikikube eqiad to kubernetes 1.31
>> <https://phabricator.wikimedia.org/T405703>, or reach out directly to me
>> (IRC nickname claime on #wikimedia-serviceops
>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>).
>>
>> On behalf of SRE ServiceOps,
>>
>>
>> On Tue, Sep 30, 2025 at 4:54 PM Clément Goubert <[email protected]>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> A quick update on additional impact for the upcoming maintenance. The
>>> fully updated maintenance description is at the end of this email.
>>>
>>> Short version:
>>>
>>> The Maps infrastructure may experience some perturbation during this
>>> maintenance.
>>>
>>> Impact:
>>>
>>> Users:
>>>
>>>    - Case 1: The new bookworm-based codfw stack performs well and service
>>>      disruption should be minimal
>>>    - Case 2: If errors are experienced with the new codfw stack, the
>>>      fallback to the old codfw stack will come with some OSM-data lag, as yet
>>>      unmeasurable
>>>
>>> Mitigation:
>>>
>>>    - Maps will be redeployed with the same priority as Toolhub to
>>>      minimize downtime.
>>>
>>> Rationale:
>>>
>>> As part of the work to upgrade the Maps infrastructure
>>> <https://phabricator.wikimedia.org/T381565> and bring the kartotherian
>>> service to Wikikube, kartotherian is currently single-homed in eqiad
>>> Wikikube, using the old buster-based stack as a backend.
>>>
>>> The new bookworm-based stack in codfw is being brought up quickly, so we
>>> will use this maintenance as an opportunity to shift traffic to it (case
>>> 1). In addition, we are also warming up the old buster-based stack in codfw
>>> so we can fall back to it in case issues arise (case 2).
>>>
>>> ---
>>>
>>> Object: Kubernetes upgrade to 1.31
>>>
>>> Target: eqiad Wikikube cluster
>>>
>>> Maintenance window: 2025-10-01 10:00
>>> <https://zonestamp.toolforge.org/1759312800>-15:00
>>> <https://zonestamp.toolforge.org/1759330800> UTC
>>>
>>> Tracking task: Phabricator at ⚓T405703 Update wikikube eqiad to
>>> kubernetes 1.31 <https://phabricator.wikimedia.org/T405703>
>>>
>>> Operational channel: IRC #wikimedia-sre
>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-sre>,
>>> announcements will be made to IRC #wikimedia-operations
>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-operations>
>>>
>>> Operating team: SRE ServiceOps (contact IRC #wikimedia-serviceops
>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>)
>>>
>>> Impact:
>>>
>>> Users:
>>>
>>>    - Toolhub will be down for the duration of the window.
>>>    - Maps may experience some perturbation during this maintenance.
>>>    - No user impact for other services
>>>
>>> Deployers:
>>>
>>>    - Deployments to the target cluster will be unavailable. This includes
>>>      MediaWiki backports and deployments. DO NOT DEPLOY.
>>>    - The following deployment windows are cancelled:
>>>       - Services: Citoid/Zotero 11:00 UTC
>>>         <https://zonestamp.toolforge.org/1759316400>
>>>       - UTC Afternoon Backport Window 13:00 UTC
>>>         <https://zonestamp.toolforge.org/1759330800>
>>>       - Wikifunctions Services UTC Afternoon 14:00 UTC
>>>         <https://zonestamp.toolforge.org/1759327200>
>>>
>>> Process:
>>>
>>> All steps handled by SRE ServiceOps
>>>
>>>    - Maintenance start is announced on #wikimedia-operations and as reply
>>>      to this email chain
>>>    - All deployments are stopped
>>>    - SRE ServiceOps ensures all current versions of deployments can be
>>>      safely deployed
>>>    - Maintenance begins and should take a couple of hours
>>>    - Maps is switched over to codfw new stack, perturbations may start
>>>    - Toolhub downtime starts
>>>    - Possible Maps fallback to codfw old stack
>>>    - Cluster is wiped and upgraded
>>>    - Maps and Toolhub are redeployed first to minimize downtime
>>>    - Maps is switched back to eqiad, perturbations end
>>>    - Toolhub downtime stops
>>>    - SRE ServiceOps redeploys all target cluster services
>>>    - Maintenance end is announced on #wikimedia-operations and as reply
>>>      to this email chain
>>>    - Deployments resume
>>>
>>> Rationale:
>>>
>>> The date was chosen for convenience: because of the data center
>>> switchover process
>>> <https://wikitech.wikimedia.org/wiki/Switch_Datacenter>, eqiad is
>>> currently fully depooled and receives almost no traffic. eqiad is
>>> scheduled to be repooled on 2025-10-02
>>> <https://zonestamp.toolforge.org/1759417200>, after which the upgrade
>>> would be more complicated. With eqiad already drained, we expect no
>>> visible user impact.
>>>
>>> SRE ServiceOps will be checking that all services can be safely deployed
>>> before the maintenance, and will be redeploying all services before marking
>>> the cluster as usable. Deployers are not required to redeploy their
>>> services, unless they have been informed to do so by SRE ServiceOps.
>>>
>>> During last week’s switchover
>>> <https://phabricator.wikimedia.org/T399891>, Toolhub remained in eqiad.
>>> This means that a short, unavoidable downtime of a few hours is
>>> expected. To minimize Toolhub’s downtime, we will prioritize its
>>> redeployment during the initialization phase.
>>>
>>> As part of the work to upgrade the Maps infrastructure
>>> <https://phabricator.wikimedia.org/T381565> and bring the kartotherian
>>> service to Wikikube, kartotherian is currently single-homed in eqiad
>>> Wikikube, using the old buster-based stack as a backend. The new
>>> bookworm-based stack in codfw is being brought up quickly, so we will use
>>> this maintenance as an opportunity to shift traffic to it (Case 1). In
>>> addition, we are also warming up the old buster-based stack in codfw so we
>>> can fall back to it in case issues arise (Case 2).
>>>
>>> Thank you for your understanding and support! If you have any questions
>>> regarding this process, please respond to this email, comment on
>>> Phabricator at ⚓T405703 Update wikikube eqiad to kubernetes 1.31
>>> <https://phabricator.wikimedia.org/T405703>, or reach out directly to
>>> me (IRC nickname claime on #wikimedia-serviceops
>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>).
>>>
>>> On behalf of SRE ServiceOps,
>>>
>>> On Mon, Sep 29, 2025 at 5:37 PM Clément Goubert <[email protected]>
>>> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> Short version:
>>>>
>>>> We will be upgrading the eqiad Wikikube kubernetes
>>>> <https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#WikiKube>
>>>> cluster to 1.31 on Wednesday 2025-10-01 starting at 10:00 UTC
>>>> <https://zonestamp.toolforge.org/1759312800>, ending at 15:00 UTC
>>>> <https://zonestamp.toolforge.org/1759330800>.
>>>>
>>>> Toolhub will be down during this maintenance.
>>>>
>>>> If you are deploying services to the eqiad Wikikube kubernetes cluster:
>>>>
>>>>    - Deployments will be unavailable during the maintenance. DO NOT
>>>>      DEPLOY.
>>>>    - SRE will redeploy all services
>>>>    - SRE will announce the end of maintenance, at which point the
>>>>      cluster will be usable again
>>>>
>>>> ---
>>>>
>>>> Object: Kubernetes upgrade to 1.31
>>>>
>>>> Target: eqiad Wikikube cluster
>>>>
>>>> Maintenance window: 2025-10-01 10:00
>>>> <https://zonestamp.toolforge.org/1759312800>-15:00
>>>> <https://zonestamp.toolforge.org/1759330800> UTC
>>>>
>>>> Tracking task: Phabricator at ⚓T405703 Update wikikube eqiad to
>>>> kubernetes 1.31 <https://phabricator.wikimedia.org/T405703>
>>>>
>>>> Operational channel: IRC #wikimedia-sre
>>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-sre>,
>>>> announcements will be made to IRC #wikimedia-operations
>>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-operations>
>>>>
>>>> Operating team: SRE ServiceOps (contact IRC #wikimedia-serviceops
>>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>)
>>>>
>>>> Impact:
>>>>
>>>> Users:
>>>>
>>>>    - Toolhub will be down for the duration of the window.
>>>>    - No user impact for other services.
>>>>
>>>> Deployers:
>>>>
>>>>    - Deployments to the target cluster will be unavailable. This
>>>>      includes MediaWiki backports and deployments. DO NOT DEPLOY.
>>>>    - The following deployment windows are cancelled:
>>>>       - Services: Citoid/Zotero 11:00 UTC
>>>>         <https://zonestamp.toolforge.org/1759316400>
>>>>       - UTC Afternoon Backport Window 13:00 UTC
>>>>         <https://zonestamp.toolforge.org/1759330800>
>>>>       - Wikifunctions Services UTC Afternoon 14:00 UTC
>>>>         <https://zonestamp.toolforge.org/1759327200>
>>>>
>>>> Process:
>>>>
>>>> All steps handled by SRE ServiceOps
>>>>
>>>>    - Maintenance start is announced on #wikimedia-operations and as
>>>>      reply to this email chain
>>>>    - All deployments are stopped
>>>>    - SRE ServiceOps ensures all current versions of deployments can be
>>>>      safely deployed
>>>>    - Maintenance begins and should take a couple of hours
>>>>    - Toolhub downtime starts
>>>>    - Cluster is wiped and upgraded
>>>>    - Toolhub is redeployed first to minimize downtime
>>>>    - Toolhub downtime stops
>>>>    - SRE ServiceOps redeploys all target cluster services
>>>>    - Maintenance end is announced on #wikimedia-operations and as reply
>>>>      to this email chain
>>>>    - Deployments resume
>>>>
>>>> Rationale:
>>>>
>>>> The date was chosen for convenience: because of the data center
>>>> switchover process
>>>> <https://wikitech.wikimedia.org/wiki/Switch_Datacenter>, eqiad is
>>>> currently fully depooled and receives almost no traffic. eqiad is
>>>> scheduled to be repooled on 2025-10-02
>>>> <https://zonestamp.toolforge.org/1759417200>, after which the upgrade
>>>> would be more complicated. With eqiad already drained, we expect no
>>>> visible user impact.
>>>>
>>>> SRE ServiceOps will be checking that all services can be safely
>>>> deployed before the maintenance, and will be redeploying all services
>>>> before marking the cluster as usable. Deployers are not required to
>>>> re-deploy their services, unless they have been informed to do so by SRE
>>>> ServiceOps.
>>>>
>>>> During last week’s switchover
>>>> <https://phabricator.wikimedia.org/T399891>, Toolhub remained in
>>>> eqiad. This means that a short, unavoidable downtime of a few hours is
>>>> expected. To minimize Toolhub’s downtime, we will prioritize its
>>>> redeployment during the initialization phase.
>>>>
>>>> Thank you for your understanding and support! If you have any questions
>>>> regarding this process, please respond to this email, comment on
>>>> Phabricator at ⚓T405703 Update wikikube eqiad to kubernetes 1.31
>>>> <https://phabricator.wikimedia.org/T405703>, or reach out directly to
>>>> me (IRC nickname claime on #wikimedia-serviceops
>>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>).
>>>>
>>>> On behalf of SRE ServiceOps,
>>>>
>>>> --
>>>> Clément 'claime' Goubert (they/them)
>>>> Senior SRE
>>>> Wikimedia Foundation
>>>>


-- 
Clément 'claime' Goubert (they/them)
Senior SRE
Wikimedia Foundation
_______________________________________________
Wikitech-l mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
