dcausse added a comment.
In T293063#7491903 <https://phabricator.wikimedia.org/T293063#7491903>, @JMeybohm wrote:

> @dcausse IIRC we said that "something in the area of hours" would be considered a "short maintenance" and thus would not need any additional actions to be carried out, right?

We are targeting an SLO with an update lag below 10 minutes for 99% of the time. We are still learning what the operational cost of this is, and we are happy to discuss/re-adjust all of this depending on your constraints.

> As part of T251305 <https://phabricator.wikimedia.org/T251305> we will re-create the helm release of flink in both datacenters (one after the other, of course) and that would mean flink will be down for a couple of minutes. If my memory and understanding are still intact, the checkpoint/tombstone metadata is not part of the helm release itself (it's in those flink-managed configmaps), so it should survive purging and recreating the helm release.

Yes, if the configmaps are kept, flink will just restart on its own (a quick way to verify where this metadata lives is sketched at the end of this comment). Regarding lag, I'm not worried: flink already restarts on its own from time to time without affecting the 10-minute lag SLO.

> @Jelto has already done that for the staging flink release. If you have the chance, it would be nice if you could double-check that it is still working as expected.

Checking the logs, I see 2 restarts in the last 7 days, and both restarts properly restored the job (a small sketch for parsing these lines also follows at the end):

    Nov 3, 2021 @ 15:44:33.739  syslog  kubestage1002  Restoring job 095b671d83457ebf4c59166fda7a7055 from Checkpoint 106609 @ 1635954210959 for 095b671d83457ebf4c59166fda7a7055 located at swift://rdf-streaming-updater-staging.thanos-swift/wikidata/checkpoints/095b671d83457ebf4c59166fda7a7055/chk-106609.
    Nov 4, 2021 @ 13:36:35.097  syslog  kubestage1002  Restoring job 095b671d83457ebf4c59166fda7a7055 from Checkpoint 109216 @ 1636032918483 for 095b671d83457ebf4c59166fda7a7055 located at swift://rdf-streaming-updater-staging.thanos-swift/wikidata/checkpoints/095b671d83457ebf4c59166fda7a7055/chk-109216.

So if one of these restarts corresponds to the helm 3 upgrade, then I can confirm that it will work properly on the production clusters.

> Besides that, I tried to understand what would need to be done for a "longer downtime" of k8s and it's not exactly clear to me. Could we have a dedicated section for that on the wikitech page? IIRC that also needed a change to WDQS itself.

Certainly, this task is all about clarifying all this.
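
For the record, a minimal sketch of how the "metadata survives the release" claim can be double-checked. It assumes Flink's native Kubernetes HA services, which label their ConfigMaps with configmap-type=high-availability; the namespace name and the kubeconfig-based client setup here are assumptions, not the actual deployment values:

    #!/usr/bin/env python3
    # Sketch: check that Flink's HA/checkpoint metadata lives in
    # Kubernetes ConfigMaps rather than in the helm release, so it
    # survives a release re-creation. Namespace and label selector are
    # assumptions based on Flink's native Kubernetes HA services;
    # adjust to the actual rdf-streaming-updater deployment.
    from kubernetes import client, config

    NAMESPACE = "rdf-streaming-updater"  # assumed namespace
    LABEL_SELECTOR = "configmap-type=high-availability"

    def list_ha_configmaps():
        config.load_kube_config()  # or load_incluster_config()
        v1 = client.CoreV1Api()
        cms = v1.list_namespaced_config_map(
            NAMESPACE, label_selector=LABEL_SELECTOR)
        for cm in cms.items:
            # The pointer to the job's latest checkpoint is stored in
            # these ConfigMaps; if they are still present after the
            # helm release is deleted and re-installed, the jobmanager
            # can restore from the last checkpoint on its own.
            print(cm.metadata.name, sorted((cm.data or {}).keys()))

    if __name__ == "__main__":
        list_ha_configmaps()

If these ConfigMaps are still listed after the helm release has been re-created, the jobmanager should find the latest checkpoint pointer and restore on its own, which matches the behaviour observed on staging.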
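
And a small sketch for pulling the restore events out of the jobmanager logs, matching the two lines quoted above. Reading from stdin is an assumption; in practice the lines would come from logstash or `kubectl logs`:

    #!/usr/bin/env python3
    # Sketch: extract "Restoring job ... from Checkpoint ..." lines and
    # decode the epoch-millis checkpoint timestamps, to confirm each
    # restart restored from a recent checkpoint.
    import re
    import sys
    from datetime import datetime, timezone

    RESTORE_RE = re.compile(
        r"Restoring job (?P<job>[0-9a-f]+) from Checkpoint (?P<chk>\d+) "
        r"@ (?P<ts>\d+) for [0-9a-f]+ located at (?P<path>\S+)"
    )

    for line in sys.stdin:
        m = RESTORE_RE.search(line)
        if not m:
            continue
        taken_at = datetime.fromtimestamp(
            int(m["ts"]) / 1000, tz=timezone.utc)
        print(f"job={m['job']} checkpoint={m['chk']} "
              f"taken_at={taken_at.isoformat()} "
              f"path={m['path'].rstrip('.')}")

Applied to the two staging lines above, the checkpoint timestamps decode to roughly one minute before each restore (assuming the logged timestamps are UTC), i.e. well within the 10-minute lag SLO.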