dcausse added a comment.

  In T293063#7491903 <https://phabricator.wikimedia.org/T293063#7491903>, 
@JMeybohm wrote:
  
  > @dcausse IIRC we said that "something in the area of hours" would be considered a "short maintenance" and thus would not need any additional actions to be carried out, right?
  
  We are targeting an SLO with an update lag below 10 minutes for 99% of the time. We are still learning what the operational cost of this is, and we are happy to discuss/re-adjust it depending on your constraints.
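  
  As a rough illustration (this is not our actual alerting; the sampling details here are an assumption on my side), compliance with that target can be estimated from periodic lag samples along these lines:
  
    # Minimal sketch: given (timestamp, lag_seconds) samples of the updater's
    # consumer lag, estimate what fraction of the window stayed under 10 minutes.
    # Assumes roughly evenly spaced samples; the real check lives in dashboards/alerts.
    def slo_compliance(samples, threshold_s=600):
        samples = list(samples)
        if not samples:
            return None
        ok = sum(1 for _, lag in samples if lag < threshold_s)
        return ok / len(samples)
    # The 99% target is met when slo_compliance(samples) >= 0.99.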
  
  > As part of T251305 <https://phabricator.wikimedia.org/T251305> we will re-create the helm release of flink in both datacenters (one after the other ofc.) and that would mean flink will be down for a couple of minutes. If my memory and understanding are still intact, the checkpoint/tombstone metadata is not part of the helm release itself (it's in those flink-managed configmaps). So it should survive purging and recreating the helm release.
  
  Yes, if the configmaps are kept, Flink will just restart on its own. Regarding lag I'm not worried: Flink already restarts on its own from time to time without affecting the 10-minute lag SLO.
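  
  As a sanity check, here is a minimal sketch of how the configmaps could be compared before and after the helm release is re-created (the namespace name and the diff-by-name approach are assumptions on my side; it uses the standard kubernetes Python client):
  
    # Minimal sketch: list configmap names in the rdf-streaming-updater namespace
    # so they can be compared before and after the helm release is purged/re-created.
    from kubernetes import client, config
    def configmap_names(namespace="rdf-streaming-updater"):
        config.load_kube_config()  # or load_incluster_config() inside the cluster
        v1 = client.CoreV1Api()
        return sorted(cm.metadata.name
                      for cm in v1.list_namespaced_config_map(namespace).items)
    # Run once before and once after the re-create; the Flink-managed
    # (HA / job metadata) configmaps should appear in both lists.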
  
  > @Jelto has already done that for the staging flink release. If you have the chance, it would be nice if you could double-check that it is still working as expected.
  
  Checking the logs, I see two restarts in the last 7 days, and both properly restored the job:
  
    Nov 3, 2021 @ 15:44:33.739  syslog  kubestage1002   Restoring job 
095b671d83457ebf4c59166fda7a7055 from Checkpoint 106609 @ 1635954210959 for 
095b671d83457ebf4c59166fda7a7055 located at 
swift://rdf-streaming-updater-staging.thanos-swift/wikidata/checkpoints/095b671d83457ebf4c59166fda7a7055/chk-106609.
    
    Nov 4, 2021 @ 13:36:35.097  syslog  kubestage1002   Restoring job 
095b671d83457ebf4c59166fda7a7055 from Checkpoint 109216 @ 1636032918483 for 
095b671d83457ebf4c59166fda7a7055 located at 
swift://rdf-streaming-updater-staging.thanos-swift/wikidata/checkpoints/095b671d83457ebf4c59166fda7a7055/chk-109216.
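  
  For reference, a minimal sketch of how these restore events can be pulled out of syslog; the regex simply matches the message format shown above and is not a stable interface to Flink's logging:
  
    # Extract (job id, checkpoint id, checkpoint path) from Flink restore log lines.
    import re
    RESTORE_RE = re.compile(
        r"Restoring job (?P<job>\S+) from Checkpoint (?P<chk>\d+) @ \d+ "
        r"for \S+ located at (?P<path>\S+)\.")
    def restores(lines):
        for line in lines:
            m = RESTORE_RE.search(line)
            if m:
                yield m.group("job"), int(m.group("chk")), m.group("path")
    # e.g.: for job, chk, path in restores(open("syslog")): print(job, chk, path)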
  
  So, if one of these restarts corresponds to the helm 3 upgrade, then I can confirm that it will work properly on the production clusters.
  
  > Besides that I tried to understand what would need to be done for a "longer downtime" of k8s, and it's not exactly clear to me. Could we have a dedicated section for that on the wikitech page? IIRC that also needed a change to WDQS itself.
  
  Certainly, clarifying all of this is exactly what this task is about.
