dcausse added a comment.

  Thanks for the quick answer! (response inline)
  
  In T301147#7821582 <https://phabricator.wikimedia.org/T301147#7821582>, 
@JMeybohm wrote:
  
  >> - If the above is not possible could we mitigate this problem by 
over-allocating resources (increase the number of replicas) to the deployment 
to increase the chances of proper recovery if this situation happens again?
  >
  > If that makes sense from your POV you could do that ofc. I can't speak on 
how problematic this situation was compared to the potential waste of resources 
another pod means. But if the current workload is already maxing out the 
capacity of the 6 replicas you have, maybe bumping that to 7 might be smart 
anyways to account for peaks?
  
  The additional PODs won't be used because a flink job does not automatically 
scale, so it would be a pure waste of resources (2.5G of reserved memory per 
additional POD). I guess it would only improve redundancy in this scenario if 
k8s assigns every POD to a distinct machine: in that case, even with a single 
machine misbehaving, flink would have enough redundancy to allocate the job to 
the spare POD (see the sketch below). If k8s allocates PODs randomly, or if 
there are not enough k8s worker nodes (one spare POD in our case would mean 
spreading the PODs over 8 different machines), then it's probably not worth the 
waste of resources.
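  
  As a side note, the "one POD per machine" constraint is what k8s calls pod 
anti-affinity. A minimal sketch of what such a constraint could look like in 
the pod spec (the label and values are purely illustrative, not taken from our 
actual chart, and I have not checked whether our deployment tooling exposes 
this field):
  
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: flink-taskmanager       # illustrative label, not the real chart value
              topologyKey: kubernetes.io/hostname  # at most one such POD per worker node
  
  With the "required" variant a POD that cannot be placed on a distinct node 
stays Pending, so with only a handful of worker nodes the softer 
preferredDuringSchedulingIgnoredDuringExecution variant may be safer.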
  
  > In T301147#7821422 <https://phabricator.wikimedia.org/T301147#7821422>, 
@dcausse wrote:
  >
  >> @JMeybohm do you see any additional action items that would improve the 
resilience of k8s in such scenario?
  >
  > Unfortunately we don't have any data on what went wrong on the node. I 
think T277876 <https://phabricator.wikimedia.org/T277876> would be a step in 
the right direction but I also doubt it would have fully prevented this issue 
(ultimately I can't say).
  
  Thanks, I'm adding it to the ticket description as a possible improvement.
