[Wikidata-bugs] [Maniphest] T350784: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator
JMeybohm updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T350784 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: bking, JMeybohm Cc: dcausse, JMeybohm, Aklapper, bking, Danny_Benjafield_WMDE, Isabelladantes1983, Themindcoder, Adamm71, S8321414, Jersione, Hellket777, LisafBia6531, Astuthiodit_1, AWesterinen, 786, BTullis, Biggs657, karapayneWMDE, Invadibot, maantietaja, Juan90264, Alter-paule, Beast1978, ItamarWMDE, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, KimKelting, merbst, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T362084: shellbox-constraints returning 500 on preg_match error
JMeybohm added a comment.

In T362084#9700057 <https://phabricator.wikimedia.org/T362084#9700057>, @Lucas_Werkmeister_WMDE wrote:
> Can someone clarify what the problem here is? From WBQC’s perspective, it’s totally expected that some of these regex checks will fail (though there’s some confusion about which shellbox errors we should or shouldn’t try to catch, see T304084 <https://phabricator.wikimedia.org/T304084> and especially T304084#8561863 <https://phabricator.wikimedia.org/T304084#8561863>). But we might need to make some changes to keep the service mesh monitoring happy? (“exceeding retry limit” also sounds concerning – we don’t really want these requests to be retried, I think.)

The reason I opened the task is that the stream of errors/retries looked fishy and I wanted to make sure we understand what is going on. From my naive POV I would argue that a preg_match error should not be classified as an internal server error, as it results from (invalid) user input. If a 500 is the desired response in that case, we should probably, as you said, configure the service mesh to not retry requests on 500 errors, as retrying does not make any sense then.
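To illustrate the classification argued for above, here is a minimal sketch, not shellbox or WBQC code. The handler name `check_pattern` and the response shape are hypothetical, and Python's `re` module stands in for PHP's `preg_match`, which can also fail at match time (for example on a backtrack limit), something this sketch does not model.

```python
import re
from http import HTTPStatus
from typing import Tuple


def check_pattern(pattern: str, text: str) -> Tuple[int, dict]:
    """Return a (status, body) pair for a regex-match request.

    Hypothetical handler: an invalid pattern is reported as a client
    error (400) instead of an internal server error (500).
    """
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # The pattern came from the caller, so this is bad input, not a
        # server fault. A 4xx is also not matched by a retry-on-5xx policy.
        return HTTPStatus.BAD_REQUEST, {"error": f"invalid pattern: {exc}"}
    return HTTPStatus.OK, {"matched": compiled.search(text) is not None}
```

With a response like this, a service mesh that retries only on 5xx would leave the request alone, which is the behaviour discussed in the comment.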
TASK DETAIL https://phabricator.wikimedia.org/T362084
[Wikidata-bugs] [Maniphest] T350784: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator
JMeybohm updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T350784
[Wikidata-bugs] [Maniphest] T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes
JMeybohm added a comment.

In T293063#8582600 <https://phabricator.wikimedia.org/T293063#8582600>, @dcausse wrote:
> Hey, clarified this a bit, renamed it to "Hard depool/re-pool", yes in this method the jobs should start right after the helm deploy, the jar is stored in swift so no need to deploy it manually.

Cool, thanks. That would make it hands-off for anybody but sre/serviceops, which ofc would be nice. Anyhow, AIUI this process will be more or less the same for flink deployments managed by the flink operator. It would be nice if you could verify this during your tests with the operator (I'm happy to help/pair ofc.), or check whether there is maybe even a better option in the flink-operator world.

TASK DETAIL https://phabricator.wikimedia.org/T293063
[Wikidata-bugs] [Maniphest] T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes
JMeybohm added a comment.

Hey @dcausse, I'm reading this again because of the upcoming k8s 1.23 upgrade and was wondering: in the "To restore:" section of "Alternate actions (not fully untested):", do we need to start the job somehow as well, specifying which jar file to use? Or is that information part of the configmaps/savepoint, so that the job can start automatically without submitting a jar?

TASK DETAIL https://phabricator.wikimedia.org/T293063
[Wikidata-bugs] [Maniphest] T326409: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model
JMeybohm added a project: serviceops-radar. Restricted Application added a project: wdwb-tech. TASK DETAIL https://phabricator.wikimedia.org/T326409
[Wikidata-bugs] [Maniphest] T326409: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model
JMeybohm updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T326409
[Wikidata-bugs] [Maniphest] T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes
JMeybohm updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T293063
[Wikidata-bugs] [Maniphest] T301147: The WDQS streaming updater went unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00)
JMeybohm added a comment.

In T301147#7821813 <https://phabricator.wikimedia.org/T301147#7821813>, @dcausse wrote:
> The additional PODs won't be used as a flink job does not automatically scale so it would be a pure waste of resources (2.5G of reserved mem per additional POD). It would help I guess to improve redundancy in this scenario only if k8s assigns every POD to a distinct machine, in which case even with a single machine misbehaving flink would have enough redundancy to allocate the job to the spare POD. If k8s does do allocation randomly or that there are not enough k8s worker nodes (1 spare POD in our case would mean spreading the PODs over 8 different machines) then it's probably not worth the waste of resources.

K8s will try to schedule replicas of one Deployment onto different Nodes by default and we can also force it to do so. But tbh I would not do that in this case, as in most cases the default should be just fine. I expect this situation to be a rare exception (and I probably jinxed that now), as we have not seen it before or since. So as long as it's not super critical, I would refrain from trying to optimize the workload for this type of failure. Ultimately this should be taken care of by k8s, so we should invest there, especially if it should happen again.
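For reference, one way to "force" replicas onto distinct Nodes is a required pod anti-affinity rule keyed on the node hostname. The sketch below only builds that stanza as a Python dict; the function name and the `app` label value are hypothetical, and nothing here is taken from the actual rdf-streaming-updater chart.

```python
import json


def required_anti_affinity(app_label: str) -> dict:
    """Build a Pod `affinity` stanza that forbids co-locating replicas.

    Pods carrying the label app=<app_label> (a hypothetical label) may not
    be scheduled onto a node that already runs a Pod with the same label.
    """
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    # One Pod per node: the topology domain is the hostname.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


if __name__ == "__main__":
    print(json.dumps(required_anti_affinity("flink-taskmanager"), indent=2))
```

A required rule leaves Pods unschedulable when there are fewer eligible nodes than replicas, which is one reason to prefer the default best-effort spreading as suggested above.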
TASK DETAIL https://phabricator.wikimedia.org/T301147
[Wikidata-bugs] [Maniphest] T301147: The WDQS streaming updater went unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00)
JMeybohm added a comment.

> To be discussed with service ops:
>
> - Investigate and address the reasons why after a node failure k8s did not fulfill its promise of making sure that the rdf-streaming-updater deployment have 6 working replicas

The problem was more that the node did not really fail (not to its full extent). It was heavily overloaded (for an unknown reason) and that's potentially why containers/processes running there seemed dead. But from the K8s perspective the Pods were still running, and a new Pod was scheduled as soon as I power cycled the node (i.e. K8s was then able to detect a mismatch between desired and existing replicas).

> - If the above is not possible could we mitigate this problem by over-allocating resources (increase the number of replicas) to the deployment to increase the chances of proper recovery if this situation happens again?

If that makes sense from your POV you could do that ofc. I can't speak to how problematic this situation was compared to the potential waste of resources another pod means. But if the current workload is already maxing out the capacity of the 6 replicas you have, maybe bumping that to 7 would be smart anyway to account for peaks?

In T301147#7821422 <https://phabricator.wikimedia.org/T301147#7821422>, @dcausse wrote:
> @JMeybohm do you see any additional action items that would improve the resilience of k8s in such scenario?

Unfortunately we don't have any data on what went wrong on the node. I think T277876 <https://phabricator.wikimedia.org/T277876> would be a step in the right direction, but I also doubt it would have fully prevented this issue (ultimately I can't say).
TASK DETAIL https://phabricator.wikimedia.org/T301147
[Wikidata-bugs] [Maniphest] T301147: The WDQS streaming updater went unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00)
JMeybohm added a comment.

In T301147#7689837 <https://phabricator.wikimedia.org/T301147#7689837>, @dcausse wrote:
> @JMeybohm we're still investigating why the application did not properly recover while kubernetes1014 went down but if you have ideas on the two questions in the ticket description this would be very helpful, thanks!

Unfortunately I'm not exactly sure what happened to the node. What I know is that the system load surged (potentially due to high iowait), leaving running processes practically starving, but the system was still responding to ICMP and the kubernetes status heartbeats still (mostly) worked, which left the node flipping between Ready and NotReady state. That means the node was not actually down from the k8s POV, which is why no new Pods were created until I drained the node, or rather until I power cycled it (evicting pods was actually hanging as well, as k8s tries to be nice and the node was still in its overloaded state).

TASK DETAIL https://phabricator.wikimedia.org/T301147
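The effect described above can be shown with a deliberately simplified model: Pods are only replaced once a node has been NotReady for an uninterrupted grace period, so a node that keeps flipping back to Ready never reaches it. This is a toy sketch, not the Kubernetes node lifecycle controller; the function name, the sample interval and the grace period are illustrative assumptions.

```python
from typing import List


def would_evict(ready_samples: List[bool], interval_s: int, grace_s: int) -> bool:
    """Return True if the node is NotReady for at least grace_s in a row.

    ready_samples holds one observed Ready/NotReady status per interval_s
    seconds. Any Ready sample resets the NotReady streak to zero.
    """
    not_ready_for = 0
    for ready in ready_samples:
        not_ready_for = 0 if ready else not_ready_for + interval_s
        if not_ready_for >= grace_s:
            return True
    return False


if __name__ == "__main__":
    # Assumed numbers: a status sample every 40 s and a 300 s grace period.
    flapping = ([False] * 5 + [True]) * 20  # mostly NotReady, but recovers
    dead = [False] * 10                     # never recovers
    print(would_evict(flapping, 40, 300))
    print(would_evict(dead, 40, 300))
```

In this model the flapping node is never evicted even though it is NotReady most of the time, while the node that stops reporting altogether is.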
[Wikidata-bugs] [Maniphest] T280485: Additional capacity on the k8s Flink cluster for WCQS updater
JMeybohm added a comment.

I'd opt for "reuse the same [flink] cluster", from the perspective that we treat this as snowflaky-ish in the k8s clusters. So fewer flink clusters means fewer snowflakes (at some point it does become a snowball, right?).

TASK DETAIL https://phabricator.wikimedia.org/T280485
[Wikidata-bugs] [Maniphest] T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes
JMeybohm added subscribers: Jelto, JMeybohm. JMeybohm added a comment.

@dcausse IIRC we said that "something in the area of hours" would be considered a "short maintenance" and thus would not need any additional actions to be carried out, right? As part of T251305 <https://phabricator.wikimedia.org/T251305> we will re-create the helm release of flink in both datacenters (one after the other ofc.), and that would mean flink will be down for a couple of minutes. If my memory and understanding are still intact, the checkpoint/tombstone metadata is not part of the helm release itself (it's in those flink-managed configmaps), so it should survive purging and recreating the helm release. @Jelto has already done that for the staging flink release. If you have the chance, it would be nice if you could double check that it is still working as expected.

Besides that, I tried to understand what would need to be done for a "longer downtime" of k8s and it's not exactly clear to me. Could we have a dedicated section for that on the Wikitech page? IIRC that also needed a change to WDQS itself.

TASK DETAIL https://phabricator.wikimedia.org/T293063
[Wikidata-bugs] [Maniphest] T287443: Flink jobmanager and taskmanager cannot talk to the k8s api server
JMeybohm closed this task as "Resolved". JMeybohm added a comment. Thanks, closing then. TASK DETAIL https://phabricator.wikimedia.org/T287443
[Wikidata-bugs] [Maniphest] T287443: Flink jobmanager and taskmanager cannot talk to the k8s api server
JMeybohm added a comment.

That is because your application is reading the default kubernetes environment variables, which carry the ClusterIP of `kubernetes.default.svc.cluster.local` instead of its name. Unfortunately, the ClusterIP is not in the certificate on the actual servers. Please don't set `kubestagemaster.svc.eqiad.wmnet`, as that will only work on one cluster. If flink allows you to override the API server's hostname, please point it to `kubernetes.default.svc.cluster.local` (which works transparently in all clusters). If flink does not allow overriding the hostname, please see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/eventrouter/templates/_helpers.tpl#57 and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/eventrouter/templates/deployment.yaml#38 for a workaround.

TASK DETAIL https://phabricator.wikimedia.org/T287443
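As a minimal sketch of the first suggestion, a client could build the API server URL from the service DNS name and ignore the injected `KUBERNETES_SERVICE_HOST` variable. This is not flink's configuration mechanism and not the eventrouter workaround linked above; the function name and the default port 443 are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

# DNS name of the in-cluster API service, as given in the comment above.
API_SERVICE_NAME = "kubernetes.default.svc.cluster.local"


def api_server_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Build the API server URL from the service DNS name.

    KUBERNETES_SERVICE_HOST is deliberately ignored: it holds the ClusterIP,
    which would fail TLS verification when the certificate lacks that IP.
    Only the port is read from the environment (default 443 is an assumption).
    """
    env = os.environ if env is None else env
    port = env.get("KUBERNETES_SERVICE_PORT", "443")
    return f"https://{API_SERVICE_NAME}:{port}"


if __name__ == "__main__":
    print(api_server_url())
```

Because the name is the same in every cluster, the same setting works everywhere, unlike a cluster-specific hostname such as `kubestagemaster.svc.eqiad.wmnet`.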
[Wikidata-bugs] [Maniphest] T287443: Flink jobmanager and taskmanager cannot talk to the k8s api server
JMeybohm claimed this task. JMeybohm added a comment.

Looking into this. The problem is that we currently do not allow Pods to access the Kubernetes API servers (an Egress rule is missing) and it's not super trivial to allow that in a transparent way (e.g. without having to declare the API servers' IPs in Kubernetes).

TASK DETAIL https://phabricator.wikimedia.org/T287443
[Wikidata-bugs] [Maniphest] T285219: cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503
JMeybohm added a subscriber: RLazarus. JMeybohm added a comment.

Picking up from the IRC conversation yesterday, @RLazarus figured that the response body looks like it is https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/master/errorpages/503.html

At the time this issue was opened (June 21) we did have some database issues, so the increased rate of 503s from the apiservers is most likely due to that. The picture has changed from June to now: in the last two weeks only a handful of requests failed for cxserver, most of them due to upstream connection failure ("UF" in the response flags field). Those errors might happen from time to time due to the service-proxy creating persistent connections which then might get closed server side, or due to some network issues. But as that is happening at a very low rate, we have not dug into it further so far.

TASK DETAIL https://phabricator.wikimedia.org/T285219
[Wikidata-bugs] [Maniphest] T273098: High Availability Flink
JMeybohm added a comment.

I do see that using the configmap election method is appealing, as it is built in and does not require additional software to function. Unfortunately I was not able to understand (by briefly reading the docs) whether this uses a separate configmap or the one that is actually used for configuring flink. While the former would be okay-ish I guess, the latter will potentially cause problems, as every deployment will result in a re-creation of said configmap by helm, resetting it to whatever state the chart has defined. Apart from potentially losing data in that case, I'm not 100% certain that helm will handle that properly in every case, as I have seen too many weird issues with helm and "manually" altered kubernetes objects.

TASK DETAIL https://phabricator.wikimedia.org/T273098 ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] T276550: Missing alerts for Termbox staging and test services
JMeybohm added a comment.

It was more a matter of a day than a month (as we just upgraded the kubernetes version in staging). Also, we don't enable monitoring for staging in general, but of course errors like that should be caught at deploy time. This can currently be done by running `helmfile -e test --cleanup`, which will run the test defined in the helm chart. Unfortunately this is not done by default and must be triggered manually, which we did not do after the kubernetes upgrade. Sorry for that. I created T276949 <https://phabricator.wikimedia.org/T276949> to have the tests run automatically after the deployment. If this would fit your needs, please feel free to close this task.

TASK DETAIL https://phabricator.wikimedia.org/T276550
[Wikidata-bugs] [Maniphest] T264821: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes
JMeybohm triaged this task as "Medium" priority. TASK DETAIL https://phabricator.wikimedia.org/T264821
[Wikidata-bugs] [Maniphest] T260329: Figure what change caused the ongoing memleak on mw appservers
JMeybohm added a comment.

Looking at the values today it's pretty clear that mw1382 wins and mw1381 takes the second place. The overall memory usage looks like it's safe to leave it this way over the weekend. On Monday we should reboot the clusters again, with "cgroup.memory=nokmem".

TASK DETAIL https://phabricator.wikimedia.org/T260329
[Wikidata-bugs] [Maniphest] T260329: Figure what change caused the ongoing memleak on mw appservers
JMeybohm updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T260329
[Wikidata-bugs] [Maniphest] [Commented On] T255410: Termbox SSR connection terminated very often
JMeybohm added a comment. @Michael thanks for writing this up! If it is safe to assume that the MW -> termbox timeout is 3s, I would suggest we configure the envoys accordingly: set `tls.upstream_timeout: "3s"` in the termbox values.yaml and `timeout: "3s"` in the appservers envoy config (also lowering `keepalive` there). TASK DETAIL https://phabricator.wikimedia.org/T255410 To: JMeybohm Cc: akosiaris, JMeybohm, WMDE-leszek, Pablo-WMDE, Tarrow, Jakob_WMDE, Addshore, Aklapper, Michael, darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331