[Wikidata-bugs] [Maniphest] T350784: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator

2024-04-09 Thread JMeybohm
JMeybohm updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T350784

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: bking, JMeybohm
Cc: dcausse, JMeybohm, Aklapper, bking, Danny_Benjafield_WMDE, 
Isabelladantes1983, Themindcoder, Adamm71, S8321414, Jersione, Hellket777, 
LisafBia6531, Astuthiodit_1, AWesterinen, 786, BTullis, Biggs657, 
karapayneWMDE, Invadibot, maantietaja, Juan90264, Alter-paule, Beast1978, 
ItamarWMDE, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, 
Nandana, Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, 
Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, 
KimKelting, merbst, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, 
Neuronton, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, 
Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T362084: shellbox-constraints returning 500 on preg_match error

2024-04-09 Thread JMeybohm
JMeybohm added a comment.


  In T362084#9700057 <https://phabricator.wikimedia.org/T362084#9700057>, 
@Lucas_Werkmeister_WMDE wrote:
  
  > Can someone clarify what the problem here is? From WBQC’s perspective, it’s 
totally expected that some of these regex checks will fail (though there’s some 
confusion about which shellbox errors we should or shouldn’t try to catch, see 
T304084 <https://phabricator.wikimedia.org/T304084> and especially 
T304084#8561863 <https://phabricator.wikimedia.org/T304084#8561863>). But we 
might need to make some changes to keep the service mesh monitoring happy? 
(“exceeding retry limit” also sounds concerning – we don’t really want these 
requests to be retried, I think.)
  
  The reason I've opened the task is that the stream of errors/retries looked 
fishy and I wanted to make sure we have an understanding of what is going on. 
From my naive POV I would argue that a preg_match error should not be 
classified as an internal server error, as it results from (invalid) user 
input. If this is the desired response in that case, we should probably, as you 
said, configure the service-mesh to not retry requests on 500 errors - as 
retrying does not make any sense then.
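  
  To illustrate the retry point, here is a minimal sketch of a retry decision 
that skips plain 500 responses. It is not the service mesh's real 
configuration: the name `should_retry`, the set of retryable statuses and the 
retry limit are all assumptions made up for this example.

```python
# Hypothetical retry policy. The status set and the limit are illustrative
# assumptions, not the values used by the actual service mesh.
RETRYABLE_STATUSES = {502, 503, 504}  # upstream/gateway availability errors


def should_retry(status, attempt, max_retries=2):
    """Return True if a response with this status should be retried.

    `attempt` counts the retries already made. A plain 500 is never retried,
    because repeating a request with the same invalid input gives the same
    error again.
    """
    if attempt >= max_retries:
        return False
    return status in RETRYABLE_STATUSES
```

  With a policy like this, a 500 caused by an invalid regex would be reported 
once instead of being repeated until the retry limit is exceeded.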

TASK DETAIL
  https://phabricator.wikimedia.org/T362084

To: JMeybohm
Cc: Lucas_Werkmeister_WMDE, Reedy, akosiaris, Clement_Goubert, JMeybohm, 
Aklapper, Danny_Benjafield_WMDE, Isabelladantes1983, Themindcoder, Kappakayala, 
Adamm71, S8321414, Jersione, Hellket777, LisafBia6531, Astuthiodit_1, 786, 
Arnoldokoth, Biggs657, karapayneWMDE, Invadibot, maantietaja, Juan90264, 
wkandek, Alter-paule, Beast1978, ItamarWMDE, Un1tY, Akuckartz, Hook696, 
darthmon_wmde, Eihel, Rosalie_WMDE, Kent7301, joker88john, CucyNoiD, Nandana, 
jijiki, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, 
GoranSMilovanovic, QZanden, KimKelting, Esc3300, LawExplorer, Lewizho99, 
Maathavan, _jensen, rosalieper, Agabi10, Neuronton, Scott_WUaS, Verdy_p, abian, 
Wikidata-bugs, aude, Jdforrester-WMF, Mbch331, Jay8g, Legoktm


[Wikidata-bugs] [Maniphest] T350784: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator

2023-11-30 Thread JMeybohm
JMeybohm updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T350784

To: JMeybohm
Cc: dcausse, JMeybohm, Aklapper, bking, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, BTullis, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, 
Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes

2023-02-02 Thread JMeybohm
JMeybohm added a comment.


  In T293063#8582600 <https://phabricator.wikimedia.org/T293063#8582600>, 
@dcausse wrote:
  
  > Hey, clarified this a bit, renamed it to "Hard depool/re-pool", yes in this 
method the jobs should start right after the helm deploy, the jar is stored in 
swift so no need to deploy it manually.
  
  Cool, thanks. That would make it hands-off for anybody but sre/serviceops 
which ofc would be nice.
  
  Anyhow. AIUI this process will be more or less the same for flink deployments 
managed by the flink operator. It would be nice if you could verify this during 
your tests with the operator (I'm happy to help/pair ofc.) or check whether 
there is maybe even a better option in flink-operator world.

TASK DETAIL
  https://phabricator.wikimedia.org/T293063

To: JMeybohm
Cc: akosiaris, RKemper, Gehel, bking, JMeybohm, Jelto, Aklapper, jijiki, 
dcausse, Astuthiodit_1, AWesterinen, Arnoldokoth, karapayneWMDE, Invadibot, 
MPhamWMF, GeminiAgaloos, maantietaja, wkandek, CBogen, ItamarWMDE, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, 
Mbch331, Dzahn


[Wikidata-bugs] [Maniphest] T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes

2023-02-02 Thread JMeybohm
JMeybohm added a comment.


  Hey @dcausse, I'm reading this again because of the upcoming k8s 1.23 upgrade 
and was wondering:
  In the "To restore:" section of "Alternate actions (not fully untested):" - do 
we need to start the job somehow as well, specifying which jar file to use? Or 
is that information part of the configmaps/savepoint, so the job can start 
automatically without submitting a jar?

TASK DETAIL
  https://phabricator.wikimedia.org/T293063

To: JMeybohm
Cc: akosiaris, RKemper, Gehel, bking, JMeybohm, Jelto, Aklapper, jijiki, 
dcausse, Astuthiodit_1, AWesterinen, Arnoldokoth, karapayneWMDE, Invadibot, 
MPhamWMF, GeminiAgaloos, maantietaja, wkandek, CBogen, ItamarWMDE, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, 
Mbch331, Dzahn


[Wikidata-bugs] [Maniphest] T326409: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model

2023-02-02 Thread JMeybohm
JMeybohm added a project: serviceops-radar.
Restricted Application added a project: wdwb-tech.

TASK DETAIL
  https://phabricator.wikimedia.org/T326409

To: JMeybohm
Cc: BTullis, JMeybohm, gmodena, Ottomata, bking, Aklapper, dcausse, 
Themindcoder, Adamm71, Jersione, Hellket777, LisafBia6531, Astuthiodit_1, 
AWesterinen, 786, Arnoldokoth, Biggs657, karapayneWMDE, Invadibot, MPhamWMF, 
maantietaja, Juan90264, wkandek, Alter-paule, Beast1978, CBogen, ItamarWMDE, 
Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, 
Namenlos314, jijiki, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, 
Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, Scott_WUaS, 
Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, 
Mbch331


[Wikidata-bugs] [Maniphest] T326409: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model

2023-02-02 Thread JMeybohm
JMeybohm updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T326409

To: JMeybohm
Cc: BTullis, JMeybohm, gmodena, Ottomata, bking, Aklapper, dcausse, 
Themindcoder, Adamm71, Jersione, Hellket777, LisafBia6531, Astuthiodit_1, 
AWesterinen, 786, Biggs657, karapayneWMDE, Invadibot, MPhamWMF, maantietaja, 
Juan90264, Alter-paule, Beast1978, CBogen, ItamarWMDE, Un1tY, Akuckartz, 
Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, 
Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, 
_jensen, rosalieper, Neuronton, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes

2022-08-30 Thread JMeybohm
JMeybohm updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T293063

To: JMeybohm
Cc: RKemper, Gehel, bking, JMeybohm, Jelto, Aklapper, jijiki, dcausse, 
Astuthiodit_1, AWesterinen, Arnoldokoth, karapayneWMDE, Invadibot, MPhamWMF, 
GeminiAgaloos, maantietaja, wkandek, CBogen, ItamarWMDE, Akuckartz, Nandana, 
Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, 
EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, 
jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331, Dzahn


[Wikidata-bugs] [Maniphest] T301147: The WDQS streaming updater went unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00)

2022-03-31 Thread JMeybohm
JMeybohm added a comment.


  In T301147#7821813 <https://phabricator.wikimedia.org/T301147#7821813>, 
@dcausse wrote:
  
  > The additional PODs won't be used as a flink job does not automatically 
scale so it would be a pure waste of resources (2.5G of reserved mem per 
additional POD). It would help I guess to improve redundancy in this scenario 
only if k8s assigns every POD to a distinct machine, in which case even with a 
single machine misbehaving flink would have enough redundancy to allocate the 
job to the spare POD. If k8s does do allocation randomly or that there are not 
enough k8s worker nodes (1 spare POD in our case would mean spreading the PODs 
over 8 different machines) then it's probably not worth the waste of resources.
  
  K8s will try to schedule replicas of one Deployment onto different Nodes by 
default and we can also force it to do so. But tbh I would not do that in this 
case, as in most cases it should be just fine. I expect this situation to be a 
rare exception (and I probably jinxed that now) as we have not seen it before 
and it has not happened again. So as long as it's not super critical, I would 
refrain from trying to optimize the workload for this type of failure. 
Ultimately this should be taken care of by k8s so we should invest there - 
especially if it should happen again.
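  
  For reference, "forcing" the spread would typically mean a required pod 
anti-affinity rule on the node hostname. The sketch below only builds such a 
stanza as a Python dict to show its shape. The `app` label key and the 
`flink-taskmanager` value are hypothetical and not taken from the 
rdf-streaming-updater chart.

```python
import json


def required_anti_affinity(app_label):
    """Build an affinity stanza that forbids two matching Pods on one node.

    The "app" label key is an assumption for illustration; a real chart may
    select its Pods by different labels.
    """
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    # One Pod per distinct node hostname.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


# Hypothetical label value, used only to render the example.
print(json.dumps(required_anti_affinity("flink-taskmanager"), indent=2))
```

  A required rule like this leaves Pods unschedulable when there are fewer 
eligible nodes than replicas, which is one more reason not to force it here.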

TASK DETAIL
  https://phabricator.wikimedia.org/T301147

To: JMeybohm
Cc: elukey, akosiaris, Gehel, RKemper, bking, toan, Addshore, JMeybohm, 
Michael, Aklapper, dcausse, Astuthiodit_1, karapayneWMDE, Invadibot, MPhamWMF, 
maantietaja, CBogen, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T301147: The WDQS streaming updater went unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00)

2022-03-31 Thread JMeybohm
JMeybohm added a comment.


  > To be discussed with service ops:
  >
  > - Investigate and address the reasons why after a node failure k8s did not 
fulfill its promise of making sure that the rdf-streaming-updater deployment 
have 6 working replicas
  
  The problem was more that the node did not really fail (to its full extent). 
It was heavily overloaded (for an unknown reason) and that's potentially why 
containers/processes running there seemed dead. But from K8s' perspective the 
Pods were still running, and a new pod was scheduled as soon as I power cycled 
the node (i.e. K8s was able to detect a mismatch between desired and existing 
replicas).
  
  > - If the above is not possible could we mitigate this problem by 
over-allocating resources (increase the number of replicas) to the deployment 
to increase the chances of proper recovery if this situation happens again?
  
  If that makes sense from your POV you could do that ofc. I can't speak on how 
problematic this situation was compared to the potential waste of resources 
another pod means. But if the current workload is already maxing out the 
capacity of the 6 replicas you have, maybe bumping that to 7 might be smart 
anyways to account for peaks?
  
  In T301147#7821422 <https://phabricator.wikimedia.org/T301147#7821422>, 
@dcausse wrote:
  
  > @JMeybohm do you see any additional action items that would improve the 
resilience of k8s in such scenario?
  
  Unfortunately we don't have any data on what went wrong on the node. I think 
T277876 <https://phabricator.wikimedia.org/T277876> would be a step in the 
right direction but I also doubt it would have fully prevented this issue 
(ultimately I can't say).

TASK DETAIL
  https://phabricator.wikimedia.org/T301147

To: JMeybohm
Cc: elukey, akosiaris, Gehel, RKemper, bking, toan, Addshore, JMeybohm, 
Michael, Aklapper, dcausse, Astuthiodit_1, karapayneWMDE, Invadibot, MPhamWMF, 
maantietaja, CBogen, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T301147: The WDQS streaming updater went unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00)

2022-02-08 Thread JMeybohm
JMeybohm added a comment.


  In T301147#7689837 <https://phabricator.wikimedia.org/T301147#7689837>, 
@dcausse wrote:
  
  > @JMeybohm we're still investigating why the application did not properly 
recover while kubernetes1014 went down but if you have ideas on the two 
questions in the ticket description this would be very helpful, thanks!
  
  Unfortunately I'm not exactly sure what happened to the node. What I know is 
that the system load surged (potentially due to high iowait), leaving running 
processes practically starving, but the system was still responding to ICMP 
and kubernetes status heartbeats still (mostly) worked, leaving the node 
flipping between Ready/NotReady state.
  That means the node was not actually down from k8s POV, which is why no new 
Pods were created until I drained the node, or rather until I power cycled it 
(as evicting pods was actually hanging as well, since k8s tries to be nice and 
the node was still in its overloaded state).

TASK DETAIL
  https://phabricator.wikimedia.org/T301147

To: JMeybohm
Cc: Addshore, JMeybohm, Michael, Aklapper, dcausse, Invadibot, MPhamWMF, 
maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T280485: Additional capacity on the k8s Flink cluster for WCQS updater

2021-11-17 Thread JMeybohm
JMeybohm added a comment.


  I'd opt for "reuse the same [flink] cluster" from the perspective that we 
treat this snowflaky-ish in the k8s clusters. So less flink-clusters means less 
snowflakes (at some point it does become a snowball, right?).

TASK DETAIL
  https://phabricator.wikimedia.org/T280485

To: Gehel, JMeybohm
Cc: JMeybohm, dcausse, akosiaris, Zbyszko, Aklapper, RKemper, Gehel, MPhamWMF, 
wkandek, CBogen, Namenlos314, jijiki, Gq86, Lucas_Werkmeister_WMDE, EBjune, 
merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Dzahn


[Wikidata-bugs] [Maniphest] T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes

2021-11-09 Thread JMeybohm
JMeybohm added subscribers: Jelto, JMeybohm.
JMeybohm added a comment.


  @dcausse IIRC we said that "something in the areas of hours" would be 
considered a "short maintenance" and thus would not need any additional actions 
to be carried out, right?
  As part of T251305 <https://phabricator.wikimedia.org/T251305> we will 
re-create the helm release of flink in both datacenters (one after the other 
ofc.) and that would mean flink will be down for a couple of minutes. If my 
memory and understanding are still intact, the checkpoint/tombstone metadata is 
not part of the helm release itself (it's in those flink managed configmaps). 
So it should survive purging and recreating the helm release.
  @Jelto has already done that for the staging flink release. If you have the 
chance it would be nice if you could double check that it is still working as 
expected.
  
  Besides that I tried to understand what would need to be done for a "longer 
downtime" of k8s and it's not exactly clear to me. Could we have a dedicated 
section for that on the wikitech page? IIRC that also needed a change to WDQS 
itself.

TASK DETAIL
  https://phabricator.wikimedia.org/T293063

To: JMeybohm
Cc: JMeybohm, Jelto, Aklapper, jijiki, dcausse, Invadibot, MPhamWMF, 
GeminiAgaloos, maantietaja, wkandek, CBogen, Akuckartz, Nandana, Namenlos314, 
Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Addshore, Mbch331, Dzahn


[Wikidata-bugs] [Maniphest] T287443: Flink jobmanager and taskmanager cannot talk to the k8s api server

2021-07-28 Thread JMeybohm
JMeybohm closed this task as "Resolved".
JMeybohm added a comment.


  Thanks, closing then.

TASK DETAIL
  https://phabricator.wikimedia.org/T287443

To: JMeybohm
Cc: JMeybohm, dcausse, Aklapper, Biggs657, Invadibot, Lalamarie69, MPhamWMF, 
maantietaja, Juan90264, wkandek, Alter-paule, Beast1978, CBogen, Un1tY, 
Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, 
jijiki, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, 
Addshore, Mbch331, Dzahn


[Wikidata-bugs] [Maniphest] T287443: Flink jobmanager and taskmanager cannot talk to the k8s api server

2021-07-28 Thread JMeybohm
JMeybohm added a comment.


  That is because your application is reading default kubernetes environment 
variables which carry the ClusterIP of `kubernetes.default.svc.cluster.local` 
instead of its name. Unfortunately we don't have the ClusterIP in the 
certificate on the actual servers.
  
  Please don't set `kubestagemaster.svc.eqiad.wmnet` as that will only work on 
one cluster. If flink allows you to override the API server's hostname, please 
point it to `kubernetes.default.svc.cluster.local` (which works transparently 
in all clusters). If flink does not allow overriding the hostname, please see 
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/eventrouter/templates/_helpers.tpl#57
 and 
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/eventrouter/templates/deployment.yaml#38
 for a workaround.
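  
  The difference between the two addressing modes can be sketched as follows. 
This is an illustration only and not how flink or the linked eventrouter chart 
does it: the helper `api_server_url` and its `prefer_name` switch are 
hypothetical. `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` are the 
standard variables Kubernetes injects into every Pod.

```python
import os

# Cluster-internal DNS name of the API server, identical in every cluster.
API_SERVER_NAME = "kubernetes.default.svc.cluster.local"


def api_server_url(env=None, prefer_name=True):
    """Return the API server URL a client would connect to.

    With prefer_name=True the DNS name is used, so the certificate check can
    match on the name. With prefer_name=False the injected ClusterIP is used,
    which is the case that fails when the IP is not in the certificate.
    """
    env = os.environ if env is None else env
    port = env.get("KUBERNETES_SERVICE_PORT", "443")
    if prefer_name:
        host = API_SERVER_NAME
    else:
        host = env.get("KUBERNETES_SERVICE_HOST", API_SERVER_NAME)
    return f"https://{host}:{port}"
```

  For example, with `KUBERNETES_SERVICE_HOST=10.0.0.1` (a made-up address) the 
default call still returns the name-based URL, while `prefer_name=False` 
returns the IP-based one.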

TASK DETAIL
  https://phabricator.wikimedia.org/T287443

To: JMeybohm
Cc: JMeybohm, dcausse, Aklapper, Invadibot, MPhamWMF, maantietaja, wkandek, 
CBogen, Akuckartz, Nandana, Namenlos314, jijiki, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Addshore, Mbch331, Dzahn


[Wikidata-bugs] [Maniphest] T287443: Flink jobmanager and taskmanager cannot talk to the k8s api server

2021-07-27 Thread JMeybohm
JMeybohm claimed this task.
JMeybohm added a comment.


  Looking into this.
  Problem is that we currently do not allow Pods to access the Kubernetes API 
servers (Egress rule is missing) and it's not super trivial to allow that in a 
transparent way (i.e. without having to declare the API servers' IPs in 
Kubernetes).

TASK DETAIL
  https://phabricator.wikimedia.org/T287443

To: JMeybohm
Cc: JMeybohm, dcausse, Aklapper, Invadibot, MPhamWMF, maantietaja, wkandek, 
CBogen, Akuckartz, Nandana, Namenlos314, jijiki, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Addshore, Mbch331, Dzahn


[Wikidata-bugs] [Maniphest] T285219: cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503

2021-07-19 Thread JMeybohm
JMeybohm added a subscriber: RLazarus.
JMeybohm added a comment.


  Picking up from the IRC conversation yesterday: @RLazarus figured that the 
response body looks like it is 
https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/master/errorpages/503.html
  At the time this issue was opened (June 21) we did have some database issues, 
so the increased rate of 503s from apiservers is most likely due to that.
  
  Looking at the last two weeks, the picture has changed since June, with only 
a handful of requests failing for cxserver, most of them due to upstream 
connection failure ("UF" in the response flags field). Those errors might 
happen from time to time due to the service-proxy creating persistent 
connections which then might get closed server side, or due to some network 
issues. But as that is happening at a very low rate, we have not dug further 
into that so far.

TASK DETAIL
  https://phabricator.wikimedia.org/T285219

To: JMeybohm
Cc: RLazarus, akosiaris, Addshore, JMeybohm, santhosh, Nikerabbit, Aklapper, 
KartikMistry, Invadibot, UOzurumba, PallaviPatke, maantietaja, wkandek, 
Rileych, Nintendofan885, Akuckartz, 50019062, Nandana, jijiki, Lahi, Gq86, 
GoranSMilovanovic, chapulina, QZanden, Alfa80, LawExplorer, _jensen, 
rosalieper, Soum213, Taiwania_Justo, Nizil, Scott_WUaS, Ixocactus, 
Wikidata-bugs, aude, Amire80, Jsahleen, Arrbee, Mbch331, Dzahn


[Wikidata-bugs] [Maniphest] T273098: High Availability Flink

2021-04-14 Thread JMeybohm
JMeybohm added a comment.


  I do see that using the configmap election method is appealing as it is built 
in and does not require additional software to function. Unfortunately I was 
not able to understand (by briefly reading the docs) if this uses a separate 
configmap or the one that is actually used for configuring flink.
  While the former would be okay-ish I guess, the latter will potentially cause 
problems as every deployment will result in a re-creation of said configmap by 
helm, resetting it to whatever state the chart has defined.
  Apart from potentially losing data in that case I'm not 100% certain that 
helm will handle that properly in every case as I have seen too many weird 
issues with helm and "manually" altered kubernetes objects.

TASK DETAIL
  https://phabricator.wikimedia.org/T273098

To: Mstyles, JMeybohm
Cc: Mstyles, dcausse, JMeybohm, jijiki, Aklapper, Gehel, akosiaris, Invadibot, 
MPhamWMF, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] T276550: Missing alerts for Termbox staging and test services

2021-03-09 Thread JMeybohm
JMeybohm added a comment.


  It was more a matter of a day than a month (as we just upgraded the kubernetes 
version in staging). Also we don't enable monitoring for staging in general, 
but of course errors like that should be caught at deploy time. This can 
currently be done by running `helmfile -e  test --cleanup` which will run 
the test defined in the helm chart. Unfortunately this is not done by default 
and must be triggered manually, which we did not do after the kubernetes 
upgrade - sorry for that.
  I created T276949 <https://phabricator.wikimedia.org/T276949> to have the 
tests run automatically after the deployment. If this would fit your needs, 
please feel free to close this task.

TASK DETAIL
  https://phabricator.wikimedia.org/T276550

To: JMeybohm
Cc: JMeybohm, Tarrow, Addshore, Aklapper, Jakob_WMDE, maantietaja, wkandek, 
Akuckartz, darthmon_wmde, Nandana, jijiki, Lahi, Gq86, GoranSMilovanovic, 
QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, abian, Wikidata-bugs, 
aude, Lydia_Pintscher, Mbch331, Dzahn


[Wikidata-bugs] [Maniphest] T264821: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes

2020-10-13 Thread JMeybohm
JMeybohm triaged this task as "Medium" priority.

TASK DETAIL
  https://phabricator.wikimedia.org/T264821

To: JMeybohm
Cc: Michael, RhinosF1, Joe, LSobanski, Addshore, Ladsgroup, RLazarus, 
Marostegui, Aklapper, CDanis, lmata, wkandek, JMeybohm, Akuckartz, 
darthmon_wmde, Legado_Shulgin, Nandana, jijiki, Davinaclare77, Qtn1293, 
Techguru.pc, Lahi, Gq86, GoranSMilovanovic, Th3d3v1ls, Hfbn0, QZanden, 
LawExplorer, Zppix, _jensen, rosalieper, Scott_WUaS, Wong128hk, Wikidata-bugs, 
aude, faidon, Mbch331, Rxy, Jay8g, fgiunchedi, Dzahn


[Wikidata-bugs] [Maniphest] T260329: Figure what change caused the ongoing memleak on mw appservers

2020-08-14 Thread JMeybohm
JMeybohm added a comment.


  Looking at the values today it's pretty clear that mw1382 wins and mw1381 
takes the second place.
  The overall memory usage looks like it's safe to leave it this way over the 
weekend. On Monday we should reboot the clusters again, with 
"cgroup.memory=nokmem".

TASK DETAIL
  https://phabricator.wikimedia.org/T260329

To: JMeybohm
Cc: JMeybohm, Ladsgroup, Tarrow, Addshore, CDanis, Aklapper, jijiki, 
ArielGlenn, RhinosF1, Joe, lmata, wkandek, Akuckartz, darthmon_wmde, WDoranWMF, 
holger.knust, EvanProdromou, Legado_Shulgin, Nandana, Klaas_Z4us_V, 
Davinaclare77, Qtn1293, Techguru.pc, Lahi, Gq86, GoranSMilovanovic, Th3d3v1ls, 
Hfbn0, QZanden, LawExplorer, Zppix, elukey, _jensen, rosalieper, Agabi10, 
Scott_WUaS, Pchelolo, Wong128hk, Wikidata-bugs, aude, faidon, Mbch331, Rxy, 
Jay8g, fgiunchedi, Dzahn


[Wikidata-bugs] [Maniphest] T260329: Figure what change caused the ongoing memleak on mw appservers

2020-08-13 Thread JMeybohm
JMeybohm updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T260329

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: JMeybohm
Cc: Ladsgroup, Tarrow, Addshore, CDanis, Aklapper, jijiki, ArielGlenn, 
RhinosF1, Joe, lmata, wkandek, JMeybohm, Akuckartz, darthmon_wmde, WDoranWMF, 
holger.knust, EvanProdromou, Legado_Shulgin, Nandana, Klaas_Z4us_V, 
Davinaclare77, Qtn1293, Techguru.pc, Lahi, Gq86, GoranSMilovanovic, Th3d3v1ls, 
Hfbn0, QZanden, LawExplorer, Zppix, elukey, _jensen, rosalieper, Agabi10, 
Scott_WUaS, Pchelolo, Wong128hk, Wikidata-bugs, aude, faidon, Mbch331, Rxy, 
Jay8g, fgiunchedi, Dzahn
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T255410: Termbox SSR connection terminated very often

2020-06-16 Thread JMeybohm
JMeybohm added a comment.


  @Michael thanks for writing this up!
  
  So, if it is safe to assume the MW -> termbox timeout is 3s, I would suggest 
we configure the envoys accordingly by setting `tls.upstream_timeout: "3s"` in 
the termbox values.yaml as well as `timeout: "3s"` in the appservers envoy 
config (also lowering `keepalive` there).
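
  One way to sanity-check this kind of alignment is sketched below: given the caller's timeout, list every hop whose configured timeout is longer. This is only an illustration. The function names, the hop labels, and the `"65s"` value in the usage note are hypothetical; only the `"3s"` values come from the suggestion above.

```python
def parse_seconds(value: str) -> float:
    """Parse a duration such as "3s" or "500ms" into seconds."""
    # Check "ms" first, because "500ms" also ends with "s".
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    raise ValueError(f"unsupported duration: {value!r}")


def hops_exceeding(caller_timeout: str, hop_timeouts: dict[str, str]) -> list[str]:
    """Return, sorted by name, the hops whose timeout exceeds the caller's."""
    limit = parse_seconds(caller_timeout)
    return sorted(
        name
        for name, timeout in hop_timeouts.items()
        if parse_seconds(timeout) > limit
    )
```

  For example, `hops_exceeding("3s", {"termbox tls.upstream_timeout": "3s", "appservers envoy timeout": "65s"})` flags only the second entry, which is the one that would need lowering.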

TASK DETAIL
  https://phabricator.wikimedia.org/T255410

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: JMeybohm
Cc: akosiaris, JMeybohm, WMDE-leszek, Pablo-WMDE, Tarrow, Jakob_WMDE, Addshore, 
Aklapper, Michael, darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, 
QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, 
Lydia_Pintscher, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs