akosiaris added a comment.

  In T255410#6550492 <https://phabricator.wikimedia.org/T255410#6550492>, 
@Michael wrote:
  
  > That seems very strange. I would have expected the //error rate// to be 
calculated by `(number of errors / number of total requests)` for the given 
timeframe. How does it actually work? Something like `(number of milliseconds 
with error/number of total milliseconds in timeframe)`?
  
  You can say that again :-). The main formula is what you described. In 
Prometheus terms, it's
  
    
    sum(increase(service_runner_request_duration_seconds_count{service="termbox", prometheus="k8s", uri="termbox", status="500"}[$__range]))
    /
    sum(increase(service_runner_request_duration_seconds_count{service="termbox", prometheus="k8s", uri="termbox", status=~"200|500"}[$__range]))
  
  and that's what the left panel in that dashboard has. The issue isn't with 
the division, but rather with the `increase()` function (the right-hand 
panel is just the numerator of the above equation), so it's
  
    sum(
      increase(
        service_runner_request_duration_seconds_count{service="termbox", prometheus="k8s", uri="termbox", status="500"}[$__range]
      )
    )
  
  The `sum()` is there to sum across all the instances of termbox in that 
timeframe, and the `increase()` calculates the change in that quantity from 
the start to the end of the timeframe. Normally this works, but in this case 
it has failed. My guess as to what happened is that, due to 2 deployments 
(you can use the main termbox dashboard to spot them), termbox pods were 
destroyed and new ones started. So the metric series changed and the 
internal counter-reset detection of `rate()`, which `increase()` relies on, 
could not function. If you target a week without deploys, you won't witness 
that.
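
  As a side note, PromQL's `resets()` function can show whether in-series 
counter resets actually happened in a given window (a sketch, reusing the 
selectors from above). If it reports zero while the graph still misbehaves, 
the culprit is series churn (old pods' series ending, new pods' series 
starting) rather than an in-place counter reset:

    # number of counter resets per termbox series within the dashboard range
    resets(service_runner_request_duration_seconds_count{service="termbox", prometheus="k8s", uri="termbox", status="500"}[$__range])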
  
  If you are interested in more details, there's more info about counters 
and how they work in Prometheus at 
https://www.robustperception.io/how-does-a-prometheus-counter-work
  
  It also means we'll have to figure out how to better calculate the SLO 
across large timeframes.
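
  One possible direction (just a sketch; the recording rule below is 
hypothetical and does not exist yet) is to record short-window increases and 
then add those up on the dashboard side, so that a deploy only affects the 
one short window it falls into:

    # Hypothetical recording rule, evaluated every 1m so the windows tile the range:
    #   record: termbox:request_errors:increase1m
    #   expr:   sum(increase(service_runner_request_duration_seconds_count{service="termbox", prometheus="k8s", uri="termbox", status="500"}[1m]))

    # Dashboard query: sum the per-minute increases instead of one big increase()
    sum_over_time(termbox:request_errors:increase1m[$__range])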
