akosiaris added a comment.
In T255410#6543118 <https://phabricator.wikimedia.org/T255410#6543118>, @Michael wrote:

> @akosiaris Thank you a lot for your detailed response. I did look into those errors a tiny bit more to properly document them as can be now seen on wikitech <https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service#Availability_objectives_and_accepted_operational_errors>.
>
> In the course of that I looked at the last days and noticed some discrepancies to the numbers you provided above. All the following data is for the 7 days between 2020-10-07 00:00:00 and 2020-10-13 23:59:59.

I think you just exposed some weird behavior/bug in prometheus's `increase()` function regarding counter resets. I've added a panel to the graph showcasing it. If you manually subtract the valleys from the peaks for the 3 distinct timeframes depicted there, you get almost the same number of errors as logstash. It's `62-0 + 99 - 0 + 484 - 440= 170`. It's probably that last (first timewise) timeframe that throws prometheus off. Given that, per the docs [1], `increase()`

  "is syntactic sugar for rate(v) multiplied by the number of seconds under the specified time range window, and should be used primarily for human readability"

there is probably something funny going on over the large timeframe. The `rate()` is also depicted in the panel; it is gradually dropping as well, but it is considerably higher in the first timeframe.

A couple of notes, though, to clarify a few things.

> - the Grafana SLO panel <https://grafana.wikimedia.org/d/JcMStTFGz/termbox-slo-panel?orgId=1&from=1602028800000&to=1602633599000> shows **277** errors.

This is from the PoV of termbox itself. It counts the HTTP 500s termbox knows it emitted.

> - the kubernetes logstash for Termbox SSR <https://logstash-next.wikimedia.org/goto/dbf13405ed13e217d271c2ce1f694ae7> has **171** errors in that time frame
>   - 120 timeout errors, 51 envoy 503 errors
>   - I excluded some 19 errors about "startup finished", that are probably the ones you mentioned with "not worth looking into"

Same PoV, but on a log level.

> I was surprised by that, but noticed that there were also a similar amount of network errors between MediaWiki and the Termbox SSR app in that timeframe:
>
> - the MediaWiki (PHP) logstash <https://logstash.wikimedia.org/goto/995becc306bb3da55de9e321631c40d0> has **104** errors of Termbox being unreachable

That's actually from the PoV of mediawiki. If you put this logstash dashboard and the termbox one side by side, there is considerable overlap, as the same events are depicted in both.

> It would make sense to me if the SLO covered those network problems as well, as they defacto impact the availability of the service to MediaWiki. Also, taking those errors together, we can account for 275 of the 277 errors shown in the Grafana SLO panel.
>
> Is the understanding layed out above correct?

I think it's wrong to sum the 2 logstash dashboards (in fact, it's just a coincidence that the numbers added up to something close to 277, as that was a made-up number from prometheus). They are of a different nature, so summing them double counts events.
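As a rough illustration of the manual peak-minus-valley calculation mentioned above for the `increase()` discrepancy, here is a minimal Python sketch. The sample values and the function name `segment_increases` are hypothetical and are not taken from the termbox metrics; the sketch only assumes that a counter reset shows up as a drop between two consecutive samples. It is not how prometheus itself computes `increase()`.

```python
from typing import List, Sequence


def segment_increases(samples: Sequence[int]) -> List[int]:
    """Split counter samples at each reset and return peak - valley per segment.

    A reset is assumed whenever a sample is lower than the one before it.
    """
    if not samples:
        return []
    increases = []
    valley = peak = samples[0]
    for value in samples[1:]:
        if value < peak:
            # The counter dropped: close the current segment and start a new one.
            increases.append(peak - valley)
            valley = value
        peak = value
    # Close the final segment.
    increases.append(peak - valley)
    return increases


# Hypothetical counter readings with two resets (not real termbox data).
readings = [10, 14, 20, 0, 3, 7, 0, 5]
per_segment = segment_increases(readings)
total = sum(per_segment)
print(per_segment, total)
```

Summing the per-segment differences gives a total that can be compared against a log-based count for the same window, which is the kind of cross-check described above.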
[1] https://prometheus.io/docs/prometheus/latest/querying/functions/#increase

TASK DETAIL
  https://phabricator.wikimedia.org/T255410