akosiaris added a comment.

  In T255410#6543118 <https://phabricator.wikimedia.org/T255410#6543118>, 
@Michael wrote:
  
  > @akosiaris Thank you a lot for your detailed response. I did look into 
those errors a tiny bit more to properly document them as can be now seen on 
wikitech 
<https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service#Availability_objectives_and_accepted_operational_errors>.
  >
  > In the course of that I looked at the last days and noticed some 
discrepancies to the numbers you provided above. All the following data is for 
the 7 days between 2020-10-07 00:00:00 and 2020-10-13 23:59:59.
  
  I think you just exposed some weird behavior/bug in prometheus's `increase()` 
function regarding counter resets. I've added a panel to the graph showcasing 
it. If you manually subtract the valleys from the peaks for the 3 distinct 
timeframes depicted there you get almost the same number of errors as logstash. 
It's `62-0 + 99 - 0 + 484 - 440= 170`. It's probably that last (first timewise) 
timeframe that throws prometheus off. Given that per the docs [1]
  
    It is syntactic sugar for rate(v) multiplied by the number of seconds under 
the specified time range window, and should be used primarily for human 
readability.
  
  there is probably something funny going on over the large timeframe. The 
`rate()` is also depicted in the panel and it is gradually dropping as well, 
but it's considerably higher in the first timeframe.
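  
  To make the "peaks minus valleys" idea concrete, here is a minimal Python 
sketch. It is not how prometheus implements `increase()`; the sample values 
and all function names are made up for illustration. It compares the manual 
approach with a simplified reset adjustment that assumes the counter restarts 
from zero after every drop (and ignores the extrapolation to the window 
boundaries that `rate()` also performs).

```python
from typing import List, Sequence


def split_at_resets(samples: Sequence[float]) -> List[List[float]]:
    """Split a counter series into runs; a new run starts whenever the value drops."""
    runs: List[List[float]] = []
    for value in samples:
        if runs and value >= runs[-1][-1]:
            runs[-1].append(value)
        else:
            runs.append([value])
    return runs


def peak_minus_valley(samples: Sequence[float]) -> float:
    """Manual approach: sum (last - first) over each run between resets."""
    return sum(run[-1] - run[0] for run in split_at_resets(samples))


def reset_adjusted_delta(samples: Sequence[float]) -> float:
    """Simplified reset handling: after a drop, count the new value as fresh
    increase, i.e. assume the counter restarted from zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical counter samples (oldest first), NOT the real termbox data.
samples = [100, 110, 130, 5, 20, 40, 2, 10]

manual = peak_minus_valley(samples)       # 30 + 35 + 8
adjusted = reset_adjusted_delta(samples)  # manual + first values after each reset
print(manual, adjusted)
```

  The two results differ by the first value seen after each reset, so even 
this simplified model does not have to agree with the manual subtraction.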
  
  Couple of notes though to clarify a few things.
  
  > - the Grafana SLO panel 
<https://grafana.wikimedia.org/d/JcMStTFGz/termbox-slo-panel?orgId=1&from=1602028800000&to=1602633599000>
 shows **277** errors.
  
  This is from the PoV of termbox itself. It counts the HTTP 500s termbox knows 
it emitted.
  
  > - the kubernetes logstash for Termbox SSR 
<https://logstash-next.wikimedia.org/goto/dbf13405ed13e217d271c2ce1f694ae7> has 
**171** errors in that time frame
  >   - 120 timeout errors, 51 envoy 503 errors
  >   - I excluded some 19 errors about "startup finished", that are probably 
the ones you mentioned with "not worth looking into"
  
  Same PoV but on a log level.
  
  > I was surprised by that, but noticed that there were also a similar amount 
of network errors between MediaWiki and the Termbox SSR app in that timeframe:
  >
  > - the MediaWiki (PHP) logstash 
<https://logstash.wikimedia.org/goto/995becc306bb3da55de9e321631c40d0> has 
**104** errors of Termbox being unreachable
  
  That's actually from the PoV of mediawiki. If you put this logstash dashboard 
and the termbox one side-by-side there's considerable overlap as events are 
depicted in both.
  
  > It would make sense to me if the SLO covered those network problems as 
well, as they defacto impact the availability of the service to MediaWiki. 
Also, taking those errors together, we can account for 275 of the 277 errors 
shown in the Grafana SLO panel.
  >
  > Is the understanding layed out above correct?
  
  I think it's wrong to sum the 2 logstash dashboards (in fact, it's just 
coincidence that the numbers added up to something close to 277, as that was a 
made up number from prometheus). The two dashboards show the same events from 
different points of view, so by summing them you will be double counting 
events.
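  
  As a rough illustration of the double counting, assume (hypothetically) that 
each event could be identified by a request ID visible in both dashboards. The 
IDs and the helper name below are invented for this sketch.

```python
from typing import Dict, Set


def compare_counts(termbox_ids: Set[str], mediawiki_ids: Set[str]) -> Dict[str, int]:
    """Compare a naive sum of both views with the number of distinct events."""
    return {
        "naive_sum": len(termbox_ids) + len(mediawiki_ids),
        "distinct": len(termbox_ids | mediawiki_ids),
        "seen_in_both": len(termbox_ids & mediawiki_ids),
    }


# Hypothetical request IDs, not taken from the real dashboards.
termbox_ids = {"req-1", "req-2", "req-3", "req-4"}
mediawiki_ids = {"req-3", "req-4", "req-5"}

counts = compare_counts(termbox_ids, mediawiki_ids)
print(counts)
```

  The naive sum exceeds the distinct count by exactly the number of events 
seen in both views.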
  
  [1] https://prometheus.io/docs/prometheus/latest/querying/functions/#increase

TASK DETAIL
  https://phabricator.wikimedia.org/T255410

To: Michael, akosiaris
Cc: toan, Lucas_Werkmeister_WMDE, Sakretsu, akosiaris, JMeybohm, WMDE-leszek, 
Pablo-WMDE, Tarrow, Jakob_WMDE, Addshore, Aklapper, Michael, wkandek, 
Akuckartz, Iflorez, darthmon_wmde, alaa_wmde, Nandana, jijiki, Lahi, Gq86, 
GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331, Dzahn