dcausse added a comment.

  The ElasticaWrite job seems to be receiving roughly the same number of 
messages (150/s per partition on average) before and after the switch.
  The partitioned topic ElasticWrite, however, has been heavily backlogged since 
the switch:
  
  F34569267: Capture d’écran du 2021-07-29 10-57-33.png 
<https://phabricator.wikimedia.org/F34569267>
  
  A huge backlog of 5 million messages accumulated from Jun 28 and 
started to be absorbed from Jul 2. This matches a big bump in the processing 
rates of changeprop consumers, which is explained by a restart of some of the 
cpjobqueue pods that were behaving poorly:
  
  F34569457: Capture d’écran du 2021-07-29 15-15-03.png 
<https://phabricator.wikimedia.org/F34569457>
  
  Job timings as reported by changeprop do not suggest an increase (quite the 
opposite), but the cirrus backend logs for `send_data_write` show a clear bump 
in request_time (see P16924 
<https://phabricator.wikimedia.org/P16924>, roughly +15ms). This could perhaps 
be explained by jobrunners running in codfw now having to write to 2 distant 
elasticsearch clusters (prod-eqiad and cloudelastic) as opposed to one. Also, 
the jobqueue reports a processing time of 250 to 300ms both before and after 
the switch, so I'm not sure that +15ms between mw and elastic could cause such a 
change.
  
  The consumer group lag suggests that we lack processing power, but we don't 
seem to produce more ElasticaWrite jobs, or at least I can't find any evidence 
of it...
  The increase in processing rate could perhaps be explained by the backlog 
causing the jobqueue to consume messages in bursts?
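  
  To make the lag reasoning concrete, here is a minimal sketch of how a backlog 
evolves when production and consumption rates differ. It assumes constant rates; 
the function names and the sample rates in the usage note are hypothetical and 
not taken from the dashboards.

```python
def backlog_after(initial_backlog, produce_rate, consume_rate, seconds):
    """Backlog size after `seconds`, assuming constant rates in messages/s."""
    backlog = initial_backlog + (produce_rate - consume_rate) * seconds
    # A backlog cannot go below zero: consumers idle once they catch up.
    return max(0.0, backlog)


def seconds_to_drain(backlog, produce_rate, consume_rate):
    """Time needed to absorb `backlog`, or None if consumers never catch up."""
    surplus = consume_rate - produce_rate
    if surplus <= 0:
        return None
    return backlog / surplus
```

  For instance, `seconds_to_drain(5_000_000, 150, 200)` gives the time to absorb 
a 5 million message backlog if consumers could sustain a hypothetical 50 msg/s 
above the production rate.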
  
  So possible causes of the backlog:
  
  - more messages produced to cirrusElasticWrite after the switch?
    - the week before the switch this topic ingested 128,916,000 messages; the 
week after, 147,295,000 (+14%).
  - processing time of the job increased?
    - can't find any evidence of this except a small increase in mw <-> elastic 
timings, which is not visible in the job processing time
  - changeprop not giving enough room to this job?
    - seems to have been the case just after the switch, but then the processing 
rates went higher than usual
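  
  The weekly totals quoted above can be turned into a growth figure and an 
average topic-wide rate with a short calculation. This is only a sketch: the 
two totals come from this comment, while the variable names and the assumption 
of a uniform rate over the week are mine.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600

before = 128_916_000  # messages ingested the week before the switch
after = 147_295_000   # messages ingested the week after the switch

# Relative growth in weekly volume, in percent.
growth_pct = (after - before) / before * 100

# Average messages per second over each week, assuming a uniform rate.
rate_before = before / SECONDS_PER_WEEK
rate_after = after / SECONDS_PER_WEEK

print(f"growth: {growth_pct:.1f}%")
print(f"avg rate before: {rate_before:.1f} msg/s, after: {rate_after:.1f} msg/s")
```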

TASK DETAIL
  https://phabricator.wikimedia.org/T287563

To: dcausse
Cc: EBernhardson, dcausse, Nikki, Aklapper, Lydia_Pintscher, Invadibot, 
MPhamWMF, maantietaja, CBogen, Akuckartz, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Wikidata-bugs, aude, Gryllida, Addshore, Mbch331