dcausse added a comment.

  In T252091#6171258 <https://phabricator.wikimedia.org/T252091#6171258>, 
@tstarling wrote:
  
  > Load is average queue size, if you take the currently running batch as 
being part of the queue. WDQS currently does not monitor the queue size. I 
gather (after an hour or so of research, I'm new to all this) that with some 
effort, KafkaPoller could obtain an estimate of the queue size by subtracting 
the current partition offsets from KafkaConsumer.endOffsets() 
<https://kafka.apache.org/25/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#endOffsets-java.util.Collection->.
  
  This metric is available in Grafana through `kafka_burrow_partition_lag`; the problem is that for various reasons we stopped polling updates from Kafka and are now consuming the RecentChanges API instead. The reasons we disabled it have since been fixed, so I believe we could enable it again.
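
  For reference, here is a minimal sketch of how such an estimate could be computed from `KafkaConsumer.endOffsets()`; this is not the actual KafkaPoller code, just an illustration assuming a plain Java consumer with assigned partitions:

```java
// Sketch only: estimate the number of not-yet-consumed records ("queue size")
// by comparing the consumer's current positions against the partition end offsets.
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public final class QueueSizeEstimator {

    static long estimateQueueSize(KafkaConsumer<?, ?> consumer) {
        Set<TopicPartition> assignment = consumer.assignment();
        // endOffsets() returns, per partition, the offset of the next record to be written.
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(assignment);
        long backlog = 0;
        for (TopicPartition tp : assignment) {
            // position() is the offset of the next record this consumer will read.
            backlog += Math.max(0L, endOffsets.get(tp) - consumer.position(tp));
        }
        return backlog;
    }
}
```

  This is essentially the same quantity that `kafka_burrow_partition_lag` reports, just computed from the consumer's side instead of from committed offsets.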
  
  In the ideal case the updater runs at full speed most of the time, the effect of maxlag propagates fast enough, and the system in place does what it was designed for: make sure users don't query and see data that is too far out of date, and don't get starved for too long once the threshold is green again.
  One problem that the current maxlag strategy does not address properly is when a single server is lagged; situations like this start to happen:
  F31845471: lag_wdqs.png <https://phabricator.wikimedia.org/F31845471>
  Because the median across all pooled servers is used, the effect of maxlag no longer propagates fast enough: highly lagged servers only see the effect of an edit-rate slowdown that happened 10 minutes ago, while others see their queue being emptied even though they could have handled more. Since all of this is pretty much random (spikes happen on different servers at different times), it exacerbates the oscillation even more. Was taking the max or the sum instead of the median ever evaluated? (See the small illustration below.)
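
  To show with made-up numbers why the choice of aggregation matters when a single server is far behind:

```java
// Hypothetical lag values (not real measurements): with the median, one badly
// lagged server barely moves the reported maxlag, so throttling never kicks in for it.
import java.util.Arrays;

public final class LagAggregation {
    public static void main(String[] args) {
        double[] lagSeconds = {5, 7, 9, 12, 600}; // one server far behind the others
        double[] sorted = lagSeconds.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];    // 9s   -> maxlag looks healthy
        double max = sorted[sorted.length - 1];       // 600s -> clearly signals the lagged server
        double sum = Arrays.stream(lagSeconds).sum(); // 633s -> also reacts, but scales with pool size
        System.out.printf("median=%.0fs max=%.0fs sum=%.0fs%n", median, max, sum);
    }
}
```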
  
  As said in a previous comment, there will always be a bottleneck somewhere. I feel that having a single fixed limit makes it difficult to handle the variance in the edit rate, and could encourage us to keep tuning it to a lower value to resolve such lag issues, without ever knowing whether the system could handle more.
  A solution based on Retry-After and a PID controller seems a bit more flexible to me; the main drawback is that it relies on well-behaved clients (which is currently the case). A rough sketch of the idea follows.
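
  A minimal sketch of what I have in mind, assuming we feed the controller the observed update lag and expose the output as a Retry-After hint (the names, gains and the 60s clamp are all made up):

```java
// Sketch of a PID controller turning the gap between observed and target lag
// into a Retry-After value for well-behaved edit clients. Not an existing component.
public final class RetryAfterPid {
    private final double kp, ki, kd; // proportional, integral, derivative gains
    private double integral;
    private double previousError;

    RetryAfterPid(double kp, double ki, double kd) {
        this.kp = kp;
        this.ki = ki;
        this.kd = kd;
    }

    /** Suggested Retry-After in seconds, clamped to [0, 60]. */
    long retryAfterSeconds(double observedLagSeconds, double targetLagSeconds, double dtSeconds) {
        double error = observedLagSeconds - targetLagSeconds;
        integral += error * dtSeconds;
        double derivative = (error - previousError) / dtSeconds;
        previousError = error;
        // Positive output -> ask clients to back off; zero -> no throttling needed.
        double output = kp * error + ki * integral + kd * derivative;
        return Math.round(Math.min(60.0, Math.max(0.0, output)));
    }
}
```

  The appeal over a fixed maxlag threshold is that the back-off adapts continuously to how far the system actually is from its target, instead of oscillating around a single cut-off.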
  
  As for addressing the issue with the updater itself, we believe we have room for optimization by redesigning the way we perform updates. The current situation is clearly not ideal, but it can keep up with the update rate when bots are slowed down, which I hope gives us enough time to finish the work we started on this rewrite.

TASK DETAIL
  https://phabricator.wikimedia.org/T252091
