dcausse added a comment.
In T252091#6171258 <https://phabricator.wikimedia.org/T252091#6171258>, @tstarling wrote:

> Load is average queue size, if you take the currently running batch as being part of the queue. WDQS currently does not monitor the queue size. I gather (after an hour or so of research, I'm new to all this) that with some effort, KafkaPoller could obtain an estimate of the queue size by subtracting the current partition offsets from KafkaConsumer.endOffsets() <https://kafka.apache.org/25/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#endOffsets-java.util.Collection->.

This metric is available in Grafana through `kafka_burrow_partition_lag`. The problem is that for various reasons we stopped polling updates from Kafka and are now consuming the RecentChanges API instead. The issues that made us disable it have since been fixed, so I believe we could enable it again. (A rough sketch of the offset-based estimate is included below.)

In the ideal case the updater runs at full speed most of the time, and the effect of maxlag propagates fast enough that the system in place does what it was designed for: make sure users don't query and see data that is too far out of date, and that edits aren't starved for too long once the threshold is green again.

One problem that the current maxlag strategy does not address properly is a single lagged server; situations like this start to happen: F31845471: lag_wdqs.png <https://phabricator.wikimedia.org/F31845471>. Because the median across all pooled servers is used, the effect of maxlag no longer propagates fast enough: highly lagged servers only see the effect of an edit-rate slowdown that happened 10 minutes ago, while others see their queues being emptied even though they could have handled more. All this being pretty much random (spikes occur on different servers at different times), it exacerbates the oscillation even more. Was taking the max or the sum instead of the median ever evaluated? (A small illustration of the difference follows below.)

As said in a previous comment, there will always be a bottleneck somewhere. I feel that having a single fixed limit makes it a bit difficult to handle the variance in the edit rate, and could encourage us to keep tuning it to a lower value to resolve such lag issues, without knowing when the system could handle more. A solution built around Retry-After and a PID controller seems a bit more flexible to me; the main drawback is that it relies on well-behaved clients (which is currently the case). (A minimal controller sketch follows below as well.)

As for addressing the issue in the updater itself, we believe we have room for optimization by redesigning the way we perform updates. The current situation is clearly not ideal, but it can keep up with the update rate when bots are slowed down, which I hope gives us enough time to finish the work we started on this rewrite.
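For reference, here is a minimal sketch of the offset-based queue size estimate discussed above. The class and method names are hypothetical and the real KafkaPoller may track its offsets differently; it assumes the consumer already has an assignment and has polled at least once so that `position()` is defined.

```
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public final class QueueSizeEstimator {
    private QueueSizeEstimator() {}

    /**
     * Estimate the number of events still waiting to be consumed,
     * summed over all partitions currently assigned to this consumer.
     */
    public static long estimateQueueSize(KafkaConsumer<?, ?> consumer) {
        Set<TopicPartition> partitions = consumer.assignment();
        // endOffsets() returns, per partition, the offset of the next message
        // that would be written (the "high watermark").
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
        long lag = 0;
        for (TopicPartition tp : partitions) {
            // position() is the offset of the next record this consumer will read.
            long consumed = consumer.position(tp);
            lag += Math.max(0, endOffsets.get(tp) - consumed);
        }
        return lag;
    }
}
```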
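To illustrate the aggregation problem with made-up numbers (these are not real server measurements), the max reacts immediately when a single pooled server falls behind, while the median stays green:

```
import java.util.Arrays;

public final class LagAggregation {
    private LagAggregation() {}

    static double median(double[] lags) {
        double[] sorted = lags.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    static double max(double[] lags) {
        return Arrays.stream(lags).max().orElse(0.0);
    }

    public static void main(String[] args) {
        // Hypothetical lag in seconds for each pooled WDQS server;
        // one server has fallen far behind while the rest are healthy.
        double[] lags = {5, 8, 12, 9, 3600};

        System.out.println("median = " + median(lags)); // 9.0  -> maxlag stays green
        System.out.println("max    = " + max(lags));    // 3600 -> maxlag would throttle edits
    }
}
```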
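And a minimal sketch of the kind of PID controller I have in mind, mapping the measured lag to a Retry-After value. The class name, gains, target lag and clamping are placeholders that would need tuning; this is an illustration of the idea, not a worked design.

```
/**
 * Minimal proportional-integral-derivative controller that maps the
 * measured update lag (e.g. the max across pooled servers) to a
 * Retry-After value that well-behaved clients would honor.
 */
public final class LagPidController {
    private final double kp;
    private final double ki;
    private final double kd;
    private final double targetLagSeconds;
    private final double maxRetryAfterSeconds;

    private double integral;
    private double previousError;

    public LagPidController(double kp, double ki, double kd,
                            double targetLagSeconds, double maxRetryAfterSeconds) {
        this.kp = kp;
        this.ki = ki;
        this.kd = kd;
        this.targetLagSeconds = targetLagSeconds;
        this.maxRetryAfterSeconds = maxRetryAfterSeconds;
    }

    /**
     * @param measuredLagSeconds current lag sample
     * @param dtSeconds          time elapsed since the previous sample (> 0)
     * @return suggested Retry-After in seconds; 0 means no throttling needed
     */
    public long retryAfter(double measuredLagSeconds, double dtSeconds) {
        double error = measuredLagSeconds - targetLagSeconds;
        integral += error * dtSeconds;
        double derivative = (error - previousError) / dtSeconds;
        previousError = error;

        double output = kp * error + ki * integral + kd * derivative;
        // Basic anti-windup: stop accumulating the integral term while the
        // output is saturated at either end of the clamp.
        if (output > maxRetryAfterSeconds || output < 0) {
            integral -= error * dtSeconds;
        }
        return (long) Math.max(0, Math.min(maxRetryAfterSeconds, output));
    }
}
```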