Ladsgroup added a comment.
In T252091#6171258 <https://phabricator.wikimedia.org/T252091#6171258>, @tstarling wrote:

> I hope you don't mind if I contradict my previous comment a bit, since my thinking is still evolving on this.

No worries at all. I'm also changing my mind quickly here.

> One problem with using lag as the metric is that it doesn't go negative, so the integral will not be pulled down while the service is idle. We could subtract a target lag, say 1 minute, but that loses some of the supposed benefit of including an integral term. A better metric would be updater load, i.e. demand/capacity. When the load is more than 100%, the lag increases at a rate of 1 second per second, but there's no further information in there as to how heavily overloaded it is. When the load is less than 100%, lag decreases until it reaches zero. While it's decreasing, the slope tells you something about how underloaded it is, but once it hits zero, you lose that information.
>
> Load is average queue size, if you take the currently running batch as being part of the queue. WDQS currently does not monitor the queue size. I gather (after an hour or so of research, I'm new to all this) that with some effort, KafkaPoller could obtain an estimate of the queue size by subtracting the current partition offsets from KafkaConsumer.endOffsets() <https://kafka.apache.org/25/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#endOffsets-java.util.Collection->.
>
> Failing that, we can make a rough approximation from available data. We can get the average utilisation of the importer from the rdf-repository-import-time-cnt metric. You can see in Grafana <https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=5&fullscreen&orgId=1&refresh=1m> that the derivative of this metric hovers between 0 and 1 when WDQS is not lagged, and remains near 1 when WDQS is lagged.
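For illustration, the queue-size estimate described above (log-end offset minus current consumer position, summed over partitions) is just this arithmetic; the sketch below uses plain dicts of offsets, where a real implementation would get the values from KafkaConsumer.position() and KafkaConsumer.endOffsets():

```python
def estimate_queue_size(current_offsets, end_offsets):
    """Estimate pending queue size by subtracting the consumer's current
    position from the log-end offset for each partition.

    Both arguments map partition id -> offset. max(0, ...) guards against
    a position briefly reported ahead of a stale end offset.
    """
    return sum(
        max(0, end_offsets[p] - current_offsets.get(p, 0))
        for p in end_offsets
    )

# Hypothetical offsets for a three-partition topic:
print(estimate_queue_size({0: 100, 1: 250, 2: 7},
                          {0: 120, 1: 250, 2: 30}))  # -> 43
```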
> The metric I would propose is to add replication lag to this utilisation metric, appropriately scaled: //utilisation + K_lag * lag - 1//, where K_lag is, say, 1/60s. This is a metric which is -1 at idle, 0 when busy with no lag, and 1 with 1 minute of lag. The control system would adjust the request rate to keep this metric (and its integral) at zero.
>
>> With PID, we need to define three constants K_p, K_i and K_d. If we had problems finding the pool size, this is going to get three times more complicated (I didn't find a standard way to determine these coefficients; maybe I'm missing something obvious).
>
> One way to simplify it is with K_d = 0, i.e. make it a PI controller. Having the derivative in there probably doesn't add much. Then it's only two times more complicated, although I added K_lag, so I suppose we are still at 3. The idea is that it shouldn't matter too much exactly what K_p and K_i are set to -- the system should be stable and have low lag with a wide range of parameter values. So you just pick some values and see if it works.
>
>> We currently don't have the infrastructure to hold the "maxlag" data over time so we can calculate its derivative and integral. Should we use Redis? What would that look like? These are questions I don't have answers for. Do you have ideas?
>
> WDQS lag is currently obtained by having an ApiMaxLagInfo hook handler which queries Prometheus, caching the result. Prometheus has a query language which can perform derivatives ("rate") and integrals ("sum_over_time") on metrics. So it would be the same system as now, just with a different Prometheus query.

I might be a little YAGNI here, but I would love to have the maxlag numbers kept over time, so we build the PI controller using the maxlag value and not the lag of WDQS.
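A minimal sketch of the PI controller on the proposed metric, to make the mechanics concrete. The gains K_P and K_I and the sampling interval are illustrative guesses, not tuned values; the point above is precisely that a wide range should work:

```python
K_LAG = 1.0 / 60.0  # scale so 1 minute of lag contributes 1.0

def combined_metric(utilisation, lag_seconds):
    """The proposed metric: -1 at idle, 0 when busy with no lag,
    +1 at one minute of lag."""
    return utilisation + K_LAG * lag_seconds - 1.0

class PIController:
    """PI controller (K_d = 0) driving the metric toward zero."""

    def __init__(self, k_p, k_i, dt):
        self.k_p, self.k_i, self.dt = k_p, k_i, dt
        self.integral = 0.0

    def step(self, metric):
        """Return a rate adjustment; a negative output means the
        accepted edit rate should be reduced."""
        error = 0.0 - metric            # setpoint is zero
        self.integral += error * self.dt
        return self.k_p * error + self.k_i * self.integral

# Fully utilised with 2 minutes of lag: the controller pushes the
# rate down (negative output).
pi = PIController(k_p=0.5, k_i=0.1, dt=60.0)
print(pi.step(combined_metric(1.0, 120.0)))
```

With the metric at zero the error vanishes and the integral stops accumulating, which is what keeps the system hovering at "busy but not lagged" rather than oscillating around the hard maxlag cutoff.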
Mostly because WDQS will hopefully be fixed and handled later, but there will always be some sort of edit-rate bottleneck (jobqueue, replication, you name it). But if you think we should work on WDQS for now, I'm okay with that. My thinking was to start with a P controller based on maxlag, and to build the infrastructure to keep the data over time (maybe Prometheus? Or query statsd? We already store all maxlag values there <https://grafana.wikimedia.org/d/000000156/wikidata-dispatch?panelId=22&fullscreen&orgId=1&refresh=1m&from=now-6h&to=now>, but it seems broken at the moment) and add it there. I think oscillating around 3s is much better than oscillating around 5s, because over 5s the system doesn't accept the edit and the user has to re-send it.

> The wording in RFC 7231 suggests to me that it is acceptable to use Retry-After in a 2xx response. "Servers send the "Retry-After" header field to indicate how long the user agent ought to wait before making a follow-up request." That seems pretty close to what we're doing.

Ack. I think we should communicate this to the tool developers (and the pywikibot folks) so they start honouring the header all the time.

TASK DETAIL
https://phabricator.wikimedia.org/T252091