tstarling added a comment.

  I hope you don't mind if I contradict my previous comment a bit, since my 
thinking is still evolving on this.
  
  One problem with using lag as the metric is that it doesn't go negative, so 
the integral will not be pulled down while the service is idle. We could 
subtract a target lag, say 1 minute, but that loses some of the supposed 
benefit of including an integral term. A better metric would be updater load, 
i.e. demand/capacity. When the load is more than 100%, the lag increases at a 
rate of 1 second per second, but that rate carries no further information 
about how heavily overloaded the service is. When the load is less than 100%, 
lag decreases 
until it reaches zero. While it's decreasing, the slope tells you something 
about how underloaded it is, but once it hits zero, you lose that information.
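  
  To make that concrete, here is a toy simulation. It is purely illustrative 
(not WDQS code), and it assumes a simplified drain rate of 1 - load seconds 
per second while underloaded:
  
  ```lang=java
  // Toy model: while overloaded, lag grows at 1 s/s regardless of how
  // overloaded; while underloaded, it drains and then clips at zero,
  // losing the underload signal.
  public class LagToy {
      static double step(double lag, double load) {
          return load >= 1.0 ? lag + 1.0 : Math.max(0.0, lag - (1.0 - load));
      }

      public static void main(String[] args) {
          for (double load : new double[] {1.1, 3.0, 0.9, 0.1}) {
              double lag = 60.0;                // one minute of initial lag
              for (int t = 0; t < 300; t++) {   // simulate 5 minutes
                  lag = step(lag, load);
              }
              // load=1.1 and load=3.0 end identically; load=0.1 clips to 0.
              System.out.printf("load=%.1f -> lag=%.0fs%n", load, lag);
          }
      }
  }
  ```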
  
Load is the average queue size, if you count the currently running batch as 
part of the queue. WDQS currently does not monitor the queue size. I gather 
(after an hour or so of research, I'm new to all this) that with some effort, 
KafkaPoller could obtain an estimate of the queue size by subtracting the 
current partition offsets from KafkaConsumer.endOffsets() 
<https://kafka.apache.org/25/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#endOffsets-java.util.Collection->.
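  
  Something like the following sketch, assuming KafkaPoller can reach its 
KafkaConsumer and its assigned partitions (I haven't tried this against the 
actual updater code):
  
  ```lang=java
  import java.util.Map;
  import java.util.Set;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.TopicPartition;

  public class QueueSizeEstimator {
      /**
       * Rough backlog estimate: for each assigned partition, the broker's
       * end offset minus this consumer's current position is the number of
       * records still waiting to be consumed.
       */
      public static long estimateQueueSize(KafkaConsumer<?, ?> consumer) {
          Set<TopicPartition> partitions = consumer.assignment();
          // endOffsets() makes a remote call to the brokers.
          Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
          long backlog = 0;
          for (TopicPartition tp : partitions) {
              backlog += endOffsets.get(tp) - consumer.position(tp);
          }
          return backlog;
      }
  }
  ```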
  
  Failing that, we can make a rough approximation from available data. We can 
get the average utilisation of the importer from the 
rdf-repository-import-time-cnt metric. You can see in Grafana 
<https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=5&fullscreen&orgId=1&refresh=1m>
 that the derivative of this metric hovers between 0 and 1 when WDQS is not 
lagged, and remains near 1 when WDQS is lagged. The metric I would propose is 
to add replication lag to this utilisation metric, appropriately scaled: 
//utilisation + K_lag * lag - 1//, where K_lag is, say, 1/(60 s). This metric 
is -1 at idle, 0 when busy with no lag, and +1 when busy with 1 minute of lag. The 
control system would adjust the request rate to keep this metric (and its 
integral) at zero.
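  
  In code form, trivial but it pins down the scaling (the constant is the 
1/60 from above):
  
  ```lang=java
  public class ControlMetric {
      /** -1 at idle, 0 when busy with no lag, +1 when busy with 1 minute of lag. */
      public static double value(double utilisation, double lagSeconds) {
          final double K_LAG = 1.0 / 60.0;  // 1/(60 s): a minute of lag adds 1
          return utilisation + K_LAG * lagSeconds - 1.0;
      }
  }
  ```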
  
  > With PID, we need to define three constants K_p, K_i and K_d. If we had 
a problem with finding the pool size, this is going to get three times more 
complicated. (I didn't find a standard way to determine these coefficients; 
maybe I'm missing something obvious.)
  
  One way to simplify it is with K_d=0, i.e. make it a PI controller. Having 
the derivative in there probably doesn't add much. Then it's only two times 
more complicated, although I added K_lag, so I suppose we are still at 3. The 
idea is that it shouldn't matter too much exactly what K_p and K_i are set to 
-- the system should be stable and have low lag across a wide range of 
parameter values. So you just pick some values and see if it works.
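  
  For concreteness, here is a minimal PI controller sketch. The class and the 
gains are mine, and nothing like this exists in the updater yet; the error 
input would be the metric above, and the output would be mapped to a 
Retry-After delay:
  
  ```lang=java
  public class PiController {
      private final double kP;
      private final double kI;
      private double integral = 0.0;

      public PiController(double kP, double kI) {
          this.kP = kP;
          this.kI = kI;
      }

      /** Feed in the current error; get back a control output. */
      public double update(double error, double dtSeconds) {
          integral += error * dtSeconds;
          return kP * error + kI * integral;
      }
  }
  ```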
  
  > We currently don't have the infrastructure to hold the "maxlag" data over 
time so we can calculate its derivative and integral. Should we use Redis? 
What is it going to look like? These are questions I don't have answers to. Do 
you have ideas for that?
  
  WDQS lag is currently obtained via an ApiMaxLagInfo hook handler which 
queries Prometheus and caches the result. Prometheus has a query language which 
can perform derivatives ("rate") and integrals ("sum_over_time") on metrics. So 
it would be the same system as now, just with a different Prometheus query.
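  
  For example, something along these lines. The metric names are guesses on 
my part and would need checking against our Prometheus config, and the 
30m/1m subquery window is arbitrary:
  
  ```lang=java
  public class MaxLagQueries {
      // rate() over the counter gives the utilisation; the lag metric name
      // here is hypothetical.
      static final String METRIC =
          "rate(rdf_repository_import_time_cnt[5m]) + wdqs_lag_seconds / 60 - 1";
      // sum_over_time() over a subquery gives the integral term.
      static final String INTEGRAL =
          "sum_over_time((" + METRIC + ")[30m:1m])";
  }
  ```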
  
  > I'm not sure "Retry-After" is a good header for 2xx responses. It's like 
"we accepted your edit, but 'retry' it after 2 seconds". I looked at RFC 7231, 
and it doesn't explicitly say we can't use it in 2xx responses, but I haven't 
seen it used anywhere in 2xx responses. We might be able to find a better 
header?
  
  The wording in RFC 7231 suggests to me that it is acceptable to use 
Retry-After in a 2xx response. "Servers send the "Retry-After" header field to 
indicate how long the user agent ought to wait before making a follow-up 
request." That seems pretty close to what we're doing.
  
  In summary, we query Prometheus for //utilisation + lag / 60 - 1//, both the 
most recent value and the sum over some longer time interval. The sum and the 
value are separately scaled, then they are added together, then the result is 
limited to some reasonable range like 0-600 s. If it's >0, then we send it as a 
Retry-After header. Then we badger all bots into respecting the header.
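  
  As a sketch of that last step (the gains are made up and the plumbing is 
illustrative, not a design):
  
  ```lang=java
  import java.util.OptionalLong;

  public class RetryAfterCalculator {
      /**
       * Combine the instantaneous metric and its integral, scaled by
       * illustrative gains, clamp to 0-600 s, and only emit a header when
       * the result is positive.
       */
      public static OptionalLong retryAfterSeconds(double metric, double integral) {
          final double K_P = 300.0;  // to be tuned in practice
          final double K_I = 10.0;
          double delay = K_P * metric + K_I * integral;
          long clamped = Math.round(Math.max(0.0, Math.min(600.0, delay)));
          return clamped > 0 ? OptionalLong.of(clamped) : OptionalLong.empty();
      }
  }
  ```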

TASK DETAIL
  https://phabricator.wikimedia.org/T252091
