Ladsgroup added a comment.
In T252091#6154167 <https://phabricator.wikimedia.org/T252091#6154167>, @tstarling wrote:

> This proposal is effectively a dynamic rate limit except that instead of delivering an error message when it is exceeded, we will just hold the connection open, forcing the bot to wait. That's expensive in terms of server resources -- we'd rather have the client wait using only its own resources. A rate limit has a tunable parameter (the rate) which is not really knowable. Similarly, this proposal has a tunable parameter (the pool size) which is not really knowable. You have to tune the pool size down until the replag stops increasing, but then if the nature of the edits changes, or if the hardware changes, the optimal pool size will change.
>
> I suggested at T202107 <https://phabricator.wikimedia.org/T202107> that the best method for globally controlling replication lag would be with a PID controller <https://en.wikipedia.org/wiki/PID_controller>. A PID controller suppresses oscillation by having a memory of recent changes in the metric. The P (proportional) term is essentially as proposed at T240442 <https://phabricator.wikimedia.org/T240442> -- just back off proportionally as the lag increases. The problem with this is that it will settle into an equilibrium lag somewhere in the middle of the range. The I (integral) term addresses this by maintaining a rolling average and adjusting the control value until the average meets the desired value. This allows it to maintain approximately the same edit rate but with a lower average replication lag. The D (derivative) term causes the control value to be reduced more aggressively if the metric is rising quickly.
>
> My proposal is to use a PID controller to set the Retry-After header. Clients would be strongly encouraged to respect that header. We could have say maxlag=auto to opt in to this system.

I quite like the idea of using PID, but there are three notes I want to mention:

- With PID, we need to define three constants: K_p, K_i, and K_d. If we already had trouble finding the right pool size, this is going to be three times more complicated (I didn't find a standard way to determine these coefficients; maybe I'm missing something obvious; see the sketch after this list for what these knobs look like).
- We currently don't have the infrastructure to hold the "maxlag" data over time so that we can calculate its derivative and integral. Should we use Redis? What would that look like? These are questions I don't have answers to. Do you have ideas for that?
- I'm not sure "Retry-After" is a good header for 2xx responses. It reads like "We accepted your edit, but 'retry' it after 2 seconds." I looked at RFC 7231 and it doesn't explicitly say we can't use it in 2xx responses, but I haven't seen it used in 2xx responses anywhere. We might be able to find a better header?
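
To make the tuning question concrete, here is a minimal sketch of a PID loop that maps observed replication lag to a Retry-After value. This is illustration only, not MediaWiki code: the gains, the target lag, the sampling interval assumption, and the 60-second cap are all made-up numbers, and the class and method names are hypothetical.

```
import time


class ReplagPidController:
    """Map replication lag (seconds) to a suggested Retry-After delay."""

    def __init__(self, k_p=1.0, k_i=0.1, k_d=0.5, target_lag=1.0):
        self.k_p = k_p                # proportional gain
        self.k_i = k_i                # integral gain
        self.k_d = k_d                # derivative gain
        self.target_lag = target_lag  # desired steady-state lag in seconds
        self._integral = 0.0
        self._prev_error = 0.0
        self._prev_time = None

    def retry_after(self, current_lag):
        """Return a Retry-After value (seconds) for the current lag sample."""
        now = time.monotonic()
        error = current_lag - self.target_lag

        if self._prev_time is None:
            dt = 1.0  # first sample: assume a nominal interval
        else:
            dt = max(now - self._prev_time, 1e-6)

        # I term: accumulated error over time (this is the state we would
        # have to persist somewhere, e.g. Redis, between requests).
        self._integral += error * dt
        # D term: how fast the lag is rising or falling.
        derivative = (error - self._prev_error) / dt

        output = (self.k_p * error
                  + self.k_i * self._integral
                  + self.k_d * derivative)

        self._prev_error = error
        self._prev_time = now

        # Never ask clients to wait a negative time; cap at 60 s for sanity.
        return max(0.0, min(output, 60.0))


# Usage: feed the controller the lag observed for each edit request and put
# the result in the response's Retry-After header (or an equivalent field).
controller = ReplagPidController()
for lag in (0.5, 1.2, 2.5, 4.0, 3.0):
    print(f"lag={lag}s -> Retry-After: {controller.retry_after(lag):.2f}s")
```

Even in this toy version, the controller only works because it keeps the integral and previous-error state between samples, which is exactly the storage question from the second bullet above.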