Ladsgroup added a comment.
In T252091#6150993 <https://phabricator.wikimedia.org/T252091#6150993>, @Joe wrote:

>> - "The above suggests that the current rate limit is too high," this is not correct; the problem is that there is no rate limit for bots at all. The group explicitly doesn't have a rate limit. Adding such a rate limit was tried and caused lots of issues (even with a pretty high number).
>
> What kind of issues, specifically?
>
> I find the idea that we can't impose an upper limit on edits per minute bizarre, in the abstract, but there might be good reasons for that.

It broke MassMessage (T192690: Mass message broken on Wikidata after ratelimit workaround <https://phabricator.wikimedia.org/T192690>); see also the discussions in T184948: limit page creation and edit rate on Wikidata <https://phabricator.wikimedia.org/T184948>.

In T252091#6151004 <https://phabricator.wikimedia.org/T252091#6151004>, @Joe wrote:

> So, while I find the idea of using poolcounter to limit the editing **concurrency** (it's not rate-limiting, which is different) a good proposal, and in general something desirable to have (including the possibility we tune it down to zero if we're in a crisis, for instance), I think the fundamental problem reported here is that WDQS can't ingest the updates fast enough.

My opinion is that there will always be a bottleneck in the rate of digesting edits in some part of the infrastructure. If we fix WDQS in the next couple of months, the edit rate will also scale up and we might hit a similar issue in, for example, the search index update. See T243701#6152282 <https://phabricator.wikimedia.org/T243701#6152282>.

> So the solution should be searched for there; either we improve the performance of WDQS in ingesting updates (and I see there are future plans for that) or we stop considering it when calculating maxLag. We should not limit the edits happening to Wikidata just because a dependent system can't keep up the pace. On paper they are dependent, but in reality they are not.
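As an aside on the distinction Joe draws: a concurrency limit caps how many edits are in flight at once, while a rate limit caps how many may start per unit of time. This is a minimal illustrative sketch of the two mechanisms, not the actual PoolCounter or MediaWiki ratelimit implementation; all names here are hypothetical.

```python
import threading
import time

# Concurrency limit (PoolCounter-style idea): at most N edits in flight.
# Throughput adapts to how fast edits complete, with no fixed edits/minute cap.
MAX_CONCURRENT_EDITS = 4
edit_pool = threading.BoundedSemaphore(MAX_CONCURRENT_EDITS)

def save_edit(page, text):
    """Hypothetical placeholder for the actual edit; takes some time."""
    time.sleep(0.01)

def concurrency_limited_edit(page, text):
    # Blocks while MAX_CONCURRENT_EDITS edits are already in flight;
    # as soon as one finishes, the next may start.
    with edit_pool:
        save_edit(page, text)

class RateLimiter:
    """Fixed rate limit: at most `limit` calls per `window` seconds."""
    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.calls = []

    def allow(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False  # caller gets an error and must retry later
```

With fast edits, a small pool still permits a high edit rate, which is why tuning it down to zero in a crisis is the meaningful lever, whereas a rate limit rejects edits past a fixed count regardless of how quickly the backend is absorbing them.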
When we didn't count WDQS lag into maxlag, sometimes the lag was as high as half a day (and growing). This actually caused issues: lots of tools and systems that edit Wikidata also read from WDQS, and they started doing basic GIGO (garbage in, garbage out) because they were getting outdated data, using it to add wrong data back to Wikidata, and this feedback loop caused problems. Also, while it's safe to assume WDQS may be lagged by as much as half an hour, when it's lagged for half a day it breaks lots of implicit assumptions tool builders make, similar to what would happen if the search index on Wikipedia started to lag behind by a day.

In T252091#6154167 <https://phabricator.wikimedia.org/T252091#6154167>, @tstarling wrote:

> This proposal is effectively a dynamic rate limit, except that instead of delivering an error message when it is exceeded, we will just hold the connection open, forcing the bot to wait. That's expensive in terms of server resources -- we'd rather have the client wait using only its own resources. A rate limit has a tunable parameter (the rate) which is not really knowable. Similarly, this proposal has a tunable parameter (the pool size) which is not really knowable. You have to tune the pool size down until the replag stops increasing, but then if the nature of the edits changes, or if the hardware changes, the optimal pool size will change.
>
> I suggested at T202107 <https://phabricator.wikimedia.org/T202107> that the best method for globally controlling replication lag would be a PID controller <https://en.wikipedia.org/wiki/PID_controller>. A PID controller suppresses oscillation by having a memory of recent changes in the metric. The P (proportional) term is essentially as proposed at T240442 <https://phabricator.wikimedia.org/T240442> -- just back off proportionally as the lag increases. The problem with this is that it will settle into an equilibrium lag somewhere in the middle of the range.
> The I (integral) term addresses this by maintaining a rolling average and adjusting the control value until the average meets the desired value. This allows it to maintain approximately the same edit rate but with a lower average replication lag. The D (derivative) term causes the control value to be reduced more aggressively if the metric is rising quickly.
>
> My proposal is to use a PID controller to set the Retry-After header. Clients would be strongly encouraged to respect that header. We could have, say, maxlag=auto to opt in to this system.

That sounds like a good alternative that needs exploring. I haven't thought about it in depth, but I promise to do so and come back to you.

TASK DETAIL
https://phabricator.wikimedia.org/T252091
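For the record, the PID scheme tstarling describes could look roughly like the sketch below. This is a textbook discrete PID loop with made-up gains and a hypothetical measurement source, not anything MediaWiki actually ships; the output is interpreted as the value for the Retry-After header.

```python
class PIDController:
    """Illustrative discrete PID controller for replication lag.

    Output is the number of seconds to send in the Retry-After header:
    0 means clients may edit freely; larger values tell clients to back off.
    Gains (kp, ki, kd) and the setpoint are hypothetical, not tuned values.
    """
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target replication lag, in seconds
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measured_lag, dt):
        error = measured_lag - self.setpoint
        # P: back off proportionally as lag exceeds the target (the T240442
        # idea); alone, it settles at an equilibrium lag mid-range.
        p = self.kp * error
        # I: accumulated error keeps pushing until the *average* lag meets
        # the setpoint, fixing the equilibrium problem of P alone.
        self.integral += error * dt
        i = self.ki * self.integral
        # D: react more aggressively when the lag is rising quickly.
        d = self.kd * (error - self.prev_error) / dt
        self.prev_error = error
        # Retry-After cannot be negative; clamp at zero.
        return max(0.0, p + i + d)

# Hypothetical server loop: recompute Retry-After every 10 seconds from
# the currently measured replication lag (here a hard-coded 12 s).
controller = PIDController(kp=0.5, ki=0.05, kd=0.2, setpoint=5.0)
retry_after = controller.update(measured_lag=12.0, dt=10.0)
```

A client opting in with maxlag=auto would then sleep for `retry_after` seconds before retrying, so the waiting happens on the client's side rather than in a held-open server connection.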
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs