Ladsgroup added a comment.
In T252091#6150993 <https://phabricator.wikimedia.org/T252091#6150993>, @Joe wrote:

>> - "The above suggests that the current rate limit is too high," this is not correct; the problem is that there is no rate limit for bots at all. The group explicitly doesn't have a rate limit. Adding such a rate limit was tried and caused lots of issues (even with a pretty high number).
>
> What kind of issues, specifically?
>
> I find the idea that we can't impose an upper limit on edits per minute bizarre, in the abstract, but there might be good reasons for that.

It broke MassMessage (T192690: Mass message broken on Wikidata after ratelimit workaround <https://phabricator.wikimedia.org/T192690>); see also the discussions in T184948: limit page creation and edit rate on Wikidata <https://phabricator.wikimedia.org/T184948>.

In T252091#6151004 <https://phabricator.wikimedia.org/T252091#6151004>, @Joe wrote:

> So, while I find the idea of using poolcounter to limit the editing **concurrency** (it's not rate-limiting, which is different) a good proposal, and in general something desirable to have (including the possibility we tune it down to zero if we're in a crisis, for instance), I think the fundamental problem reported here is that WDQS can't ingest the updates fast enough.

My opinion is that there will always be a bottleneck in the rate of digesting edits in some part of the infrastructure. If we fix WDQS in the next couple of months, the edit rate will also scale up and we might hit a similar issue in, for example, the search index update. See T243701#6152282 <https://phabricator.wikimedia.org/T243701#6152282>.

> So the solution should be searched for there; either we improve the performance of WDQS in ingesting updates (and I see there are future plans for that) or we stop considering it when calculating maxLag. We should not limit the edits happening to Wikidata just because a dependent system can't keep up the pace. On paper they are dependent, but in reality they are not.
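As an aside on the distinction Joe draws: a concurrency limit caps how many edits are in flight at once, while a rate limit caps how many may start per unit of time. This is a minimal illustrative sketch of the two mechanisms, not the actual PoolCounter or MediaWiki ratelimit implementation; all names here are hypothetical.

```python
import threading
import time

# Concurrency limit (PoolCounter-style idea): at most N edits in flight.
# Throughput adapts to how fast edits complete, with no fixed edits/minute cap.
MAX_CONCURRENT_EDITS = 4
edit_pool = threading.BoundedSemaphore(MAX_CONCURRENT_EDITS)

def save_edit(page, text):
    """Hypothetical placeholder for the actual edit; takes some time."""
    time.sleep(0.01)

def concurrency_limited_edit(page, text):
    # Blocks while MAX_CONCURRENT_EDITS edits are already in flight;
    # as soon as one finishes, the next may start.
    with edit_pool:
        save_edit(page, text)

class RateLimiter:
    """Fixed rate limit: at most `limit` calls per `window` seconds."""
    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.calls = []

    def allow(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False  # caller gets an error and must retry later
```

With fast edits, a small pool still permits a high edit rate, which is why tuning it down to zero in a crisis is the meaningful lever, whereas a rate limit rejects edits past a fixed count regardless of how quickly the backend is absorbing them.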
When we didn't count WDQS lag into maxlag, sometimes the lag was as high as half a day (and growing). This actually caused issues: lots of tools and systems that edit Wikidata also read from WDQS, and they started doing basic GIGO (garbage in, garbage out) because they were getting outdated data, using it to add wrong data back to Wikidata, and this feedback loop caused problems. Also, while it's safe to assume WDQS may be lagged by as much as half an hour, when it's lagged for half a day it breaks lots of implicit assumptions tool builders make, similar to what would happen if the search index on Wikipedia started to lag behind by a day.

In T252091#6154167 <https://phabricator.wikimedia.org/T252091#6154167>, @tstarling wrote:

> This proposal is effectively a dynamic rate limit, except that instead of delivering an error message when it is exceeded, we will just hold the connection open, forcing the bot to wait. That's expensive in terms of server resources -- we'd rather have the client wait using only its own resources. A rate limit has a tunable parameter (the rate) which is not really knowable. Similarly, this proposal has a tunable parameter (the pool size) which is not really knowable. You have to tune the pool size down until the replag stops increasing, but then if the nature of the edits changes, or if the hardware changes, the optimal pool size will change.
>
> I suggested at T202107 <https://phabricator.wikimedia.org/T202107> that the best method for globally controlling replication lag would be a PID controller <https://en.wikipedia.org/wiki/PID_controller>. A PID controller suppresses oscillation by having a memory of recent changes in the metric. The P (proportional) term is essentially as proposed at T240442 <https://phabricator.wikimedia.org/T240442> -- just back off proportionally as the lag increases. The problem with this is that it will settle into an equilibrium lag somewhere in the middle of the range.
> The I (integral) term addresses this by maintaining a rolling average and adjusting the control value until the average meets the desired value. This allows it to maintain approximately the same edit rate but with a lower average replication lag. The D (derivative) term causes the control value to be reduced more aggressively if the metric is rising quickly.
>
> My proposal is to use a PID controller to set the Retry-After header. Clients would be strongly encouraged to respect that header. We could have, say, maxlag=auto to opt in to this system.

That sounds like a good alternative that needs exploring. I haven't thought about it in depth, but I promise to do so and come back to you.

TASK DETAIL
https://phabricator.wikimedia.org/T252091
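For the record, the PID scheme tstarling describes could look roughly like the sketch below. This is a textbook discrete PID loop with made-up gains and a hypothetical measurement source, not anything MediaWiki actually ships; the output is interpreted as the value for the Retry-After header.

```python
class PIDController:
    """Illustrative discrete PID controller for replication lag.

    Output is the number of seconds to send in the Retry-After header:
    0 means clients may edit freely; larger values tell clients to back off.
    Gains (kp, ki, kd) and the setpoint are hypothetical, not tuned values.
    """
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target replication lag, in seconds
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measured_lag, dt):
        error = measured_lag - self.setpoint
        # P: back off proportionally as lag exceeds the target (the T240442
        # idea); alone, it settles at an equilibrium lag mid-range.
        p = self.kp * error
        # I: accumulated error keeps pushing until the *average* lag meets
        # the setpoint, fixing the equilibrium problem of P alone.
        self.integral += error * dt
        i = self.ki * self.integral
        # D: react more aggressively when the lag is rising quickly.
        d = self.kd * (error - self.prev_error) / dt
        self.prev_error = error
        # Retry-After cannot be negative; clamp at zero.
        return max(0.0, p + i + d)

# Hypothetical server loop: recompute Retry-After every 10 seconds from
# the currently measured replication lag (here a hard-coded 12 s).
controller = PIDController(kp=0.5, ki=0.05, kd=0.2, setpoint=5.0)
retry_after = controller.update(measured_lag=12.0, dt=10.0)
```

A client opting in with maxlag=auto would then sleep for `retry_after` seconds before retrying, so the waiting happens on the client's side rather than in a held-open server connection.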
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs