jcrespo edited projects, added MediaWiki-Database; removed DBA.
jcrespo added a comment.

That is the maximum lag, and it is normal on the slaves that MediaWiki does not wait for. This issue has nothing to do with the databases; MediaWiki does what it is programmed to do: if it detects lag, even of a few seconds, it disables the API (by design). If you do not like that behaviour, a specific change should be proposed to the maintainers of MediaWiki's persistence layer. If you do not like your bot erroring out or crashing, complain to the API client application, as the response clearly tells you to retry in X seconds. The databases themselves have no issues: individual servers lag from time to time, and that is how the system is programmed to react (which to me makes sense, but I have no say on that).
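
For reference, a minimal client-side sketch of honouring that retry hint instead of crashing (assuming a Python bot talking to the Wikidata action API; the endpoint, maxlag value and retry counts are illustrative, not a recommendation):

```python
import time
import requests

API = "https://www.wikidata.org/w/api.php"   # illustrative endpoint

def api_call(session, params, max_retries=5, default_wait=5):
    """Call the action API, backing off when the server reports replication lag.

    Sends maxlag so the server refuses the request while replicas are lagged,
    then sleeps for the advertised Retry-After (or a default) before retrying.
    """
    params = dict(params, format="json", maxlag=5)
    for attempt in range(max_retries):
        resp = session.get(API, params=params, timeout=30)
        data = resp.json()
        error = data.get("error", {})
        if error.get("code") != "maxlag":
            return data                       # success, or an unrelated error
        wait = int(resp.headers.get("Retry-After", default_wait))
        time.sleep(wait)                      # do what the response asks for
    raise RuntimeError("replicas stayed lagged after %d retries" % max_retries)

# Example: a harmless read that still goes through the lag check.
with requests.Session() as s:
    print(api_call(s, {"action": "query", "meta": "siteinfo"}))
```

As far as I know, mature client frameworks such as pywikibot already handle maxlag retries for you, so this only matters if you roll your own client.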

Note I am not saying this bug is invalid; I am saying there is nothing broken on the databases or their hardware. For example, there is a vslow slave on each shard that lags all the time, and that is fine because it is not used for main queries (it has weight 0). As you can see here, only one server lags, and normally by about one second ( https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=5&fullscreen&orgId=1&from=1492933452444&to=1493019852444&var-dc=codfw%20prometheus%2Fops ); how the application responds to that is not a database issue, because that is a normal state.

Select queries do not cause lag, unless something is so heavy that it exhausts all available resources (InnoDB reads never block writes); only writes do (or blocking queries). So while the queries mentioned may be bad, lag is caused by writes: either bots writing too many rows at a time, or large write transactions. MediaWiki's current response is to put things in read-only mode when a lot of load is detected. If that is undesired (or could be tuned better), direct your complaints to #mediawiki-database (not sure if that is handled by Performance, Platform, or whom?), not #DBA. (I think writes should only be blocked when the majority of slaves are lagged, not when a single one is.) Replication is asynchronous by design, and there will always be some lag, unless you want us to slow down writes and replicate synchronously, reducing throughput to 1/1000 of the current level (and creating even more errors).
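
To illustrate what I mean by "only block when the majority of slaves are lagged", here is a hypothetical sketch of that policy (this is what I would prefer, not how MediaWiki's load balancer is actually implemented; the threshold and quorum values are made up):

```python
def should_go_read_only(replica_lags, max_lag=5.0, quorum=0.5):
    """Hypothetical policy: enter read-only mode only when more than
    `quorum` of the weighted replica capacity exceeds `max_lag` seconds.

    replica_lags: list of (lag_seconds, weight) tuples; weight-0 replicas
    (vslow/dump slaves) are ignored, just as they are ignored for main traffic.
    """
    serving = [(lag, w) for lag, w in replica_lags if w > 0]
    if not serving:
        return False
    total = sum(w for _, w in serving)
    lagged = sum(w for lag, w in serving if lag > max_lag)
    return lagged / total > quorum

# One slow replica out of three serving ones should not flip the wiki
# to read-only; the weight-0 vslow slave does not count at all.
print(should_go_read_only([(0.2, 100), (0.5, 100), (12.0, 100), (60.0, 0)]))  # False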

I can tell you what I think happens: Wikidata is heavily edited by bots, which create a lot of traffic, and I think that often puts higher-than-normal stress on s5. A better approach than refusing to serve all requests would be to rate-limit the abusive users, but I do not think such a mechanism exists yet (and it is not easy to implement). I have also seen that some job-related writes are sometimes too intensive (see T163544). What I can tell you is what we as DBAs will be doing to minimize Wikidata's impact: we are going to give it dedicated hardware on its own separate shard and call it s8. Will that solve these issues? Probably not. I think this is not a bug, and I would support making it invalid: if your client detects an error, it should retry once after 1 minute (or after whatever time you are told to wait). The only thing I would support is relaxing the detection and its effects; for example, I think the lag detection code is too strict, complaining when the lag is over 1 second while the measurement error is itself 1+ seconds (and I have already shared my opinion on that), and we could allow more than X seconds of lag or avoid blocking certain actions in read-only mode. But the general logic, I think, works as intended.
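
For what it is worth, the simplest shape of "rate-limit the bad users" would be a per-account token bucket in front of writes. This is a purely hypothetical sketch of the idea, not an existing MediaWiki feature, and the rates are made up:

```python
import time

class TokenBucket:
    """Illustrative per-user write limiter: `rate` edits per second,
    with bursts of up to `burst` edits. Not an existing MediaWiki feature."""

    def __init__(self, rate=2.0, burst=10):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # caller should defer or reject the write

buckets = {}                  # one bucket per bot account (hypothetical)

def may_write(user):
    return buckets.setdefault(user, TokenBucket()).allow()
```

Even with something like this, the hard part is deciding the limits and doing it consistently across many application servers, which is why I say it is not easy to implement.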


TASK DETAIL
https://phabricator.wikimedia.org/T123867
