gerritbot added a comment.
Change 483943 merged by jenkins-bot:
[mediawiki/core@REL1_30] rdbms: reduce LoadBalancer replication log spam
https://gerrit.wikimedia.org/r/483943
gerritbot added a comment.
Change 483859 merged by jenkins-bot:
[mediawiki/core@REL1_32] rdbms: reduce LoadBalancer replication log spam
https://gerrit.wikimedia.org/r/483859
gerritbot added a comment.
Change 483942 merged by jenkins-bot:
[mediawiki/core@REL1_31] rdbms: reduce LoadBalancer replication log spam
https://gerrit.wikimedia.org/r/483942
gerritbot added a comment.
Change 483943 had a related patch set uploaded (by Reedy; owner: Aaron Schulz):
[mediawiki/core@REL1_30] rdbms: reduce LoadBalancer replication log spam
https://gerrit.wikimedia.org/r/483943
gerritbot added a comment.
Change 483942 had a related patch set uploaded (by Reedy; owner: Aaron Schulz):
[mediawiki/core@REL1_31] rdbms: reduce LoadBalancer replication log spam
https://gerrit.wikimedia.org/r/483942
gerritbot added a comment.
Change 483859 had a related patch set uploaded (by Paladox; owner: Aaron Schulz):
[mediawiki/core@REL1_32] rdbms: reduce LoadBalancer replication log spam
https://gerrit.wikimedia.org/r/483859
hoo added a comment.
In T204531#4813322, @hoo wrote:
I just revisited the original logs… and I found that the far more common error back then was from LoadBalancer::getRandomNonLagged:
2018-09-19 07:13:27 [1d74bb1b0e64917ec3cd47e7] snapshot1008 wikidatawiki 1.32.0-wmf.20 DBReplication …
gerritbot added a comment.
Change 478773 merged by jenkins-bot:
[mediawiki/core@master] rdbms: reduce LoadBalancer replication log spam
https://gerrit.wikimedia.org/r/478773
hoo added a comment.
I just revisited the original logs… and I found that the far more common error back then was from LoadBalancer::getRandomNonLagged:
2018-09-19 07:22:32 [c36272088a424bcc438a31e9] snapshot1008 wikidatawiki 1.32.0-wmf.20 DBReplication ERROR: Server db1109 has …
gerritbot added a comment.
Change 478773 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: reduce LoadBalancer replication log spam
https://gerrit.wikimedia.org/r/478773
ArielGlenn added a comment.
Ah, I misread! Before we arbitrarily throttle, though, I'd like the DBAs to weigh in. It might be that letting the flood actually flood is helpful sometimes.
hoo added a comment.
Anything that depends on a particular db server makes me uneasy
Hm… above I meant that, for every DB server, it should only warn once every (say) LAG_WARN_THRESHOLD / 2 seconds… it makes little sense to let a single LoadMonitor instance create more log entries than that.
ArielGlenn added a comment.
Of the above options, I prefer the idea of explicitly increasing the LAG_WARN_THRESHOLD in scripts where that's desired. Anything that depends on a particular db server makes me uneasy, and I don't like the idea of turning off load monitoring completely either.
hoo added a comment.
Or we could alter LoadMonitor::getServerStates to not log for a given DB server more than once every N seconds (where N could be derived from LAG_WARN_THRESHOLD, which is currently 10s).
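For illustration, a minimal standalone sketch of that throttling idea, assuming nothing about the real LoadMonitor internals (the class, method, and callback names here are hypothetical):

<?php
// Hypothetical helper, not the actual LoadMonitor code: remember when each
// DB server last triggered a lag warning and suppress further warnings for
// that server until N seconds have passed (N derived from the threshold).
class ThrottledLagLogger {
	/** @var float[] Last warning timestamp (Unix time) per server name */
	private $lastWarned = [];
	/** @var int Minimum number of seconds between warnings per server */
	private $interval;

	public function __construct( $lagWarnThreshold = 10 ) {
		$this->interval = $lagWarnThreshold;
	}

	/**
	 * Log a lag warning for $server unless one was logged recently.
	 * @param string $server DB server name, e.g. "db1109"
	 * @param float $lag Replication lag in seconds
	 * @param callable $log The actual logging callback
	 */
	public function maybeWarn( $server, $lag, callable $log ) {
		$now = microtime( true );
		$last = isset( $this->lastWarned[$server] ) ? $this->lastWarned[$server] : 0.0;
		if ( ( $now - $last ) >= $this->interval ) {
			$this->lastWarned[$server] = $now;
			$log( "Server $server has $lag seconds of lag" );
		}
	}
}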
hoo added a comment.
I just looked into this some more… We could also create a new LB settings key which would be passed as lagWarnThreshold to LoadMonitor (currently this option is never set, as there is no way to do so, thus it always defaults to LoadMonitor::LAG_WARN_THRESHOLD (10s)).
With …
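For illustration, if such a settings key existed, the wiring might look roughly like this (a sketch only: the 'lagWarnThreshold' key does not exist today, and the exact shape of the $wgLBFactoryConf array shown here is an assumption):

<?php
// Hypothetical configuration sketch: pass a lag warning threshold through
// the load balancer factory settings down to LoadMonitor. The value 30 is
// an arbitrary example; LoadMonitor::LAG_WARN_THRESHOLD defaults to 10s.
$wgLBFactoryConf = [
	'class' => 'LBFactorySimple',
	'loadMonitor' => [
		'class' => \Wikimedia\Rdbms\LoadMonitor::class,
		'lagWarnThreshold' => 30, // hypothetical key, in seconds
	],
];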
hoo added a comment.
Just for the record, the errors were:
2018-09-19 07:13:15 [580058d7b2babd5f85e725ec] snapshot1008 wikidatawiki 1.32.0-wmf.20 DBReplication ERROR: Server db1109 has 29.26508808136 seconds of lag (>= 10) {"host":"db1109","lag":29.26508808136,"maxlag":10}
Was there anything …
ArielGlenn added a comment.
Well, the mystery of 'why eqiad' is solved; the choice of whether to parse db-eqiad.php or db-codfw.php is determined by the global $wmfDatacenter. This is set in multiversion/MWRealm.php from the value of $wmfCluster, which is taken from the contents of …
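For illustration, a simplified sketch of that selection logic (not the actual multiversion/MWRealm.php code; the string matching and the $wmfConfigDir variable are assumptions):

<?php
// Simplified sketch: pick the per-datacenter DB config file based on
// $wmfDatacenter, which MWRealm.php derives from $wmfCluster.
$wmfDatacenter = ( strpos( $wmfCluster, 'codfw' ) !== false ) ? 'codfw' : 'eqiad';
require "$wmfConfigDir/db-$wmfDatacenter.php"; // db-eqiad.php or db-codfw.php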
ArielGlenn added a comment.
root@snapshot1008:~# netstat -a -p | grep php | grep eqiad | grep db
tcp        0      0 snapshot1008.eqia:41898 db1087.eqiad.wmne:mysql ESTABLISHED 105786/php7.0
tcp        0      0 snapshot1008.eqia:42528 db1087.eqiad.wmne:mysql ESTABLISHED 107804/php7.0
…
ArielGlenn added a comment.
The wikidata weeklies are running now; the 'regular' (xml/sql) dumps completed on Sept 15th for the first run of the month.