demon added a comment.
Is there anything left here, now that everything in the summary is done?
TASK DETAIL: https://phabricator.wikimedia.org/T179156
EMAIL PREFERENCES: https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: demon
Cc: Zoranzoki21, daniel, Peachey88, ema, Gehel, Smalyshev, T
demon added a comment.
In T179156#3782516, @awight wrote:
@BBlack Thanks for the detailed notes! All I was going to add was my understanding of how Ext:ORES has the potential for exacerbating any issues with the API layer, simply by consuming it with every new edit.
The extension has potential for
Zoranzoki21 added a comment.
Does it cause problems with high sleep times in pywiki?
awight added a comment.
@BBlack Thanks for the detailed notes! All I was going to add was my understanding of how Ext:ORES has the potential for exacerbating any issues with the API layer, simply by consuming it with every new edit.
BBlack added a comment.
No, we never made an incident report on this one, and I don't think it would be fair at this time to implicate ORES as a cause. We can't really say that ORES was directly involved at all (or any of the other services investigated here). Because the cause was so unknown at th
awight added a comment.
@hoo Wondering if you wrote an incident report that I can add to, with an explanation of ORES's involvement?
gerritbot added a comment.
Change 387236 merged by Ema:
[operations/debs/varnish4@debian-wmf] Add local patch for transaction_timeout
https://gerrit.wikimedia.org/r/387236
gerritbot added a comment.
Change 387228 merged by BBlack:
[operations/puppet@production] cache_text: reduce inter-cache backend timeouts as well
https://gerrit.wikimedia.org/r/387228
gerritbot added a comment.
Change 387225 merged by BBlack:
[operations/puppet@production] cache_text: reduce applayer timeouts to reasonable values
https://gerrit.wikimedia.org/r/387225
BBlack added a comment.
In T179156#3720392, @BBlack wrote:
In T179156#3719995, @BBlack wrote:
We have an obvious case of normal slow chunked uploads of large files to commons to look at for examples to observe, though.
Rewinding a little: this is false, I was just getting confused by terminology.
BBlack added a comment.
In T179156#3719995, @BBlack wrote:
We have an obvious case of normal slow chunked uploads of large files to commons to look at for examples to observe, though.
Rewinding a little: this is false, I was just getting confused by terminology. Commons "chunked" uploads throug
daniel added a comment.
Because they're POST they'd be handled as an immediate pass through the varnish layers, so I don't think this would cause what we're looking at now.
"pass" means stream, right? wouldn't that also grab a backend connection from the pool, and hog it if throughput is slow?
We
BBlack added a comment.
In T179156#3719928, @daniel wrote:
In any case, this would consume front-edge client connections, but wouldn't trigger anything deeper into the stack
That's assuming varnish always caches the entire request, and never "streams" to the backend, even for file uploads. When d
daniel added a comment.
In any case, this would consume front-edge client connections, but wouldn't trigger anything deeper into the stack
That's assuming varnish always caches the entire request, and never "streams" to the backend, even for file uploads. When discussing this with @hoo he told me
BBlack added a comment.
Trickled-in POST on the client side would be something else. Varnish's timeout_idle, which is set to 5s on our frontends, acts as the limit for receiving all client request headers, but I'm not sure that it has such a limitation that applies to client-sent bodies. In any c
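The timeout BBlack refers to is an ordinary Varnish runtime parameter, so it can be inspected directly on a cache host. A minimal guarded sketch (assuming `varnishadm` access; on hosts without Varnish it just prints a message):

```shell
# Check the client-side idle timeout on a cache host (sketch; requires
# varnishadm access, so it is guarded for hosts without Varnish).
if command -v varnishadm >/dev/null 2>&1; then
    # timeout_idle bounds how long Varnish waits for client request headers
    varnishadm param.show timeout_idle
else
    echo "varnishadm not available on this host"
fi
```

As noted above, this limit applies to receiving request headers; whether anything similar bounds a slowly trickled request body is exactly the open question.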
daniel added a comment.
@BBlack wrote:
something that's doing a legitimate request->response cycle, but trickling out the bytes of it over a very long period.
That's a well-known attack method. Could this be coming from the outside, trickling the bytes of a POST? Are we sure we are safe against
gerritbot added a comment.
Change 387236 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/debs/varnish4@debian-wmf] [WIP] backend transaction_timeout
https://gerrit.wikimedia.org/r/387236
gerritbot added a comment.
Change 387228 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cache_text: reduce inter-cache backend timeouts as well
https://gerrit.wikimedia.org/r/387228
gerritbot added a comment.
Change 387225 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cache_text: reduce applayer timeouts to reasonable values
https://gerrit.wikimedia.org/r/387225
Lucas_Werkmeister_WMDE added a comment.
In T179156#3719057, @BBlack wrote:
could other services on text-lb be making these kinds of queries to WDQS on behalf of the client and basically proxying the same behavior through?
WikibaseQualityConstraints runs a limited set of queries, but none that co
gerritbot added a comment.
Change 386824 merged by BBlack:
[operations/puppet@production] Revert "cache_text: raise MW connection limits to 10K"
https://gerrit.wikimedia.org/r/386824
BBlack added a comment.
In T179156#3718772, @ema wrote:
There's a timeout limiting the total amount of time varnish is allowed to spend on a single request, send_timeout, defaulting to 10 minutes. Unfortunately there's no counter tracking when the timer kicks in, although a debug line is logged t
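The parameter ema mentions can be inspected the same way as any other Varnish runtime parameter; a guarded sketch (assuming `varnishadm` access on the cache host):

```shell
# Show send_timeout, the cap on the total time Varnish may spend delivering
# a single response (defaulting to 10 minutes, per the comment above).
if command -v varnishadm >/dev/null 2>&1; then
    varnishadm param.show send_timeout
else
    echo "varnishadm not available on this host"
fi
```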
Stashbot added a comment.
Mentioned in SAL (#wikimedia-operations) [2017-10-30T11:44:38Z] Synchronized wmf-config/Wikibase.php: Re-add property for RDF mapping of external identifiers for Wikidata (T179156, T178180) (duration: 00m 49s)
gerritbot added a comment.
Change 387190 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Revert "Add property for RDF mapping of external identifiers for Wikidata""
https://gerrit.wikimedia.org/r/387190
Stashbot added a comment.
Mentioned in SAL (#wikimedia-operations) [2017-10-30T11:33:14Z] Synchronized wmf-config/Wikibase-production.php: Re-enable constraints check with SPARQL (T179156) (duration: 00m 50s)
gerritbot added a comment.
Change 387189 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Disable constraints check with SPARQL for now"
https://gerrit.wikimedia.org/r/387189
Lucas_Werkmeister_WMDE added a comment.
The only live polling feature I can think of that was recently introduced is for the live updates to Special:RecentChanges.
As far as I know, that feature just reloads the recent changes every few seconds with a new request.
Another thing that might be simi
Stashbot added a comment.
Mentioned in SAL (#wikimedia-operations) [2017-10-30T11:15:55Z] rebuilt wikiversions.php and synchronized wikiversions files: Wikidatawiki back to wmf.5 (T179156)
gerritbot added a comment.
Change 387188 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Wikidatawiki to wmf.4"
https://gerrit.wikimedia.org/r/387188
ema added a comment.
In T179156#3717895, @BBlack wrote:
My best hypothesis for the "unreasonable" behavior that would break under do_stream=false is that we have some URI which is abusing HTTP chunked responses to stream an indefinite response. Sort of like websockets, but using the normal HTTP p
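One hedged way to test the hypothesis above is to fetch a suspect URI with a hard time cap and see how much it delivers before the cap; the `file://` target below is only a stand-in for the sketch:

```shell
# Probe how many bytes a URI delivers within a fixed time window; a
# response abusing chunked encoding to stream indefinitely would keep
# sending until --max-time cuts it off. The file:// URL is a placeholder.
probe_stream() {
    curl -s --max-time 5 -o /dev/null -w '%{size_download}\n' "$1"
}
probe_stream "file:///etc/hostname"
```

Run against a real suspect URI, a byte count that keeps growing with a larger `--max-time` would be consistent with an endless stream.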
gerritbot added a comment.
Change 387189 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Revert "Disable constraints check with SPARQL for now"
https://gerrit.wikimedia.org/r/387189
gerritbot added a comment.
Change 387190 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Revert "Revert "Add property for RDF mapping of external identifiers for Wikidata""
https://gerrit.wikimedia.org/r/387190
gerritbot added a comment.
Change 387188 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Revert "Wikidatawiki to wmf.4"
https://gerrit.wikimedia.org/r/387188
Legoktm added a comment.
In T179156#3718221, @BBlack wrote:
Does Echo have any kind of push notification going on, even in light testing yet?
Nothing that's deployed AFAIK. The only live polling feature I can think of that was recently introduced is for the live updates to Special:RecentChanges.
BBlack added a comment.
Does Echo have any kind of push notification going on, even in light testing yet?
ema added a comment.
In T179156#3717847, @BBlack wrote:
For future reference by another opsen who might be looking at this: one of the key metrics that identifies what we've been calling the "target cache" in eqiad, the one that will (eventually) have issues due to whatever bad traffic is currentl
BBlack added a comment.
A while after the above, @hoo started focusing on a different aspect of this we've been somewhat ignoring as more of a side-symptom: that there tend to be a lot of sockets in a strange state on the "target" varnish, to various MW nodes. They look strange on both sides, in t
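The strange half-closed sockets described above can be watched directly with a socket state filter, rather than grepping a full dump; a sketch (the narrowing filter is illustrative):

```shell
# List TCP sockets stuck in FIN-WAIT-2, with timer details (-o); on the
# affected MW hosts this would show the varnish peers holding half-closed
# connections. A filter such as  dst <cache-host-ip>  can narrow it down.
ss -t -o state fin-wait-2
```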
BBlack added a comment.
Updates from the Varnish side of things today (since I've been bad about getting commits/logs tagged onto this ticket):
18:15 - I took over looking at today's outburst on the Varnish side
The current target at the time was cp1053 (after elukey's earlier restart of cp1055 v
Stashbot added a comment.
Mentioned in SAL (#wikimedia-operations) [2017-10-28T19:39:06Z] Synchronized wmf-config/CommonSettings.php: Half the Flow -> Parsoid timeout (100s -> 50s) (T179156) (duration: 00m 51s)
Stashbot added a comment.
Mentioned in SAL (#wikimedia-operations) [2017-10-28T16:51:54Z] restart varnish backend on cp1055 - mailbox lag + T179156
hoo added a comment.
Also on mw1180:
$ sudo -u www-data ss --tcp -r -p > ss
$ cat ss | grep -c FIN-WAIT-2
16
$ cat ss | grep -c cp1055
18
$ cat ss | grep -v cp1055 | grep -c FIN-WAIT-2
0
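The ad-hoc greps above generalize to a single tabulation over a saved `ss --tcp -r` dump; the sample data below is invented to mirror the output being discussed:

```shell
# Hypothetical sample of a saved `ss --tcp -r` dump (hostnames made up
# to mirror the diagnostics above): state, Recv-Q, Send-Q, local, peer.
cat > /tmp/ss.sample <<'EOF'
FIN-WAIT-2 0 0 mw1180:https cp1055:41000
ESTAB      0 0 mw1180:https cp1053:41001
FIN-WAIT-2 0 0 mw1180:https cp1055:41002
ESTAB      0 0 mw1180:https cp1066:41003
EOF
# Count FIN-WAIT-2 sockets per peer cache host in one pass
awk '$1 == "FIN-WAIT-2" { split($5, a, ":"); n[a[1]]++ }
     END { for (h in n) print n[h], h }' /tmp/ss.sample
# → 2 cp1055
```

On a real MW host this would point at the one cache backend hoarding half-closed connections, as in the output quoted above.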
hoo added a comment.
Happening again, this time on cp1055.
Example from mw1180:
$ ss --tcp -r | grep -oP 'cp\d+' | sort | uniq -c
2 cp1053
20 cp1055
2 cp1066
1 cp1068
Also:
$ cat /tmp/apache_status.mw1180.1509206746.txt | grep 10.64.32.107 | wc -l
31
$ cat /tmp/apache_sta
gerritbot added a comment.
Change 386939 merged by BBlack:
[operations/puppet@production] Varnish: puppetize per-backend between_bytes_timeout
https://gerrit.wikimedia.org/r/386939
gerritbot added a comment.
Change 386939 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Varnish: puppetize per-backend between_bytes_timeout
https://gerrit.wikimedia.org/r/386939
Stashbot added a comment.
Mentioned in SAL (#wikimedia-operations) [2017-10-27T17:54:14Z] Taking mwdebug1001 to do tests regarding T179156
Stashbot added a comment.
Mentioned in SAL (#wikimedia-operations) [2017-10-27T15:50:58Z] Synchronized wmf-config/Wikibase-production.php: Disable constraints check with SPARQL for now (T179156) (duration: 00m 50s)
gerritbot added a comment.
Change 386833 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable constraints check with SPARQL for now
https://gerrit.wikimedia.org/r/386833
Lucas_Werkmeister_WMDE added a comment.
(Permalink: https://grafana.wikimedia.org/dashboard/db/wikidata-quality?panelId=10&fullscreen&orgId=1&from=now-2d&to=now)
Slightly more permanent link, I think: https://grafana.wikimedia.org/dashboard/db/wikidata-quality?panelId=10&fullscreen&orgId=1&from=15
gerritbot added a comment.
Change 386833 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Disable constraints check with SPARQL for now
https://gerrit.wikimedia.org/r/386833
hoo added a comment.
In T179156#3715446, @BBlack wrote:
In T179156#3715432, @hoo wrote:
I think I found the root cause now; it seems it's actually related to the WikibaseQualityConstraints extension:
Isn't that the same extension referenced in the suspect commits mentioned above?
18:51 ladsgroup
BBlack added a comment.
In T179156#3715432, @hoo wrote:
I think I found the root cause now; it seems it's actually related to the WikibaseQualityConstraints extension:
Isn't that the same extension referenced in the suspect commits mentioned above?
18:51 ladsgroup@tin: Synchronized php-1.31.0-w
gerritbot added a comment.
Change 386824 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Revert "cache_text: raise MW connection limits to 10K"
https://gerrit.wikimedia.org/r/386824
BBlack added a comment.
Unless anyone objects, I'd like to start with reverting our emergency varnish max_connections changes from https://gerrit.wikimedia.org/r/#/c/386756 . Since the end of the log above, connection counts have returned to normal (~100), about 1/10th of the normal 1K limit
BBlack added a comment.
My gut instinct remains what it was at the end of the log above. I think something in the revert of wikidatawiki to wmf.4 fixed this. And I think the timing alignment of the "Fix sorting of NullResults" changes plus the initial ORES->wikidata fatals makes those in particu
BBlack added a comment.
Copying this in from etherpad (this is less awful than 6 hours of raw IRC+SAL logs, but still pretty verbose):
# cache servers work ongoing here, ethtool changes that require short depooled downtimes around short ethernet port outages:
17:49 bblack: ulsfo cp servers: rollin
Marostegui added a comment.
From those two masters' (s4 and s5) graphs, we can see that whatever happened happened at exactly the same time on both servers, so it is unlikely that the databases are the cause; we are just seeing the consequences.
hoo added a comment.
This has some potentially interesting patterns:
watchlist, recentchanges, contributions, logpager replicas at that time:
s4: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1053&var-port=9104&from=1509043675617&to=150906167