[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread daniel
daniel added a comment. The above patch is an alternative attempt to fix the lock retention issue. It's the best I can think of. Not sure this will fix the problem, though.

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread gerritbot
gerritbot added a comment. Change 442883 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler): [mediawiki/core@wmf/1.32.0-wmf.10] Minimize the work done within atomic section in insertRevisionon(). https://gerrit.wikimedia.org/r/442883

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread gerritbot
gerritbot added a comment. Change 442882 abandoned by Daniel Kinzler: Minimize the work done within atomic section in insertRevisionon(). Reason: should not be on master https://gerrit.wikimedia.org/r/442882

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread gerritbot
gerritbot added a comment. Change 442882 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler): [mediawiki/core@master] Minimize the work done within atomic section in insertRevisionon(). https://gerrit.wikimedia.org/r/442882

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread daniel
daniel added a comment. Here are a few things I poked at, without finding anything relevant: @Tgr suspects that something is grabbing a FOR UPDATE lock on revision_comment_temp. But the only code that appears to do that is in WikiPage::doDeleteArticleReal(), which shouldn't be called

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread daniel
daniel added a comment. After staring at the code a bit, my best guess is: The MCR refactoring introduced doAtomicSection() to RevisionStore::insertRevisionOn(), to preserve consistency between the revision, slots, and content tables. The atomic section also includes the code for writing the
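
The effect described in this comment, an atomic section that keeps its locks while extra writes happen inside it, can be sketched roughly as follows. This is a minimal illustration only: it uses SQLite through Python's standard sqlite3 module as a stand-in for the production MySQL/InnoDB setup, and the table names, the helper names (open_conn, second_writer_blocked, demo) and the 0.1 second timeout are all assumptions made for the example, not MediaWiki code.

```python
import os
import sqlite3
import tempfile


def open_conn(path):
    # isolation_level=None disables implicit transactions, so BEGIN/COMMIT
    # are issued explicitly below; timeout is the lock wait limit in seconds.
    return sqlite3.connect(path, timeout=0.1, isolation_level=None)


def second_writer_blocked(path):
    """Return True if another connection times out waiting for the write lock."""
    other = open_conn(path)
    try:
        other.execute("BEGIN IMMEDIATE")  # asks for the write lock right away
        other.execute("ROLLBACK")
        return False
    except sqlite3.OperationalError:  # raised once the lock wait times out
        return True
    finally:
        other.close()


def demo():
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = open_conn(path)
    try:
        # Hypothetical tables, loosely modelled on the ones named in the thread.
        conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_text TEXT)")
        conn.execute("CREATE TABLE rev_comment (rev_id INTEGER, comment_text TEXT)")

        # Wide atomic section: the lock is taken early and is still held
        # while a concurrent writer (simulated here) tries to get in.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("INSERT INTO revision (rev_text) VALUES ('first')")
        blocked_wide = second_writer_blocked(path)
        conn.execute("INSERT INTO rev_comment VALUES (1, 'summary')")
        conn.execute("COMMIT")

        # Narrow atomic section: preparation happens before BEGIN, so a
        # concurrent writer arriving at that point is not blocked.
        comment = "summary".strip()
        blocked_narrow = second_writer_blocked(path)
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("INSERT INTO revision (rev_text) VALUES ('second')")
        conn.execute("INSERT INTO rev_comment VALUES (2, ?)", (comment,))
        conn.execute("COMMIT")
        return blocked_wide, blocked_narrow
    finally:
        conn.close()
        os.remove(path)
```

Calling demo() returns a pair of booleans: whether the simulated second writer timed out during the wide section, and whether it timed out while the narrow variant was still preparing its data. SQLite locks the whole database file rather than individual rows, so this only mirrors the general shape of the problem, not InnoDB's actual locking behaviour.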

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread Tgr
Tgr added a comment. There are about 1000 errors (in the deploy window, on Commons/Wikidata) where the failing query is on the revision_comment_temp table, and only about 50 where it is not. So I think it is fair to assume that is the primary cause.
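
A tally like the one above could be produced along these lines. This is only a sketch: the regular expression, the function name tally_by_table and the sample list are assumptions for illustration, and the UPDATE query in the sample is made up. Only the INSERT statement is taken from the thread.

```python
import re
from collections import Counter

# Matches the target table of a simple write query (assumed query shapes only).
WRITE_TARGET = re.compile(
    r"\b(?:INSERT\s+INTO|UPDATE|DELETE\s+FROM)\s+`?(\w+)", re.IGNORECASE
)


def tally_by_table(failing_queries):
    """Count failing queries per target table; unparsed queries go to '(unknown)'."""
    counts = Counter()
    for query in failing_queries:
        match = WRITE_TARGET.search(query)
        counts[match.group(1) if match else "(unknown)"] += 1
    return counts


# Hypothetical sample input, not real log data.
sample = [
    "INSERT INTO revision_comment_temp (revcomment_rev,revcomment_comment_id) VALUES ('xx','x')",
    "INSERT INTO revision_comment_temp (revcomment_rev,revcomment_comment_id) VALUES ('xx','x')",
    "UPDATE page SET page_touched = 'x' WHERE page_id = 'x'",
]
counts = tally_by_table(sample)
```

With real input, the list would come from the failing-query field of the logged errors; how that field is exported from logstash is not shown in the thread.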

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread daniel
daniel added a comment. The DBPerformance log shows a spike during the time wmf-10 was deployed on group1: https://logstash.wikimedia.org/goto/7c86a7d63a305c220a37a3a49844ef2c. The vast majority of entries are for commonswiki. Here are a few examples: Sub-optimal transaction on DB(s) [10.64.48.23

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread Tgr
Tgr added a comment. In T198350#4321620, @matmarex wrote: The errors also did not appear (or at least not in notable numbers) when the changes were deployed to the first set of production wikis on Tuesday (mediawiki.org and test.wp, test2.wp). If this is really MCR-related, that was on the

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread daniel
daniel added a comment. @Aklapper As far as I can see, recent instances of the first two issues, T179884 and T197464#4321254, were probably caused by this. The "overwriting image" one (T198177) seems off. It may still be a consequence somehow, but that issue is about updates to the image table,

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread daniel
daniel added a comment. In T198350#4321620, @matmarex wrote: In T198350#4320899, @AlexisJazz wrote: Assuming this was not a case of "It compiles, ship it!" I am curious as to why this wasn't noticed when testing. It appears to require multiple users making actions on a wiki simultaneously, and

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread daniel
daniel added a comment. I created a patch that reverts the MCR patches related to RevisionStore, but keeps the change that introduces PageUpdater. We could deploy the branch with the RevisionStore stuff reverted, and see if it still blows up. Whether or not it does, we'll know more about the

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread gerritbot
gerritbot added a comment. Change 442834 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler): [mediawiki/core@wmf/1.32.0-wmf.10] Revert MCR RevisionStore changes https://gerrit.wikimedia.org/r/442834

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread matmarex
matmarex added a comment. In T198350#4320899, @AlexisJazz wrote: Assuming this was not a case of "It compiles, ship it!" I am curious as to why this wasn't noticed when testing. It appears to require multiple users making actions on a wiki simultaneously, and usually when you test a change

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread jcrespo
jcrespo added a comment. {icon heart} Tgr analysis

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread DerHexer
DerHexer added a comment. I got the first error when I used Special:Nuke on Wikimedia Commons. Is it possible that something has not been updated with this old tool?

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread jcrespo
jcrespo added a comment. @Addshore Most of those you point to happen all the time, unlike the ones @Marostegui pointed out, which are new.

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread Addshore
Addshore added a comment. In T198350#4320941, @Marostegui wrote: From what I can see it was not only related to that table and to that write, there are lots of others, but the INSERT INTO revision_comment_temp (revcomment_rev,revcomment_comment_id) VALUES ('xx','x') appears quite a lot. There

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-28 Thread Tgr
Tgr added a comment. There has always been a slow but steady stream of lock timeouts: https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=h@e2dcd68&_a=h@5f113f2 F23048550: logstash.wikimedia.org_app_kibana.png revision_comment_temp is the only one that spiked after the deploy(*), so

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread AlexisJazz
AlexisJazz added a comment. Assuming this was not a case of "It compiles, ship it!" I am curious as to why this wasn't noticed when testing.

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread Yann
Yann added a comment. There are at least 150 files without a page (and probably more). Starting from https://commons.wikimedia.org/wiki/File:NIG-ARG_(1).jpg at 21:40 (at least) until

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread Tgr
Tgr added a comment. 1770 errors in that one hour, only 240 of them have Wikibase in the stack trace, so not really Wikidata related. They do all target revision_comment_temp. Would be nice to know where the other leg of that lock is. In theory there should only be a conflict when two processes

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread daniel
daniel added a comment. Is that stack trace representative? Is it always INSERT INTO revision_comment_temp? It's quite possible that this was caused by the MCR patches, but so far, I don't have any clue as to how or why.

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread Yann
Yann added a comment. Yes, there are a number of files without pages around 22:00 UTC https://commons.wikimedia.org/w/index.php?title=Special:NewFiles=20180627200112=50

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread Raymond
Raymond added a comment. I guess a cleanup routine/check is necessary now. Half-done image uploads like https://commons.wikimedia.org/wiki/File:Typhoon_MyGuide_3500_mobile_-_controller_-_Intel_PXA255A0C300-1180.jpg (missing file page). For this one I will try to create the file page manually.

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread Addshore
Addshore added a comment. Looking at logstash, this should now have recovered.

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread Herzi.Pinki
Herzi.Pinki added a comment. for me it's 100% failures

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread Yann
Yann added a comment. Again [WzPsbApAAD0AAFX1EFwAAABX] 2018-06-27 19:58:52: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread Yann
Yann added a comment. Again with the same file: [WzPrpApAIEIAAJ1@bdMAAACA] 2018-06-27 19:55:32: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread Yann
Yann added a comment. [WzPq4gpAIDAAAH@yqFsW] 2018-06-27 19:52:19: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" while trying to edit https://commons.wikimedia.org/wiki/File:SHKF-logo.png

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread Stashbot
Stashbot added a comment. Mentioned in SAL (#wikimedia-operations) [2018-06-27T19:52:12Z] Rolling back group1 due to rise in error rate (T198350)

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread dduvall
dduvall added a comment. A number of SlowTimer errors have shown up in fatalmonitor as well. I'm going to roll back the train for now.

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread DerHexer
DerHexer added a comment. Throws API errors: API request failed (internal_api_error_DBQueryError): [WzPl3wpAIDYAAF1ZcmYC] Database query error. at Wed, 27 Jun 2018 19:30:57 GMT served by mw1342 And UI errors: [WzPmNgpAMFQAAK-3yloW] 2018-06-27 19:32:23: Fatal exception of type

[Wikidata-bugs] [Maniphest] [Commented On] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment

2018-06-27 Thread AlexisJazz
AlexisJazz added a comment. Just uploaded a screenshot, cropped it a bit, tried to upload a new version over it, guess what? Database error A database query error has occurred. This may indicate a bug in the software. [WzPlqwpAIDAAAENRhT8AAADP] 2018-06-27 19:30:05: Fatal exception of type