Zabe added a comment.
Ok, lemme try to quickly summarize what happened and what was done.

Some cloudvirt hosts got accidentally rebooted, which caused deployment-prep to go offline, and it did not come back up by itself.

  Request from - via deployment-cache-text06.deployment-prep.eqiad.wmflabs, ATS/8.0.8
  Error: 502, Next Hop Connection Failed at 2022-08-16 17:18:02 GMT

Simply restarting apache on the app servers and traffic server on the cache servers did not seem to fix the problem.

TNT noticed errors showing up when running logspam-watch on deployment-mwlog01, see T315379 <https://phabricator.wikimedia.org/T315379>. The fix for those was https://gerrit.wikimedia.org/r/c/operations/puppet/+/822453, which was missing on that host, so apparently puppet was no longer running.

  samtar@deployment-puppetmaster04:~$ sudo run-puppet-agent
  Warning: Unable to fetch my node definition, but the agent run will continue:
  Warning: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
  Info: Retrieving pluginfacts
  Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
  Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
  Info: Retrieving plugin
  Error: /File[/var/lib/puppet/lib]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
  Error: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve file metadata for puppet:///plugins: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
  Info: Loading facts
  Error: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
  Warning: Not using cache on failed catalog
  Error: Could not retrieve catalog; skipping run
  Error: Could not send report: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]

I fixed this failure by regenerating the certificates, following https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster.

Puppet was also not updating for a second reason: a patch was in merge conflict with a local cherry-pick, see P32410 <https://phabricator.wikimedia.org/P32410>. That merge conflict was fixed by ori, see T315395 <https://phabricator.wikimedia.org/T315395> for follow-up.

At this point puppet was still not running, with the following failure.

  zabe@deployment-puppetmaster04:~$ sudo run-puppet-agent
  Warning: Unable to fetch my node definition, but the agent run will continue:
  Warning: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
  Info: Retrieving pluginfacts
  Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
  Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
  Info: Retrieving plugin
  Error: /File[/var/lib/puppet/lib]: Failed to generate additional resources using 'eval_generate': Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
  Error: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve file metadata for puppet:///plugins: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
  Info: Loading facts
  Error: Could not retrieve catalog from remote server: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
  Warning: Not using cache on failed catalog
  Error: Could not retrieve catalog; skipping run
  Error: Could not send report: Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection refused - connect(2) for "deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)

After some digging, it turned out that apache was not running and refused to start.

  <zabe> AH00526: Syntax error on line 1 of /etc/apache2/conf-enabled/50-configmaster-port.conf
  <zabe> Cannot define multiple Listeners on the same IP:port

https://gerrit.wikimedia.org/r/c/operations/puppet/+/823762/ fixed this. The problem seems to have come up somewhere in https://gerrit.wikimedia.org/r/c/operations/puppet/+/797222, https://gerrit.wikimedia.org/r/c/operations/puppet/+/798615 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/798631.

At this stage puppet was finally running on the puppetmaster, but not on `deployment-cache-text06`.

  samtar@deployment-cache-text06:~$ sudo run-puppet-agent
  Info: Using configured environment 'production'
  Info: Retrieving pluginfacts
  Info: Retrieving plugin
  Info: Retrieving locales
  Info: Loading facts
  Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Class[Trafficserver] is already declared at (file: /etc/puppet/modules/trafficserver/manifests/instance.pp, line: 251); cannot redeclare (file: /etc/puppet/modules/trafficserver/manifests/instance.pp, line: 251) (file: /etc/puppet/modules/trafficserver/manifests/instance.pp, line: 251, column: 5) (file: /etc/puppet/modules/profile/manifests/trafficserver/tls.pp, line: 168) on node deployment-cache-text06.deployment-prep.eqiad.wmflabs
  Warning: Not using cache on failed catalog
  Error: Could not retrieve catalog; skipping run

Ori fixed this by reverting some puppet patches which seem to be incompatible with beta at this stage, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/823638 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/823639. The follow-up for this is T315394 <https://phabricator.wikimedia.org/T315394>.

At this point puppet was finally running everywhere, but the beta cluster was still unreachable. A simple restart of trafficserver on `deployment-cache-text06` fixed that.

TASK DETAIL
  https://phabricator.wikimedia.org/T315350
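
The three puppet agent failures quoted above each have a distinct signature, so a saved agent log can be triaged mechanically. The following is only a rough sketch, not something that was run during this incident: the labels (`cert_revoked`, `master_unreachable`, `catalog_compile_error`) and the `classify` helper are made up for illustration, and the patterns are derived solely from the error text in this comment, so other puppet versions may word things differently.

```python
import re
from typing import List

# Hypothetical labels paired with patterns taken from the log excerpts above.
PATTERNS = [
    # Agent certificate was revoked on the puppetmaster.
    ("cert_revoked",
     re.compile(r"certificate verify failed \(certificate revoked\)")),
    # Nothing is listening on the puppetmaster port (apache was down).
    ("master_unreachable",
     re.compile(r"Failed to open TCP connection to \S+:8140 \(Connection refused")),
    # The master answered but could not compile the catalog.
    ("catalog_compile_error",
     re.compile(r"Error 500 on SERVER: .*Duplicate declaration")),
]


def classify(output: str) -> List[str]:
    """Return the label of every known failure pattern found in the output,
    in the order the patterns are listed in PATTERNS."""
    return [label for label, pattern in PATTERNS if pattern.search(output)]


if __name__ == "__main__":
    # Shortened sample line; the host name here is a placeholder.
    sample = (
        "Error: Could not send report: Failed to open TCP connection to "
        "puppetmaster.example:8140 (Connection refused - connect(2) for "
        '"puppetmaster.example" port 8140)'
    )
    print(classify(sample))
```

An empty list means none of the three known signatures matched, which says nothing about whether the run actually succeeded.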
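
For the apache failure ("Cannot define multiple Listeners on the same IP:port"), a quick way to spot the clash before restarting is to look for the same `Listen` argument appearing more than once across the config files. The sketch below is an assumption-laden illustration: `find_duplicate_listens` is a hypothetical helper, the file names and ports in the usage example are invented, and it only compares the literal text after `Listen`, so it would miss overlaps such as `Listen 8080` versus `Listen 0.0.0.0:8080`.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches an uncommented Listen directive and captures its first argument.
# Apache directive names are case-insensitive.
LISTEN_RE = re.compile(r"^\s*Listen\s+(\S+)", re.IGNORECASE)


def find_duplicate_listens(files: Dict[str, str]) -> Dict[str, List[Tuple[str, int]]]:
    """Given a mapping of file name -> file content, return every Listen
    argument declared more than once, with the (file, line number) of each
    declaration."""
    seen = defaultdict(list)
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            match = LISTEN_RE.match(line)
            if match:
                seen[match.group(1)].append((name, lineno))
    return {addr: locs for addr, locs in seen.items() if len(locs) > 1}


if __name__ == "__main__":
    # Invented file names and contents, purely for illustration.
    configs = {
        "a.conf": "Listen 8080\n",
        "b.conf": "# Listen 8080\nlisten 8080\nListen 9090\n",
    }
    for addr, locations in find_duplicate_listens(configs).items():
        print(addr, locations)
```

In practice the contents would be read from the files under the apache config directory; passing them in as strings keeps the sketch self-contained.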