Zabe added a comment.

  Ok, lemme try to quickly summarize what happened and what was done.
  
  Some cloudvirt hosts were accidentally rebooted, which took deployment-prep 
offline, and it did not come back up by itself.
  
    Request from - via deployment-cache-text06.deployment-prep.eqiad.wmflabs, 
ATS/8.0.8
    Error: 502, Next Hop Connection Failed at 2022-08-16 17:18:02 GMT
  
  Simply restarting apache on the app servers and traffic server on the cache 
servers did not seem to fix the problem.
  
  TNT noticed errors showing up when running logspam-watch on 
deployment-mwlog01, see T315379 <https://phabricator.wikimedia.org/T315379>. 
The fix for those was 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/822453, which was missing 
on that host, so apparently puppet was no longer running.
  
    samtar@deployment-puppetmaster04:~$ sudo run-puppet-agent
    Warning: Unable to fetch my node definition, but the agent run will 
continue:
    Warning: SSL_connect returned=1 errno=0 state=error: certificate verify 
failed (certificate revoked): [certificate revoked for 
/CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
    Info: Retrieving pluginfacts
    Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional 
resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: 
certificate verify failed (certificate revoked): [certificate revoked for 
/CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
    Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not 
retrieve file metadata for puppet:///pluginfacts: SSL_connect returned=1 
errno=0 state=error: certificate verify failed (certificate revoked): 
[certificate revoked for 
/CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
    Info: Retrieving plugin
    Error: /File[/var/lib/puppet/lib]: Failed to generate additional resources 
using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate 
verify failed (certificate revoked): [certificate revoked for 
/CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
    Error: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve 
file metadata for puppet:///plugins: SSL_connect returned=1 errno=0 
state=error: certificate verify failed (certificate revoked): [certificate 
revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
    Info: Loading facts
    Error: Could not retrieve catalog from remote server: SSL_connect 
returned=1 errno=0 state=error: certificate verify failed (certificate 
revoked): [certificate revoked for 
/CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
    Warning: Not using cache on failed catalog
    Error: Could not retrieve catalog; skipping run
    Error: Could not send report: SSL_connect returned=1 errno=0 state=error: 
certificate verify failed (certificate revoked): [certificate revoked for 
/CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]
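
  Every error in the agent output above comes down to the same revoked 
certificate. As a rough sketch (not something that was run during the 
incident), hard-wrapped agent output like this can be collapsed into a count 
of distinct causes. The function names and cause labels are made up for 
illustration; only the matched message fragments come from the agent output 
quoted in this comment.

```python
import re
from collections import Counter

# A new log entry starts with one of these level prefixes; any other
# non-empty line is treated as the wrapped continuation of the previous entry.
ENTRY_START = re.compile(r"^(Info|Warning|Error): ")

# Hypothetical cause labels mapped to message fragments seen in agent output.
CAUSES = {
    "certificate revoked": re.compile(
        r"certificate verify failed \(certificate revoked\)"
    ),
    "connection refused": re.compile(r"Connection refused"),
    "duplicate declaration": re.compile(r"Duplicate declaration"),
}


def join_wrapped(output):
    """Re-join hard-wrapped lines into one string per log entry."""
    entries = []
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def summarize(output):
    """Count Warning/Error entries per cause; unmatched ones go to 'other'."""
    counts = Counter()
    for entry in join_wrapped(output):
        if not entry.startswith(("Error:", "Warning:")):
            continue
        for name, pattern in CAUSES.items():
            if pattern.search(entry):
                counts[name] += 1
                break
        else:
            counts["other"] += 1
    return counts
```

  Calling `summarize()` on the pasted output makes it easier to see whether a 
later run fails for a new reason or for the same one as before.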
  
  I fixed this failure by regenerating the certificates. I followed 
https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster for that.
  
  Puppet was also failing to update for a second reason: a merge conflict 
between a patch and a local cherry-pick, see P32410 
<https://phabricator.wikimedia.org/P32410>. That merge conflict was fixed by 
ori, see T315395 <https://phabricator.wikimedia.org/T315395> for follow-up.
  
  At this point puppet was still not running, now failing with the following 
error.
  
    zabe@deployment-puppetmaster04:~$ sudo run-puppet-agent
    Warning: Unable to fetch my node definition, but the agent run will 
continue:
    Warning: Failed to open TCP connection to 
deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection 
refused - connect(2) for 
"deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
    Info: Retrieving pluginfacts
    Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional 
resources using 'eval_generate': Failed to open TCP connection to 
deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection 
refused - connect(2) for 
"deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
    Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not 
retrieve file metadata for puppet:///pluginfacts: Failed to open TCP connection 
to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection 
refused - connect(2) for 
"deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
    Info: Retrieving plugin
    Error: /File[/var/lib/puppet/lib]: Failed to generate additional resources 
using 'eval_generate': Failed to open TCP connection to 
deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection 
refused - connect(2) for 
"deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
    Error: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve 
file metadata for puppet:///plugins: Failed to open TCP connection to 
deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection 
refused - connect(2) for 
"deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
    Info: Loading facts
    Error: Could not retrieve catalog from remote server: Failed to open TCP 
connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 
(Connection refused - connect(2) for 
"deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
    Warning: Not using cache on failed catalog
    Error: Could not retrieve catalog; skipping run
    Error: Could not send report: Failed to open TCP connection to 
deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 (Connection 
refused - connect(2) for 
"deployment-puppetmaster04.deployment-prep.eqiad.wmflabs" port 8140)
  
  After some digging, it turned out that apache was not running and it refused 
to start.
  
    <zabe> AH00526: Syntax error on line 1 of 
/etc/apache2/conf-enabled/50-configmaster-port.conf
    <zabe> Cannot define multiple Listeners on the same IP:port
  
  https://gerrit.wikimedia.org/r/c/operations/puppet/+/823762/ fixed this. It 
seems to have been a problem that came up somewhere in 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/797222, 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/798615 and 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/798631.
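
  Apache refuses to start when the same listener is defined twice across its 
enabled config. As a rough sketch, assuming the duplicate is a literally 
repeated `Listen` directive (the linked patches may have caused it in a 
different way), a scan like this can point at both definitions. The function 
name, file names and port in the example are placeholders, not values from 
the affected host.

```python
import re
from collections import defaultdict

# Matches "Listen <address>" at the start of a line; commented-out
# directives are skipped because "#" precedes the keyword.
LISTEN = re.compile(r"^\s*Listen\s+(\S+)", re.IGNORECASE)


def find_duplicate_listeners(configs):
    """Return Listen addresses that are defined more than once.

    configs maps a file name to that file's contents. Addresses are compared
    literally, so overlapping forms (a bare port vs. an explicit IP:port)
    are not detected; that is a deliberate simplification.
    """
    seen = defaultdict(list)
    for name, text in configs.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            match = LISTEN.match(line)
            if match:
                seen[match.group(1)].append((name, lineno))
    return {addr: places for addr, places in seen.items() if len(places) > 1}


if __name__ == "__main__":
    # Placeholder file names and contents, for illustration only.
    example = {
        "a.conf": "Listen 8080\n",
        "b.conf": "# Listen 9090\nListen 8080\n",
    }
    for addr, places in find_duplicate_listeners(example).items():
        print(addr, places)
```

  On a real host the mapping would be built by reading the files under the 
enabled config directories; that part is left out here.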
  
  At this stage puppet was finally running on puppetmaster, but it wasn't on 
`deployment-cache-text06`.
  
    samtar@deployment-cache-text06:~$ sudo run-puppet-agent
    Info: Using configured environment 'production'
    Info: Retrieving pluginfacts
    Info: Retrieving plugin
    Info: Retrieving locales
    Info: Loading facts
    Error: Could not retrieve catalog from remote server: Error 500 on SERVER: 
Server Error: Evaluation Error: Error while evaluating a Resource Statement, 
Evaluation Error: Error while evaluating a Resource Statement, Duplicate 
declaration: Class[Trafficserver] is already declared at (file: 
/etc/puppet/modules/trafficserver/manifests/instance.pp, line: 251); cannot 
redeclare (file: /etc/puppet/modules/trafficserver/manifests/instance.pp, line: 
251) (file: /etc/puppet/modules/trafficserver/manifests/instance.pp, line: 251, 
column: 5) (file: /etc/puppet/modules/profile/manifests/trafficserver/tls.pp, 
line: 168) on node deployment-cache-text06.deployment-prep.eqiad.wmflabs
    Warning: Not using cache on failed catalog
    Error: Could not retrieve catalog; skipping run
  
  Ori fixed this by reverting some puppet patches which seem to be incompatible 
with beta at this stage, see 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/823638 and 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/823639. The follow-up for 
this is T315394 <https://phabricator.wikimedia.org/T315394>.
  
  At this point puppet was finally running everywhere, but the beta cluster 
was still unreachable. A simple restart of trafficserver on 
`deployment-cache-text06` fixed that.

TASK DETAIL
  https://phabricator.wikimedia.org/T315350
