Executive summary:
I moved the ldap servers today. Gerrit validates ldap certificates
differently from all our other services, so its logins broke and it
took several ops several hours to sort out what was happening.
Everything is working again now.
Details:
As part of an elaborate post-Tampa ballet[1] I moved the ldap
servers from virt1000 and virt0 to ldap-eqiad (aka neptunium for the
time being) and ldap-codfw (aka labcontrol2001). This change was made
this morning via puppet[2].
Much to my delight, labs handled the change gracefully and without
any service interruptions.
Wikitech suffered a brief outage because I neglected to note that
it depends on an ldap server name in the mediawiki config. I hotfixed
that on virt1000 and also submitted a proper patch[3] for review. With
that change wikitech returned to normal, although (as usual) caches are
broken and many users will have to log out and in again to get all the
labs features they're used to.
With the change in ldap server, Gerrit logins went down and stayed
down. At various times Marc, Rob, Brandon and I were all involved in
troubleshooting. Several changes were made to the ldap setup
cluster-wide[4][5] -- these changes are probably correct, but did Gerrit
no good (and getting them applied without Gerrit was no walk in the
park).
After a great many more blind alleys, Marc noted that we typically
handle ldap certificate validation by specifying a root cert in
ldap.conf, and that is not the Proper Debian Way. Apparently we've just
been lucky so far that most of our ldap services use ldap.conf rather
than the systemwide ca-certificate system.
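A quick way to see which side of that split a host is on is to try a
TLS handshake against the ldap server using only the system trust
store. The following is a minimal sketch, not what we actually ran: it
assumes the server accepts LDAPS on port 636 (a StartTLS-only server on
389 would need a different check), and the hostname in the usage note is
a hypothetical placeholder. With cafile=None, Python falls back to its
default trust store, which on Debian is normally the bundle maintained
by update-ca-certificates.

```python
import socket
import ssl


def make_context(cafile=None):
    """Build a verifying TLS context.

    cafile=None means "use the default system trust store"; pass a path
    to test against one specific root cert instead.
    """
    return ssl.create_default_context(cafile=cafile)


def server_cert_validates(host, port=636, cafile=None, timeout=5.0):
    """Return True if the TLS handshake with host:port verifies.

    Returns False only when certificate verification fails. Connection
    errors (refused, timeout, DNS) are deliberately left to propagate so
    they are not mistaken for a trust problem.
    """
    ctx = make_context(cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        return False
```

Usage would look like server_cert_validates("ldap.example.org"), where
the hostname is a stand-in. A False result with the system store but a
True result when cafile points at the root cert named in ldap.conf would
suggest the host only works for clients that read ldap.conf.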
The right solution is to drop trusted certs into
/usr/local/share/ca-certificates (as .crt files) and then regenerate
/etc/ssl/certs/ca-certificates.crt by running update-ca-certificates.
Marc did this on ytterbium (the Gerrit host) and Gerrit immediately
started working again.
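For anyone repeating this by hand before it is puppetized, the two
steps can be sketched roughly as below. This is an illustration under
assumptions, not Marc's actual hotfix: the function names and the
run_update switch are made up for the example, and it needs root to
write to the real drop directory. It copies a PEM file into the drop
directory with a .crt extension (update-ca-certificates only picks up
.crt files there) and then runs update-ca-certificates.

```python
import shutil
import subprocess
from pathlib import Path

# Debian's location for locally trusted CA certificates.
CA_DROP_DIR = Path("/usr/local/share/ca-certificates")


def target_path(cert_file, drop_dir=CA_DROP_DIR):
    """Destination for cert_file inside drop_dir, forced to end in .crt."""
    return Path(drop_dir) / (Path(cert_file).stem + ".crt")


def install_ca_cert(cert_file, drop_dir=CA_DROP_DIR, run_update=True):
    """Copy a PEM-encoded CA cert into drop_dir and rebuild the bundle.

    run_update=False skips the update-ca-certificates call, which is
    handy for a dry run against a scratch directory.
    """
    dest = target_path(cert_file, drop_dir)
    shutil.copyfile(cert_file, dest)
    if run_update:
        # Regenerates the system-wide CA bundle; requires root.
        subprocess.run(["update-ca-certificates"], check=True)
    return dest
```

Some long-running services only read the trust store at startup, so a
restart may be needed after the bundle is regenerated.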
Remaining tasks are:
1) Puppetize Marc's hotfix[6]
2) (Maybe) totally refactor how we use ldap everywhere so that it
conforms to Debian standards.
3) Document all the services that rely on ldap so that the next time
someone (me, probably) messes with it, they know what to watch for[7].
Many thanks to Marc, Rob and Brandon for joining in when I called
out for help with this problem.
[1] https://wikitech.wikimedia.org/wiki/Ldap_rename
[2] https://gerrit.wikimedia.org/r/#/c/162689/
[3] https://gerrit.wikimedia.org/r/#/c/163189/
[4] https://gerrit.wikimedia.org/r/#/c/163183/
[5] https://gerrit.wikimedia.org/r/#/c/163194/
[6] https://gerrit.wikimedia.org/r/163222
[7] https://wikitech.wikimedia.org/wiki/LDAP
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l