#32802: decomission kvm4
-------------------------------------------------+---------------------
 Reporter:  anarcat                              |          Owner:  tpa
     Type:  project                              |         Status:  new
 Priority:  High                                 |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Major                                |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+---------------------

Comment (by anarcat):

 here's the disaster recovery plan i made up on the fly in #32801, which is
 relevant to the discussion here:

 > According to the Nextcloud spreadsheet (since LDAP is down), [machines
 > running on kvm4] includes:
 >
 > || host || service || impact || mitigation ||
 > || alberti || LDAP, db.tpo || critical, no passwd change || read-only copies everywhere ||
 > || build-x86-09 || buildbox || redundant || N/A ||
 > || eugeni || incoming mail, lists || critical, total outage || peek at `tor-puppet/modules/postfix/files/virtual` and email people directly ||
 > || meronense || metrics.tpo || critical, total outage || ? ||
 > || neriniflorum || DNS || redundant, higher TTFB? || possible to remove from rotation ||
 > || oo-hetzner-03 || onionoo || redundant || ? ||
 > || pauli || puppet || major, no config management || use `cumin`, local git copies ||
 > || rouyi || jenkins || critical, total outage || ? ||
 > || web-hetzner-01 || web mirror || redundant, no effect? || removed from rotation automatically ||
 > || weissi || build box || no windows builds || N/A ||
 > || woronowii || build box || no windows builds || N/A ||
 >
 > I'll note that it seems both windows build boxes are on the same machine,
 > so even if jenkins *would* be able to dispatch builds, we wouldn't be
 > able to do those...
 >
 > Our disaster recovery plan so far is to wait for that rescue to succeed,
 > which might take up to 24h but hopefully less.
 >
 > If that fails, I would suggest the following plan:
 >
 >  1. recover eugeni, pauli, alberti from backups on gnt-fsn or elsewhere
 >     (we need those three to build new machines)
 >  2. build a new ganeti cluster (because we can't recover all of this on
 >     gnt-fsn)
 >  3. restore remaining machines on the new cluster
 >  4. decommission kvm4 officially
 >
 > This could take a few days of work. :(

 Out of that, I would outline the following plan:

  1. in the short term: migrate eugeni, pauli and alberti to a HA cluster,
     probably gnt-fsn (yes, that means it will be over-allocated even more)
  2. in parallel or after (january): add a node or two to the ganeti
     cluster
  3. migrate meronense, neriniflorum, oo-hetzner-03, and rouyi to the new
     cluster

 This would leave the following boxes on kvm4, with the following
 rationale:

  * build-x86-09 - highly redundant, not urgent
  * web-hetzner-01 - one web node already present in the gnt-fsn cluster,
    moving this will not bring us more redundancy
  * weissi - hard to migrate
  * woronowii - hard to migrate

 At that point we'd have the choice to migrate the two windows VMs (ugh)
 and the build box to the ganeti cluster, and we'd probably decom
 web-hetzner-01 or move it to kvm5 or some other host, then decom kvm4.

 How does that sound for a plan? Tickets would need to be created for each
 one of those tasks.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32802#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online