Re: [tor-bugs] #32802 [Internal Services/Tor Sysadmin Team]: decomission kvm4

2020-03-09 Thread Tor Bug Tracker & Wiki
#32802: decomission kvm4
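-------------------------------------------------+--------------------
 Reporter:  anarcat                              |          Owner:  tpa
     Type:  project                              |         Status:  new
 Priority:  High                                 |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Major                                |     Resolution:
 Keywords:  tpa-roadmap-april                    |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+--------------------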
Changes (by anarcat):

 * keywords:   => tpa-roadmap-april


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32802 [Internal Services/Tor Sysadmin Team]: decomission kvm4

2020-01-17 Thread Tor Bug Tracker & Wiki
#32802: decomission kvm4
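-------------------------------------------------+--------------------
 Reporter:  anarcat                              |          Owner:  tpa
     Type:  project                              |         Status:  new
 Priority:  High                                 |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Major                                |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+--------------------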

Comment (by anarcat):

 we don't have docs on how to move instances just yet, but i added a
 section in our ganeti manual that should be filled in when we do. for now
 it has references to external manuals that could be used:

 https://help.torproject.org/tsa/howto/ganeti/#index14h2


Re: [tor-bugs] #32802 [Internal Services/Tor Sysadmin Team]: decomission kvm4

2019-12-19 Thread Tor Bug Tracker & Wiki

Comment (by anarcat):

 i will also note that meronense has been seeing disk errors for a while
 now, in #32692. might be another good indication something is wrong with
 this box (although mdadm thinks everything is fine).


Re: [tor-bugs] #32802 [Internal Services/Tor Sysadmin Team]: decomission kvm4

2019-12-18 Thread Tor Bug Tracker & Wiki

Comment (by anarcat):

 here's the disaster recovery plan i made up on the fly in #32801, which is
 relevant to the discussion here:

 > According to the Nextcloud spreadsheet (since LDAP is down), [machines
 running on kvm4] includes:
 >
 > || host           || service              || impact                  || mitigation ||
 > || alberti        || LDAP, db.tpo         || critical, no passwd change || read-only copies everywhere ||
 > || build-x86-09   || buildbox             || redundant               || N/A ||
 > || eugeni         || incoming mail, lists || critical, total outage  || peek at `tor-puppet/modules/postfix/files/virtual` and email people directly ||
 > || meronense      || metrics.tpo          || critical, total outage  || ? ||
 > || neriniflorum   || DNS                  || redundant, higher TTFB? || possible to remove from rotation ||
 > || oo-hetzner-03  || onionoo              || redundant               || ? ||
 > || pauli          || puppet               || major, no config management || use `cumin`, local git copies ||
 > || rouyi          || jenkins              || critical, total outage  || ? ||
 > || web-hetzner-01 || web mirror           || redundant, no effect?   || removed from rotation automatically ||
 > || weissi         || build box            || no windows builds       || N/A ||
 > || woronowii      || build box            || no windows builds       || N/A ||
 >
 > I'll note that it seems both windows build boxes are on the same machine
 so even if jenkins *would* be able to dispatch builds, we wouldn't be able
 to do those...
 >
 > Our disaster recovery plan so far is to wait for that rescue to succeed,
 which might take up to 24h but hopefully less.
 >
 > If that fails, I would suggest the following plan:
 >
 >  1. recover eugeni, pauli, alberti from backups on gnt-fsn or elsewhere
 (we need those three to build new machines)
 >  2. build a new ganeti cluster (because we can't recover all of this on
 gnt-fsn)
 >  3. restore remaining machines on the new cluster
 >  4. decommission kvm4 officially
 >
 > This could take a few days of work. :(

 Out of that, I would outline the following plan:

  1. in the short term: migrate eugeni, pauli and alberti to a HA cluster,
 probably gnt-fsn (yes, that means it will be over-allocated even more)
  2. in parallel or after (january): add a node or two to the ganeti
 cluster
  3. migrate meronense, neriniflorum, oo-hetzner-03, and rouyi to the new
 cluster

 This would leave the following boxes on kvm4, with the following
 rationale:

  * build-x86-09 - highly redundant, not urgent
  * web-hetzner-01 - one web node already present in the gnt-fsn cluster,
 moving this will not bring us more redundancy
  * weissi - hard to migrate
  * woronowii - hard to migrate

 At that point we'd have the choice to migrate the two Windows VMs (ugh) and
 the build box to the ganeti cluster, and we'd probably decom
 web-hetzner-01 or move it to kvm5 or some other host, then decom kvm4.

 How does that sound for a plan?

 Tickets would need to be created for each one of those tasks.
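
 As a rough illustration of the plan above, here is a minimal Python sketch
 that records which host goes where and lists one ticket summary per
 migration. The host names and groupings come from this comment; the data
 layout, destination labels, function names and ticket wording are all
 hypothetical, not anything that exists in our tooling.

```python
# Sketch only: the data layout, destination labels, and function names are
# assumptions for illustration. Host names and groupings are taken from
# the table and plan in this comment.

# Every VM listed as running on kvm4.
KVM4_HOSTS = {
    "alberti", "build-x86-09", "eugeni", "meronense", "neriniflorum",
    "oo-hetzner-03", "pauli", "rouyi", "web-hetzner-01", "weissi",
    "woronowii",
}

# Steps 1 and 3 of the plan: host -> intended destination.
MIGRATIONS = {
    "eugeni": "gnt-fsn (probably)",
    "pauli": "gnt-fsn (probably)",
    "alberti": "gnt-fsn (probably)",
    "meronense": "expanded ganeti cluster",
    "neriniflorum": "expanded ganeti cluster",
    "oo-hetzner-03": "expanded ganeti cluster",
    "rouyi": "expanded ganeti cluster",
}

# Hosts that stay on kvm4 for now, with the rationale given in the plan.
DEFERRED = {
    "build-x86-09": "highly redundant, not urgent",
    "web-hetzner-01": "one web node already in gnt-fsn",
    "weissi": "hard to migrate",
    "woronowii": "hard to migrate",
}


def unaccounted(hosts, migrations, deferred):
    """Return hosts that are neither migrated nor explicitly deferred."""
    return sorted(hosts - set(migrations) - set(deferred))


def ticket_summaries(migrations):
    """Build one (hypothetical) ticket summary per planned migration."""
    return [
        f"migrate {host} from kvm4 to {dest}"
        for host, dest in migrations.items()
    ]


if __name__ == "__main__":
    print("unaccounted hosts:", unaccounted(KVM4_HOSTS, MIGRATIONS, DEFERRED))
    for summary in ticket_summaries(MIGRATIONS):
        print(summary)
```

 The `unaccounted` check is only there to confirm that every VM in the
 table is either scheduled to move or deliberately left on kvm4.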
