On Mon, Aug 4, 2014 at 7:21 PM, Joel Rees <joel.r...@gmail.com> wrote:
> On Tue, Aug 5, 2014 at 6:20 AM, Tom H <tomh0...@gmail.com> wrote:
>> On Mon, Aug 4, 2014 at 4:18 PM, Andrew McGlashan
>> <andrew.mcglas...@affinityvision.com.au> wrote:
>>> On 5/08/2014 5:44 AM, Erwan David wrote:
>>>> Le 04/08/2014 21:34, Tom H a écrit :
>>>>>
>>>>> Suppose that you have a 16-node cluster, some patches were applied to
>>>>> the systems overnight, a mistake was made, and you have to correct
>>>>> this mistake on all of the systems during trading hours. Once you get
>>>>> all the OKs that are needed for this kind of emergency change, the
>>>>> head of the trading desk that uses that cluster calls you and says
>>>>> "I'm going to be on the line for as long as you're working on our
>>>>> system." So you fix one node, reboot it, make sure that it's back in
>>>>> the cluster and doing its job, and fix another, etc. You can be sure
>>>>> that everyone's happier that the systems boot quickly and that the
>>>>> cluster was running with 15 rather than 16 nodes for as few minutes as
>>>>> possible (because you can be sure that the fact that this cluster
>>>>> wasn't running at full capacity for X minutes will come up in
>>>>> managerial meetings, both in IT ones and in IT-Business ones).
>>>
>>> The argument here is likely that the upgrade should have been tested on
>>> a test cluster FIRST and perhaps extensively -- if you have that many
>>> servers in play, you should have a development, test and production
>>> environment to work with and very stringent change control methods in place.
>>
>> Come on! Changes go through dev and uat before being rolled out to
>> prod. The night-shift sysadmin who made the changes screwed up. It
>> happens...
>
> When the operating system itself tries to hold the night-shift admin
> by the hand, we have serious problems.
>
> Current trading systems are completely wrong. It's no surprise if they
> can't get the failover part right, either.

The init system isn't baby-sitting the sysadmin and it has nothing to do with trading system failover. It's a question of having to correct a configuration error one node at a time, rebooting each node as quickly as possible, while the other nodes keep on doing whatever they're meant to be doing.