Wow! I'm looking forward to the September summit talk.

On 07/05/2017 01:52 AM, Digimer wrote:
> Hi all,
>
> I suspect by now, many of you here have heard me talk about the Anvil!
> intelligent availability platform. Today, I am proud to announce that it
> is ready for general use!
>
> https://github.com/ClusterLabs/striker/releases/tag/v2.0.0
>
> I started five years ago with an idea of building an "Availability
> Appliance". A single machine where any part could be failed, removed and
> replaced without needing a maintenance window. A system with no single
> point of failure anywhere, wrapped behind a very simple interface.
>
> The underlying architecture that provides this redundancy was laid
> down years ago as an early tutorial and has been field tested all over
> North America and around the world in the years since. In that time, the
> Anvil! platform has demonstrated over 99.9999% availability!
>
> Starting back then, the goal was to write the web interface that made
> it easy to use the Anvil! platform. Then, about two years ago, I decided
> that an Anvil! could be much, much more than just an appliance.
>
> It could think for itself.
>
> Today, I would like to announce version 2.0.0. This release
> introduces the ScanCore "decision engine". ScanCore can be thought of as
> a sort of "Layer 3" availability platform. Where Corosync provides
> membership and communications, with Pacemaker (and rgmanager) sitting on
> top monitoring applications and handling fault detection and recovery,
> ScanCore sits on top of both, gathering disparate data, analyzing it and
> making "big picture" decisions on how to best protect the hosted servers.
>
> Examples:
>
> 1. All servers are on node 1, and node 1 suffers a cooling fan failure.
> ScanCore compares against node 2's health, waits a period of time in
> case it is a transient fault and then autonomously live-migrates the
> servers to node 2. Later, node 2 suffers a drive failure, degrading the
> underlying RAID array.
> ScanCore can then compare the relative risks of a
> failed fan versus a degraded RAID array, determine that the failed fan
> is less risky and automatically migrate the servers back to node 1. If a
> hot-spare kicks in and the array returns to an Optimal state, ScanCore
> will again migrate the servers back to node 2. When node 1's fan failure
> is finally repaired, the servers stay on node 2, as there is no benefit
> to migrating now that both nodes are equally healthy.
>
> 2. Input power is lost to one UPS, but not the second UPS. ScanCore
> knows that good power is available and, so, doesn't react in any way. If
> input power is lost to both UPSes, however, then ScanCore will decide
> that the greatest risk to server availability is no longer unexpected
> component failure, but instead depleting the batteries. Given this, it
> will decide that the best option to protect the hosted servers is to
> shed load and maximize run time. If the power stays out for too long,
> then ScanCore will determine hard off is imminent, and decide to
> gracefully shut down all hosted servers, withdraw and power off. Later,
> when power returns, the Striker dashboards will monitor the charge rate
> of the UPSes and as soon as it is safe to do so, restart the nodes and
> restore full redundancy.
>
> 3. Similar to case 2, ScanCore can gather temperature data from multiple
> sources and use this data to distinguish localized cooling failures from
> environmental cooling failures, like the loss of an HVAC or AC system.
> In the former case, ScanCore will migrate servers off and, if critical
> temperatures are reached, shut down systems before hardware damage can
> occur. In the latter case, ScanCore will decide that minimizing thermal
> output is the best way to protect hosted servers and, so, will shed load
> to accomplish this. If necessary to avoid damage, ScanCore will perform
> a full shut down.
> Once ScanCore (on the low-powered Striker dashboards)
> determines thermal levels are safe again, it will restart the nodes and
> restore full redundancy.
>
> All of this intelligence is of little use, of course, if it is hard to
> build and maintain an Anvil! system. Perhaps the greatest lesson learned
> from our old tutorial was that the barrier to entry had to be reduced
> dramatically.
>
> https://www.alteeve.com/w/Build_an_m2_Anvil!
>
> So, this release also dramatically simplifies going from bare iron
> to provisioned, protected servers. Even with no
> experience in availability at all, a tech should be able to go from iron
> in boxes to provisioned servers in one or two days. Almost all steps have
> been automated, which serves the core goal of maximum reliability by
> minimizing the chances for human error.
>
> This version also introduces the ability to run entirely offline. This
> version of the Anvil! is entirely self-contained with internal
> repositories, making it possible to fully manage an Anvil! with no
> external access to the outside world, including rebuilding Striker
> dashboards or Anvil! nodes after a major fault and building new Anvil!
> node pairs.
>
> There is so much more that the Anvil! platform can do, but this
> announcement is already quite long, so I'll stop here.
>
> I'm more than happy to answer any questions and, of course, I would
> very much love to hear feedback, suggestions, feature requests or
> critiques.
>
> Finally, I want to thank the rest of the team at Alteeve. Without them
> keeping the lights on and our customers happy, I would never have been
> able to put in the time needed to make this release possible. And, of
> course, to all of you for the years of advice, banter and debate. I
> still have very much to learn!
>
> Now, time to start working full time on version 3!
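
For anyone trying to picture the decision logic in examples 1 and 2 of the quoted announcement, here is a rough sketch in Python. It is not taken from ScanCore; the fault names, risk weights, the 10-minute shutdown threshold and both function names are assumptions made up for illustration. It also leaves out the wait period ScanCore uses to rule out transient faults.

```python
from dataclasses import dataclass, field
from typing import Sequence

# Hypothetical risk weights; higher means riskier. These numbers are
# illustrative assumptions, not values used by ScanCore.
FAULT_WEIGHTS = {
    "fan_failure": 2,
    "raid_degraded": 5,
}


@dataclass
class Node:
    name: str
    faults: set = field(default_factory=set)

    def risk(self) -> int:
        """Sum the weights of all active faults (0 means healthy)."""
        return sum(FAULT_WEIGHTS[f] for f in self.faults)


def choose_host(current: Node, peer: Node) -> Node:
    """Return the node that should host the servers.

    Servers move only when the peer is strictly healthier. On a tie
    they stay where they are, since migrating brings no benefit.
    """
    return peer if peer.risk() < current.risk() else current


def power_action(ups_on_mains: Sequence[bool], minutes_left: float,
                 shutdown_below: float = 10.0) -> str:
    """Pick a response to the UPS state (hypothetical threshold).

    ups_on_mains holds one flag per UPS: True if it has input power.
    minutes_left is the estimated battery run time remaining.
    """
    if any(ups_on_mains):
        return "none"        # good power is still available somewhere
    if minutes_left <= shutdown_below:
        return "shutdown"    # hard off is imminent, shut down gracefully
    return "shed_load"       # on batteries: maximize run time


# Walking through example 1:
node1 = Node("node1", {"fan_failure"})
node2 = Node("node2")
host = choose_host(node1, node2)      # fan failed on node 1: move to node 2
node2.faults.add("raid_degraded")
host = choose_host(host, node1)       # degraded RAID is riskier: back to node 1
node2.faults.clear()
host = choose_host(host, node2)       # array Optimal again: back to node 2
node1.faults.clear()
host = choose_host(host, node1)       # both healthy: stay on node 2
```

The strict "less than" comparison in choose_host is what keeps the servers on node 2 at the end of example 1: with both nodes equally healthy, there is nothing to gain from another migration.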
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org