On 05/07/17 14:55, Ken Gaillot wrote:
> Wow! I'm looking forward to the September summit talk.
Me too! Congratulations on the release :)

Chrissie

> On 07/05/2017 01:52 AM, Digimer wrote:
>> Hi all,
>>
>> I suspect by now, many of you here have heard me talk about the Anvil! intelligent availability platform. Today, I am proud to announce that it is ready for general use!
>>
>> https://github.com/ClusterLabs/striker/releases/tag/v2.0.0
>>
>> I started five years ago with the idea of building an "Availability Appliance": a single machine where any part could fail, be removed and be replaced without needing a maintenance window. A system with no single point of failure anywhere, wrapped behind a very simple interface.
>>
>> The underlying architecture that provides this redundancy was laid down years ago as an early tutorial and has been field-tested all over North America and around the world in the years since. In that time, the Anvil! platform has demonstrated over 99.9999% availability!
>>
>> Back then, the goal was to write a web interface that made the Anvil! platform easy to use. Then, about two years ago, I decided that an Anvil! could be much, much more than just an appliance.
>>
>> It could think for itself.
>>
>> Today, I would like to announce version 2.0.0. This release introduces the ScanCore "decision engine". ScanCore can be thought of as a sort of "Layer 3" availability platform. Where Corosync provides membership and communications, and Pacemaker (or rgmanager) sits on top of it, monitoring applications and handling fault detection and recovery, ScanCore sits on top of both, gathering disparate data, analyzing it and making "big picture" decisions about how best to protect the hosted servers.
>>
>> Examples:
>>
>> 1. All servers are on node 1, and node 1 suffers a cooling fan failure. ScanCore compares node 1's health against node 2's, waits a period of time in case the fault is transient, and then autonomously live-migrates the servers to node 2. Later, node 2 suffers a drive failure, degrading the underlying RAID array. ScanCore can then compare the relative risks of a failed fan versus a degraded RAID array, determine that the failed fan is less risky, and automatically migrate the servers back to node 1. If a hot spare kicks in and the array returns to an Optimal state, ScanCore will again migrate the servers back to node 2. When node 1's fan failure is finally repaired, the servers stay on node 2: both nodes are now equally healthy, so there is no benefit to migrating.
>>
>> 2. Input power is lost to one UPS, but not the second UPS. ScanCore knows that good power is still available and so doesn't react in any way. If input power is lost to both UPSes, however, then ScanCore will decide that the greatest risk to server availability is no longer unexpected component failure but depleting batteries. Given this, it will decide that the best option for protecting the hosted servers is to shed load and maximize run time. If the power stays out for too long, ScanCore will determine that a hard power-off is imminent, gracefully shut down all hosted servers, withdraw and power off. Later, when power returns, the Striker dashboards will monitor the charge rate of the UPSes and, as soon as it is safe to do so, restart the nodes and restore full redundancy.
>>
>> 3. Similar to case 2, ScanCore can gather temperature data from multiple sources and use this data to distinguish localized cooling failures from environmental cooling failures, like the loss of an HVAC or AC system. In the former case, ScanCore will migrate servers off the affected node and, if critical temperatures are reached, shut down systems before hardware damage can occur. In the latter case, ScanCore will decide that minimizing thermal output is the best way to protect hosted servers and so will shed load to accomplish this. If necessary to avoid damage, ScanCore will perform a full shutdown. Once ScanCore (on the low-powered Striker dashboards) determines thermal levels are safe again, it will restart the nodes and restore full redundancy.
>>
>> All of this intelligence is of little use, of course, if it is hard to build and maintain an Anvil! system. Perhaps the greatest lesson learned from our old tutorial was that the barrier to entry had to be reduced dramatically.
>>
>> https://www.alteeve.com/w/Build_an_m2_Anvil!
>>
>> So, this release also dramatically simplifies going from bare iron to provisioned, protected servers. Even with no experience in availability at all, a tech should be able to go from iron in boxes to provisioned servers in one or two days. Almost all steps have been automated, which serves the core goal of maximum reliability by minimizing the chances for human error.
>>
>> This version also introduces the ability to run entirely offline. The Anvil! is now entirely self-contained, with internal repositories making it possible to fully manage an Anvil! with no access to the outside world, including rebuilding Striker dashboards or Anvil! nodes after a major fault and building new Anvil! node pairs.
>>
>> There is so much more that the Anvil! platform can do, but this announcement is already quite long, so I'll stop here.
>>
>> I'm more than happy to answer any questions and, of course, I would very much love to hear feedback, suggestions, feature requests or critiques.
>>
>> Finally, I want to thank the rest of the team at Alteeve. Without them keeping the lights on and our customers happy, I would never have been able to put in the time needed to make this release possible. And, of course, thanks to all of you for the years of advice, banter and debate. I still have very much to learn!
>>
>> Now, time to start working full time on version 3!
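
To make example 1's decision logic concrete, here is a minimal sketch of the "compare node health, migrate only to a strictly healthier node" idea. This is only an illustration, not ScanCore's actual code (ScanCore itself is written in Perl); the fault names, risk weights and function names below are invented for this example.

# Minimal illustrative sketch, NOT ScanCore's actual code.
# Higher weight = higher risk. A degraded RAID array is treated as riskier
# than a failed cooling fan, matching example 1 above.
FAULT_WEIGHTS = {
    "fan_failed": 1,
    "raid_degraded": 2,
}


def health_score(faults):
    """Sum the risk weights of a node's active faults (0 = perfectly healthy)."""
    return sum(FAULT_WEIGHTS.get(fault, 1) for fault in faults)


def choose_host(current_host, peer_host, current_faults, peer_faults):
    """Return the node the servers should run on.

    Servers move only if the peer is strictly healthier; when both nodes are
    equally healthy there is no benefit to migrating, so they stay put.
    (A real implementation would also wait out a grace period in case the
    fault turns out to be transient.)
    """
    if health_score(peer_faults) < health_score(current_faults):
        return peer_host
    return current_host


# Walking through example 1:
print(choose_host("node1", "node2", {"fan_failed"}, set()))              # node2
print(choose_host("node2", "node1", {"raid_degraded"}, {"fan_failed"}))  # node1
print(choose_host("node1", "node2", {"fan_failed"}, set()))              # node2 (hot spare restored the array)
print(choose_host("node2", "node1", set(), set()))                       # node2 (fan repaired; no benefit to moving)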
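
Examples 2 and 3 share a common pattern: do nothing while redundancy remains, shed load once the remaining safety margin becomes the main risk, and shut down gracefully before a hard failure. A rough sketch of that pattern for the UPS case, again with invented field names and thresholds rather than ScanCore's real ones:

# Rough illustrative sketch of the power-loss decision from example 2.
# Not ScanCore's actual code; UpsState and the 10-minute threshold are invented.
from dataclasses import dataclass


@dataclass
class UpsState:
    on_battery: bool                   # True once input (mains) power is lost
    estimated_runtime_minutes: float   # runtime left at the current load


def power_action(upses, shutdown_runtime_minutes=10):
    """Pick a reaction to the current UPS states.

    - Any UPS still on mains power -> good power is available, do nothing.
    - All UPSes on battery         -> shed load to maximize run time.
    - Runtime nearly exhausted     -> gracefully shut down, withdraw and
                                      power off before a hard power-off.
    """
    if any(not ups.on_battery for ups in upses):
        return "no_action"
    if min(ups.estimated_runtime_minutes for ups in upses) <= shutdown_runtime_minutes:
        return "graceful_shutdown"
    return "shed_load"


print(power_action([UpsState(False, 90), UpsState(True, 80)]))  # no_action
print(power_action([UpsState(True, 60), UpsState(True, 45)]))   # shed_load
print(power_action([UpsState(True, 8), UpsState(True, 12)]))    # graceful_shutdown

The thermal case in example 3 would swap the UPS inputs for temperature readings from multiple sources, adding a check on whether only one node is hot (a localized failure: migrate servers off) or all sensors agree (an environmental failure: shed load, then shut down before critical temperatures are reached).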