Hello HA enthusiasts,

I've come across an interesting article on how high availability historically evolved from the perspective of database engines (possibly -- I couldn't witness this first hand since I don't have a time machine, but some of you can perhaps comment on whether the picture matches your own experience). In part it may be a promo for a particular product, but this is in no way an attempt to endorse it -- the text is informative on its own merit:
<https://www.cockroachlabs.com/blog/brief-history-high-availability/>

It recounts how the first, easy step towards "ideal HA" was an active-passive setup, with stateful resources (like DBs) first using synchronous replication of the state -- hence their overall availability relying on the backup being functional -- and later asynchronous replication, at the cost of possibly losing bits of data. (Note that any non-trivial application will always require some notion of rather persistent state -- as mentioned several times in this venue, stateless services do not need to bother with all the "HA coordination" burden since there are typically light-weight alternatives for "bring services up in this order" type of tasks, hence I explicitly exclude them from further discussion.)

Then it talks about "sharding" (I must admit I hadn't heard this term before): splitting a two-node active-passive monolith into multiple active-passive pairs using some domain-specific cuts (like primary key ranges for tables in a DB), plus some kind of gateway in front of them used to route requests to the corresponding pair (a toy sketch of such routing follows below).

Finally, the evolution brought us to active-active setups, which typically solve the consistency issues amongst partly independent nodes with after-the-fact conflict reconciliation. The alternative to this is before-the-fact consensus negotiation on what the next "true" shared state will be -- they call this arrangement multi-active, and apparently it means that the main mechanisms of the corosync-pacemaker stack, membership and consensus, are duplicated privately at the resource level.

* * *

This brings me to what I want to discuss -- the relevancy of corosync-pacemaker clusters in the light of increasingly common resource-level "private" clustering (amongst other trends like the push towards containerization), and how to perhaps rearticulate its mission to stay relevant for years to come.

I perceive pacemaker's biggest value currently in:

* HA-fying plain non-distributed services, either as active-passive or even active-active, provided that the "shared state" problem is either non-existent or off-loaded elsewhere -- distributed file system/storage, distributed DB, etc.

* helping with the "last mile" for multiple-actors-ready active-passive services (matches the multi-role resource agent arrangement)

* multisite/cluster-of-clusters handling in combination with booth

and their (almost) arbitrarily complex combinations, all while achieving proper sanity through node-level isolation should HA-damaging failures occur.

On the other hand, with standalone self-clustering resources (at best, they could reuse the facilities of corosync alone for their function), perhaps the only value added would be this "isolation" part -- but then stonith-ng/pacemaker-fenced together with a static configuration file would be all that's needed for such a resource to hook into it (a sketch of such a hook follows below as well).

Note that the "sharding gateway/router", conflict reconciliation and perhaps even consensus negotiation all appear to be highly application specific. To be relevant in those contexts, the opposite of "external wrapping" would be needed -- making the framework complete, offering a library/API so that applications are built on top of it natively.
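To make the "domain-specific cuts plus a routing gateway" idea concrete, here is a minimal Python sketch of routing by primary-key range; the shard boundaries and node names are purely made up for illustration, not taken from the article:

  import bisect

  # Upper bounds (exclusive) of the primary-key ranges, one cut per
  # active-passive pair; keys beyond the last bound go to the last pair.
  SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000]   # hypothetical cuts
  SHARD_PAIRS = [
      ("db1-active", "db1-passive"),   # keys < 1_000_000
      ("db2-active", "db2-passive"),   # 1_000_000 <= keys < 2_000_000
      ("db3-active", "db3-passive"),   # keys >= 2_000_000
  ]

  def route(primary_key):
      """Pick the active-passive pair responsible for this key."""
      idx = bisect.bisect_right(SHARD_UPPER_BOUNDS, primary_key)
      return SHARD_PAIRS[idx]

  # The gateway would forward the request to the active member of the
  # returned pair, switching to the (promoted) passive one on failover.
  print(route(42))          # -> ('db1-active', 'db1-passive')
  print(route(1_500_000))   # -> ('db2-active', 'db2-passive')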
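And for the "isolation is the only value added" case, roughly how a self-clustering application could outsource just the isolation step to the cluster fencer -- a sketch only, assuming pacemaker's stonith_admin tool is available on the node and fencing devices are already configured; the node name and the surrounding failure handling are placeholders:

  import subprocess

  def isolate_peer(node_name):
      """Ask stonith-ng/pacemaker-fenced (via stonith_admin) to cut off
      a misbehaving peer; the application keeps its own membership and
      consensus and only delegates the actual fencing."""
      result = subprocess.run(
          ["stonith_admin", "--fence", node_name],
          capture_output=True, text=True,
      )
      if result.returncode != 0:
          # Fencing failed or timed out -- the safe reaction is usually
          # to stop serving rather than risk split-brain.
          print("fencing %s failed: %s" % (node_name, result.stderr.strip()))
          return False
      return True

  # hypothetical usage inside the application's failure handler:
  # if not isolate_peer("db2-active"):
  #     stop_serving_requests()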
An example of this native approach that I came across when doing brief research some time ago is <https://github.com/NetComposer/nkcluster> (on that note, Erlang was designed specifically with fault-tolerant, resilient and distributed applications in mind, making me wonder if it was ever considered originally in what later became pacemaker).

Also, one of the fields where pacemaker used to be very helpful was concise startup/shutdown ordering amongst multiple on-node services. This is partially obviated by smart init managers, most notably systemd on the Linux platform, playing in a whole different league than the old, inflexible init systems of the past when the foundation of pacemaker was laid out.

* * *

Please don't take this as blasphemy, I am just trying to put my head out of the tunnel (or sand, if you want), to view the value of the corosync-pacemaker stack in the IT infrastructures of today and the future, and to gather feedback on this topic, perhaps together with ideas on how to stay relevant amongst all the "private clustering" and the proliferation of resource management and orchestration we have been observing over the past years (which makes the landscape slightly different than it was when heartbeat [and Red Hat Cluster Suite] was the thing).

Please share your thoughts with me/us, even if it will not be the most encouraging thing to hear, since:
- staying realistic is important
- staying relevant is what prevents becoming a fossil tomorrow :-)

Happy World Teacher's Day.

--
Jan (Poki)