apurtell commented on code in PR #2461: URL: https://github.com/apache/phoenix/pull/2461#discussion_r3290020877
########## src/main/tla/ConsistentFailover/markdown/Admin.md: ########## @@ -0,0 +1,151 @@ +# Admin -- Operator-Initiated Actions + +**Source:** [`Admin.tla`](../Admin.tla) + +## Overview + +`Admin` models the human operator (Admin actor) who drives failover and abort via the `PhoenixHAAdminTool` CLI, which delegates to `HAGroupStoreManager` coprocessor endpoints. This module contains four actions: `AdminStartFailover` (initiate failover), `AdminAbortFailover` (abort an in-progress failover), `AdminGoOffline` (take a standby cluster offline), and `AdminForceRecover` (force-recover from OFFLINE). The last two are gated on the `UseOfflinePeerDetection` feature flag and model the proactive design for peer OFFLINE detection using `PhoenixHAAdminTool update --force`. + +These are the only actions in the specification that represent deliberate human decisions rather than automated system behavior. All four (`AdminStartFailover`, `AdminAbortFailover`, `AdminGoOffline`, `AdminForceRecover`) receive no fairness in the liveness specifications. The admin is genuinely non-deterministic. The admin might never initiate a failover, or might abort every failover attempt. Imposing fairness on admin actions would force unrealistic guarantees about human behavior. + +### Modeling Choice: Direct ZK Writes + +Unlike the peer-reactive transitions in [HAGroupStore.md](HAGroupStore.md), admin actions are direct ZK writes -- they are not watcher-dependent. The admin CLI writes directly to the local ZK znode via the coprocessor endpoint. No `zkPeerConnected` or `zkLocalConnected` guard is needed because the admin tool manages its own ZK connection independently of the `HAGroupStoreClient` watcher infrastructure. `AdminGoOffline` and `AdminForceRecover` also use the `--force` path which writes directly to ZK, so no `zkLocalConnected` guard is needed for those actions either. + +## Implementation Traceability + +| TLA+ Action | Java Source | +|---|---| +| `AdminStartFailover(c)` | `HAGroupStoreManager.initiateFailoverOnActiveCluster()` L375-400 | +| `AdminAbortFailover(c)` | `HAGroupStoreManager.setHAGroupStatusToAbortToStandby()` L419-425; also clears `failoverPending` (models `abortFailoverListener` L173-185) | +| `AdminGoOffline(c)` | `PhoenixHAAdminTool update --state OFFLINE` (gated on `UseOfflinePeerDetection`) | +| `AdminForceRecover(c)` | `PhoenixHAAdminTool update --force --state STANDBY` (OFFLINE -> S; gated on `UseOfflinePeerDetection`) | + +```tla +EXTENDS SpecState, Types +``` + +## AdminStartFailover -- Initiate Failover + +The admin initiates failover on the active cluster. Two paths depending on current state: + +### AIS Path: AIS -> ATS + +The cluster is fully in sync. The OUT directory must be empty and all live RS must be in SYNC mode. DEAD RSes are allowed -- an RS can crash while the cluster is AIS without changing the HA group state. The implementation checks `clusterState = AIS`, not per-RS modes; a DEAD RS is not writing, so the remaining SYNC RSes and empty OUT dir ensure safety. + +### ANIS Path: ANIS -> ANISTS + +The cluster is not in sync (at least one RS is in STORE_AND_FWD). The implementation only validates the current state (ANIS) and peer state -- no `outDirEmpty` or writer-mode guards are needed because the forwarder will drain OUT after the transition. The ANISTS -> ATS transition (`ANISTSToATS` in [HAGroupStore.md](HAGroupStore.md)) guards on `outDirEmpty` and the anti-flapping gate. + +### Peer-State Guard (Both Paths) + +The peer must be in a stable standby state (S or DS) to prevent initiating a new failover during the non-atomic window of a previous failover (where the peer may still be in ATS). Without this guard, the admin could produce an irrecoverable `(ATS, ATS)` or `(ANISTS, ATS)` deadlock where both clusters are transitioning to standby with mutations blocked on both sides. + +### Post-Condition + +Cluster `c` transitions to ATS or ANISTS, both of which map to the ACTIVE_TO_STANDBY role, blocking mutations (`isMutationBlocked() = true`). Review Comment: Let's wait until the implementation changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
