apurtell commented on code in PR #2461:
URL: https://github.com/apache/phoenix/pull/2461#discussion_r3290020877


##########
src/main/tla/ConsistentFailover/markdown/Admin.md:
##########
@@ -0,0 +1,151 @@
+# Admin -- Operator-Initiated Actions
+
+**Source:** [`Admin.tla`](../Admin.tla)
+
+## Overview
+
+`Admin` models the human operator (Admin actor) who drives failover and abort 
via the `PhoenixHAAdminTool` CLI, which delegates to `HAGroupStoreManager` 
coprocessor endpoints. This module contains four actions: `AdminStartFailover` 
(initiate failover), `AdminAbortFailover` (abort an in-progress failover), 
`AdminGoOffline` (take a standby cluster offline), and `AdminForceRecover` 
(force-recover from OFFLINE). The last two are gated on the 
`UseOfflinePeerDetection` feature flag and model the proactive design for peer 
OFFLINE detection using `PhoenixHAAdminTool update --force`.
+
+These are the only actions in the specification that represent deliberate 
human decisions rather than automated system behavior. All four 
(`AdminStartFailover`, `AdminAbortFailover`, `AdminGoOffline`, 
`AdminForceRecover`) receive no fairness in the liveness specifications. The 
admin is genuinely non-deterministic. The admin might never initiate a 
failover, or might abort every failover attempt. Imposing fairness on admin 
actions would force unrealistic guarantees about human behavior.
+
+### Modeling Choice: Direct ZK Writes
+
+Unlike the peer-reactive transitions in [HAGroupStore.md](HAGroupStore.md), 
admin actions are direct ZK writes -- they are not watcher-dependent. The admin 
CLI writes directly to the local ZK znode via the coprocessor endpoint. No 
`zkPeerConnected` or `zkLocalConnected` guard is needed because the admin tool 
manages its own ZK connection independently of the `HAGroupStoreClient` watcher 
infrastructure. `AdminGoOffline` and `AdminForceRecover` also use the `--force` 
path which writes directly to ZK, so no `zkLocalConnected` guard is needed for 
those actions either.
+
+## Implementation Traceability
+
+| TLA+ Action | Java Source |
+|---|---|
+| `AdminStartFailover(c)` | 
`HAGroupStoreManager.initiateFailoverOnActiveCluster()` L375-400 |
+| `AdminAbortFailover(c)` | 
`HAGroupStoreManager.setHAGroupStatusToAbortToStandby()` L419-425; also clears 
`failoverPending` (models `abortFailoverListener` L173-185) |
+| `AdminGoOffline(c)` | `PhoenixHAAdminTool update --state OFFLINE` (gated on 
`UseOfflinePeerDetection`) |
+| `AdminForceRecover(c)` | `PhoenixHAAdminTool update --force --state STANDBY` 
(OFFLINE -> S; gated on `UseOfflinePeerDetection`) |
+
+```tla
+EXTENDS SpecState, Types
+```
+
+## AdminStartFailover -- Initiate Failover
+
+The admin initiates failover on the active cluster. Two paths depending on 
current state:
+
+### AIS Path: AIS -> ATS
+
+The cluster is fully in sync. The OUT directory must be empty and all live RS 
must be in SYNC mode. DEAD RSes are allowed -- an RS can crash while the 
cluster is AIS without changing the HA group state. The implementation checks 
`clusterState = AIS`, not per-RS modes; a DEAD RS is not writing, so the 
remaining SYNC RSes and empty OUT dir ensure safety.
+
+### ANIS Path: ANIS -> ANISTS
+
+The cluster is not in sync (at least one RS is in STORE_AND_FWD). The 
implementation only validates the current state (ANIS) and peer state -- no 
`outDirEmpty` or writer-mode guards are needed because the forwarder will drain 
OUT after the transition. The ANISTS -> ATS transition (`ANISTSToATS` in 
[HAGroupStore.md](HAGroupStore.md)) guards on `outDirEmpty` and the 
anti-flapping gate.
+
+### Peer-State Guard (Both Paths)
+
+The peer must be in a stable standby state (S or DS) to prevent initiating a 
new failover during the non-atomic window of a previous failover (where the 
peer may still be in ATS). Without this guard, the admin could produce an 
irrecoverable `(ATS, ATS)` or `(ANISTS, ATS)` deadlock where both clusters are 
transitioning to standby with mutations blocked on both sides.
+
+### Post-Condition
+
+Cluster `c` transitions to ATS or ANISTS, both of which map to the 
ACTIVE_TO_STANDBY role, blocking mutations (`isMutationBlocked() = true`).

Review Comment:
   Let's wait until the implementation changes. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to