[ 
https://issues.apache.org/jira/browse/HBASE-29975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-29975:
----------------------------------------
    Description: 
TLA+ is a formal specification language created by Leslie Lamport for designing 
and verifying concurrent and distributed systems. The name stands for the 
Temporal Logic of Actions, a mathematical framework that combines first-order 
logic with temporal operators to reason about how system state evolves over 
time. TLA+ does not produce executable code. Instead it produces a precise, 
machine-checkable mathematical model of a system's behavior that can be 
exhaustively verified against safety and liveness properties. When the model is 
a high-fidelity representation of the real system, proposed design and 
architectural changes can be checked against the full space of possible 
executions, surfacing critical logic bugs at design time, before any code is 
written. This can save weeks or months of development effort that would 
otherwise be spent discovering and debugging subtle concurrency issues in a 
running system.

The HBase AssignmentManager is a core component of the HBase master that 
manages the lifecycle of regions across a cluster of RegionServers. It 
coordinates region assignment, unassignment, moves, and reopens; handles 
RegionServer crashes through the ServerCrashProcedure (SCP); and recovers its 
own state after a master crash through a durable procedure store. The 
correctness of these interactions is critical. Bugs can cause data loss (double 
assignment / split-brain writes), data unavailability (lost or stuck regions), 
or cluster hangs (deadlocked procedures).

This TLA+ specification models the AssignmentManager as a state machine with 20 
state variables capturing:

* Region lifecycle: In-memory master state (regionState) and persistent 
hbase:meta state (metaTable), tracking regions through OFFLINE → OPENING → OPEN 
→ CLOSING → CLOSED (and the failure states FAILED_OPEN and ABNORMALLY_CLOSED).
* Asynchronous RPC channels: master-to-RS commands (dispatchedOps) and 
RS-to-master transition reports (pendingReports), modeled as sets of records 
with non-deterministic delivery.
* Procedure state: Inlined into region state records (type, step, target 
server, retry count), with a durable procedure store (procStore) that survives 
master crashes.
* Server liveness: Per-server online/crashed state, ZooKeeper ephemeral nodes, 
and WAL fencing state.
* Crash recovery: Multi-step ServerCrashProcedure (detect → assign meta → 
snapshot regions → fence WALs → reassign) and master crash/recovery (volatile 
state lost, durable state replayed).
* PEWorker thread pool: Available worker count (availableWorkers), async 
suspension (suspendedOnMeta), and sync blocking (blockedOnMeta) when hbase:meta 
is unavailable during SCP meta-reassignment.
* Keyspace infrastructure: Per-region key range (regionKeyRange) mapping each 
region to a [startKey, endKey) interval or NoRange for unused identifiers. 
DeployedRegions tile [0, MaxKey) at Init; unused identifiers are available for 
future split/merge.
* Parent-child procedure framework: Per-region parentProc record tracking 
parent procedure type and step (split or merge), persists across child TRSP 
lifecycles and survives master crash.
* Table identity infrastructure: Per-region regionTable tracking which table a 
region belongs to, with guard predicates (NoTableExclusiveLock, TableLockFree) 
and an invariant (TableLockExclusivity) to ensure exclusive table-level locks 
prevent concurrent region-level operations on the same table.
* Split procedure: Modeled through the parent-child framework.
* Merge procedure: Modeled through the parent-child framework.
* CreateTable procedure: Modeled through the parent-child framework.
* DeleteTable procedure: DeleteTableProcedure modeled with two actions: 
DeleteTablePrepare(t) acquires exclusive table lock on all regions and prepares 
the delete; DeleteTableDone(t) atomically executes it.
* TruncateTable procedure: TruncateTableProcedure modeled with four actions: 
TruncatePrepare(t) acquires exclusive table lock on all regions and prepares 
the operation; TruncateDeleteMeta(t,r) deletes the old regions from hbase:meta; 
TruncateCreateMeta(t, r) creates a new region and adds it to hbase:meta; and 
TruncateDone(t) completes the operation.

The specification defines 34 safety invariants verified at every reachable 
state, including the NoDoubleAssignment (no region writable on two servers), 
MetaConsistency (persistent and in-memory state agree), FencingOrder (WALs 
fenced before reassignment), NoLostRegions (no region stuck without a procedure 
after crash recovery), NoPEWorkerDeadlock (thread pool exhaustion detection), 
KeyspaceCoverage (all keys covered by exactly one live region), 
SplitMergeMutualExclusion (split daughters and merged regions cannot have 
active parent procedures), AtMostOneCarryingMeta (at most one server carrying 
meta), MergeCompleteness (completed merge has cleaned-up targets), 
TableLockExclusivity (exclusive table locks prevent concurrent region ops on 
the same table), and others specific to how things are modeled in the spec.

Three liveness properties verify temporal guarantees: MetaEventuallyAssigned 
(meta eventually reassigned after crash), OfflineEventuallyOpen (ASSIGN-bearing 
OFFLINE region eventually opens), and SCPEventuallyDone (started SCP eventually 
completes). Two action constraints enforce transition validity and SCP 
monotonicity. One state constraint (SplitMergeConstraint) bounds concurrent 
split/merge procedures for TLC tractability.

The model checker runs in two tiers: fast exhaustive verification at 3 regions 
/ 2 servers (1 deployed + 2 unused for split daughters), and deep random 
simulation at 9r/3s with extended retries and merge enabled. Configurable 
"quirk" flags allow toggling known implementation bugs to correctly adhere to 
implementation semantics, reproduce failures and validate fixes.

The model is incomplete but under active development and can found at:

https://github.com/apurtell/hbase/tree/WORK-architecture/src/main/spec

  was:
TLA+ is a formal specification language for designing and verifying concurrent 
and distributed systems. The name stands for the "Temporal Logic of Actions", a 
mathematical framework that combines first-order logic with temporal operators 
to reason about how system state evolves over time. When the model is a 
high-fidelity representation of the real system, proposed design and 
architectural changes can be checked against the full space of possible 
executions, surfacing logic bugs at design time, before any code is written. 
This can save weeks or months of development effort that would otherwise be 
spent discovering and debugging subtle concurrency issues in a running system.

A TLA+ specification describes a system as a state machine, an initial state 
predicate (Init), a next-state relation (Next) that defines every legal 
transition, and a collection of invariants, properties that must hold in every 
reachable state. The TLC model checker then systematically explores every 
possible execution of this state machine, checking each property at every 
state. If a property is violated, TLC produces a minimal counterexample trace 
showing the exact sequence of steps that led to the failure.

The HBase AssignmentManager is a core component of the HBase master that 
manages the lifecycle of regions across a cluster of RegionServers. It 
coordinates region assignment, unassignment, moves, and reopens, handles 
RegionServer crashes through the ServerCrashProcedure, and recovers its own 
state after a master crash through a durable procedure store. The correctness 
of these interactions is critical.

This TLA+ specification models the AssignmentManager as a state machine 
capturing:

* Region lifecycle
* Asynchronous RPC channels: Master-to-RS commands and RS-to-master transition 
reports
* Procedure state: State records (type, step, target  server, retry count), 
with a durable procedure store that survives master crashes
* Server liveness: Per-server online/crashed state, ZooKeeper ephemeral nodes, 
and WAL fencing state
* Crash recovery: Multi-step ServerCrashProcedure 
* PEWorker thread pool

The specification defines safety invariants verified at every reachable state, 
including NoDoubleAssignment (no region writable on two servers), 
MetaConsistency (persistent and in-memory state agree), FencingOrder (WALs 
fenced before reassignment), NoLostRegions (no region stuck without a procedure 
after crash recovery), and NoPEWorkerDeadlock (thread pool exhaustion 
detection). The liveness property (MetaEventuallyAssigned) verifies that 
hbase:meta is eventually reassigned after a crash. Action constraints enforce 
transition validity and SCP monotonicity.

The model is incomplete but under active development and can found at:

https://github.com/apurtell/hbase/tree/WORK-architecture/src/main/spec


> TLA+ specification of the AssignmentManager
> -------------------------------------------
>
>                 Key: HBASE-29975
>                 URL: https://issues.apache.org/jira/browse/HBASE-29975
>             Project: HBase
>          Issue Type: Task
>          Components: master, proc-v2, Region Assignment
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>
> TLA+ is a formal specification language created by Leslie Lamport for 
> designing and verifying concurrent and distributed systems. The name stands 
> for the Temporal Logic of Actions, a mathematical framework that combines 
> first-order logic with temporal operators to reason about how system state 
> evolves over time. TLA+ does not produce executable code. Instead it produces 
> a precise, machine-checkable mathematical model of a system's behavior that 
> can be exhaustively verified against safety and liveness properties. When the 
> model is a high-fidelity representation of the real system, proposed design 
> and architectural changes can be checked against the full space of possible 
> executions, surfacing critical logic bugs at design time, before any code is 
> written. This can save weeks or months of development effort that would 
> otherwise be spent discovering and debugging subtle concurrency issues in a 
> running system.
> The HBase AssignmentManager is a core component of the HBase master that 
> manages the lifecycle of regions across a cluster of RegionServers. It 
> coordinates region assignment, unassignment, moves, and reopens; handles 
> RegionServer crashes through the ServerCrashProcedure (SCP); and recovers its 
> own state after a master crash through a durable procedure store. The 
> correctness of these interactions is critical. Bugs can cause data loss 
> (double assignment / split-brain writes), data unavailability (lost or stuck 
> regions), or cluster hangs (deadlocked procedures).
> This TLA+ specification models the AssignmentManager as a state machine with 
> 20 state variables capturing:
> * Region lifecycle: In-memory master state (regionState) and persistent 
> hbase:meta state (metaTable), tracking regions through OFFLINE → OPENING → 
> OPEN → CLOSING → CLOSED (and the failure states FAILED_OPEN and 
> ABNORMALLY_CLOSED).
> * Asynchronous RPC channels: master-to-RS commands (dispatchedOps) and 
> RS-to-master transition reports (pendingReports), modeled as sets of records 
> with non-deterministic delivery.
> * Procedure state: Inlined into region state records (type, step, target 
> server, retry count), with a durable procedure store (procStore) that 
> survives master crashes.
> * Server liveness: Per-server online/crashed state, ZooKeeper ephemeral 
> nodes, and WAL fencing state.
> * Crash recovery: Multi-step ServerCrashProcedure (detect → assign meta → 
> snapshot regions → fence WALs → reassign) and master crash/recovery (volatile 
> state lost, durable state replayed).
> * PEWorker thread pool: Available worker count (availableWorkers), async 
> suspension (suspendedOnMeta), and sync blocking (blockedOnMeta) when 
> hbase:meta is unavailable during SCP meta-reassignment.
> * Keyspace infrastructure: Per-region key range (regionKeyRange) mapping each 
> region to a [startKey, endKey) interval or NoRange for unused identifiers. 
> DeployedRegions tile [0, MaxKey) at Init; unused identifiers are available 
> for future split/merge.
> * Parent-child procedure framework: Per-region parentProc record tracking 
> parent procedure type and step (split or merge), persists across child TRSP 
> lifecycles and survives master crash.
> * Table identity infrastructure: Per-region regionTable tracking which table 
> a region belongs to, with guard predicates (NoTableExclusiveLock, 
> TableLockFree) and an invariant (TableLockExclusivity) to ensure exclusive 
> table-level locks prevent concurrent region-level operations on the same 
> table.
> * Split procedure: Modeled through the parent-child framework.
> * Merge procedure: Modeled through the parent-child framework.
> * CreateTable procedure: Modeled through the parent-child framework.
> * DeleteTable procedure: DeleteTableProcedure modeled with two actions: 
> DeleteTablePrepare(t) acquires exclusive table lock on all regions and 
> prepares the delete; DeleteTableDone(t) atomically executes it.
> * TruncateTable procedure: TruncateTableProcedure modeled with four actions: 
> TruncatePrepare(t) acquires exclusive table lock on all regions and prepares 
> the operation; TruncateDeleteMeta(t,r) deletes the old regions from 
> hbase:meta; TruncateCreateMeta(t, r) creates a new region and adds it to 
> hbase:meta; and TruncateDone(t) completes the operation.
> The specification defines 34 safety invariants verified at every reachable 
> state, including the NoDoubleAssignment (no region writable on two servers), 
> MetaConsistency (persistent and in-memory state agree), FencingOrder (WALs 
> fenced before reassignment), NoLostRegions (no region stuck without a 
> procedure after crash recovery), NoPEWorkerDeadlock (thread pool exhaustion 
> detection), KeyspaceCoverage (all keys covered by exactly one live region), 
> SplitMergeMutualExclusion (split daughters and merged regions cannot have 
> active parent procedures), AtMostOneCarryingMeta (at most one server carrying 
> meta), MergeCompleteness (completed merge has cleaned-up targets), 
> TableLockExclusivity (exclusive table locks prevent concurrent region ops on 
> the same table), and others specific to how things are modeled in the spec.
> Three liveness properties verify temporal guarantees: MetaEventuallyAssigned 
> (meta eventually reassigned after crash), OfflineEventuallyOpen 
> (ASSIGN-bearing OFFLINE region eventually opens), and SCPEventuallyDone 
> (started SCP eventually completes). Two action constraints enforce transition 
> validity and SCP monotonicity. One state constraint (SplitMergeConstraint) 
> bounds concurrent split/merge procedures for TLC tractability.
> The model checker runs in two tiers: fast exhaustive verification at 3 
> regions / 2 servers (1 deployed + 2 unused for split daughters), and deep 
> random simulation at 9r/3s with extended retries and merge enabled. 
> Configurable "quirk" flags allow toggling known implementation bugs to 
> correctly adhere to implementation semantics, reproduce failures and validate 
> fixes.
> The model is incomplete but under active development and can found at:
> https://github.com/apurtell/hbase/tree/WORK-architecture/src/main/spec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to