[
https://issues.apache.org/jira/browse/HBASE-29975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Kyle Purtell updated HBASE-29975:
----------------------------------------
Description:
TLA+ is a formal specification language created by Leslie Lamport for designing
and verifying concurrent and distributed systems. The name stands for the
Temporal Logic of Actions, a mathematical framework that combines first-order
logic with temporal operators to reason about how system state evolves over
time. TLA+ does not produce executable code. Instead it produces a precise,
machine-checkable mathematical model of a system's behavior that can be
exhaustively verified against safety and liveness properties. When the model is
a high-fidelity representation of the real system, proposed design and
architectural changes can be checked against the full space of possible
executions, surfacing critical logic bugs at design time, before any code is
written. This can save weeks or months of development effort that would
otherwise be spent discovering and debugging subtle concurrency issues in a
running system.
The HBase AssignmentManager is a core component of the HBase master that
manages the lifecycle of regions across a cluster of RegionServers. It
coordinates region assignment, unassignment, moves, and reopens; handles
RegionServer crashes through the ServerCrashProcedure (SCP); and recovers its
own state after a master crash through a durable procedure store. The
correctness of these interactions is critical. Bugs can cause data loss (double
assignment / split-brain writes), data unavailability (lost or stuck regions),
or cluster hangs (deadlocked procedures).
This TLA+ specification models the AssignmentManager as a state machine with 20
state variables capturing:
* Region lifecycle: In-memory master state (regionState) and persistent
hbase:meta state (metaTable), tracking regions through OFFLINE → OPENING → OPEN
→ CLOSING → CLOSED (and the failure states FAILED_OPEN and ABNORMALLY_CLOSED).
* Asynchronous RPC channels: master-to-RS commands (dispatchedOps) and
RS-to-master transition reports (pendingReports), modeled as sets of records
with non-deterministic delivery.
* Procedure state: Inlined into region state records (type, step, target
server, retry count), with a durable procedure store (procStore) that survives
master crashes.
* Server liveness: Per-server online/crashed state, ZooKeeper ephemeral nodes,
and WAL fencing state.
* Crash recovery: Multi-step ServerCrashProcedure (detect → assign meta →
snapshot regions → fence WALs → reassign) and master crash/recovery (volatile
state lost, durable state replayed).
* PEWorker thread pool: Available worker count (availableWorkers), async
suspension (suspendedOnMeta), and sync blocking (blockedOnMeta) when hbase:meta
is unavailable during SCP meta-reassignment.
* Keyspace infrastructure: Per-region key range (regionKeyRange) mapping each
region to a [startKey, endKey) interval or NoRange for unused identifiers.
DeployedRegions tile [0, MaxKey) at Init; unused identifiers are available for
future split/merge.
* Parent-child procedure framework: Per-region parentProc record tracking
parent procedure type and step (split or merge), persists across child TRSP
lifecycles and survives master crash.
* Table identity infrastructure: Per-region regionTable tracking which table a
region belongs to, with guard predicates (NoTableExclusiveLock, TableLockFree)
and an invariant (TableLockExclusivity) to ensure exclusive table-level locks
prevent concurrent region-level operations on the same table.
* Split procedure: Modeled through the parent-child framework.
* Merge procedure: Modeled through the parent-child framework.
* CreateTable procedure: Modeled through the parent-child framework.
* DeleteTable procedure: DeleteTableProcedure modeled with two actions:
DeleteTablePrepare(t) acquires exclusive table lock on all regions and prepares
the delete; DeleteTableDone(t) atomically executes it.
* TruncateTable procedure: TruncateTableProcedure modeled with four actions:
TruncatePrepare(t) acquires exclusive table lock on all regions and prepares
the operation; TruncateDeleteMeta(t,r) deletes the old regions from hbase:meta;
TruncateCreateMeta(t, r) creates a new region and adds it to hbase:meta; and
TruncateDone(t) completes the operation.
The specification defines 34 safety invariants verified at every reachable
state, including the NoDoubleAssignment (no region writable on two servers),
MetaConsistency (persistent and in-memory state agree), FencingOrder (WALs
fenced before reassignment), NoLostRegions (no region stuck without a procedure
after crash recovery), NoPEWorkerDeadlock (thread pool exhaustion detection),
KeyspaceCoverage (all keys covered by exactly one live region),
SplitMergeMutualExclusion (split daughters and merged regions cannot have
active parent procedures), AtMostOneCarryingMeta (at most one server carrying
meta), MergeCompleteness (completed merge has cleaned-up targets),
TableLockExclusivity (exclusive table locks prevent concurrent region ops on
the same table), and others specific to how things are modeled in the spec.
Three liveness properties verify temporal guarantees: MetaEventuallyAssigned
(meta eventually reassigned after crash), OfflineEventuallyOpen (ASSIGN-bearing
OFFLINE region eventually opens), and SCPEventuallyDone (started SCP eventually
completes). Two action constraints enforce transition validity and SCP
monotonicity. One state constraint (SplitMergeConstraint) bounds concurrent
split/merge procedures for TLC tractability.
The model checker runs in two tiers: fast exhaustive verification at 3 regions
/ 2 servers (1 deployed + 2 unused for split daughters), and deep random
simulation at 9r/3s with extended retries and merge enabled. Configurable
"quirk" flags allow toggling known implementation bugs to correctly adhere to
implementation semantics, reproduce failures and validate fixes.
The model is incomplete but under active development and can found at:
https://github.com/apurtell/hbase/tree/WORK-architecture/src/main/spec
was:
TLA+ is a formal specification language for designing and verifying concurrent
and distributed systems. The name stands for the "Temporal Logic of Actions", a
mathematical framework that combines first-order logic with temporal operators
to reason about how system state evolves over time. When the model is a
high-fidelity representation of the real system, proposed design and
architectural changes can be checked against the full space of possible
executions, surfacing logic bugs at design time, before any code is written.
This can save weeks or months of development effort that would otherwise be
spent discovering and debugging subtle concurrency issues in a running system.
A TLA+ specification describes a system as a state machine, an initial state
predicate (Init), a next-state relation (Next) that defines every legal
transition, and a collection of invariants, properties that must hold in every
reachable state. The TLC model checker then systematically explores every
possible execution of this state machine, checking each property at every
state. If a property is violated, TLC produces a minimal counterexample trace
showing the exact sequence of steps that led to the failure.
The HBase AssignmentManager is a core component of the HBase master that
manages the lifecycle of regions across a cluster of RegionServers. It
coordinates region assignment, unassignment, moves, and reopens, handles
RegionServer crashes through the ServerCrashProcedure, and recovers its own
state after a master crash through a durable procedure store. The correctness
of these interactions is critical.
This TLA+ specification models the AssignmentManager as a state machine
capturing:
* Region lifecycle
* Asynchronous RPC channels: Master-to-RS commands and RS-to-master transition
reports
* Procedure state: State records (type, step, target server, retry count),
with a durable procedure store that survives master crashes
* Server liveness: Per-server online/crashed state, ZooKeeper ephemeral nodes,
and WAL fencing state
* Crash recovery: Multi-step ServerCrashProcedure
* PEWorker thread pool
The specification defines safety invariants verified at every reachable state,
including NoDoubleAssignment (no region writable on two servers),
MetaConsistency (persistent and in-memory state agree), FencingOrder (WALs
fenced before reassignment), NoLostRegions (no region stuck without a procedure
after crash recovery), and NoPEWorkerDeadlock (thread pool exhaustion
detection). The liveness property (MetaEventuallyAssigned) verifies that
hbase:meta is eventually reassigned after a crash. Action constraints enforce
transition validity and SCP monotonicity.
The model is incomplete but under active development and can found at:
https://github.com/apurtell/hbase/tree/WORK-architecture/src/main/spec
> TLA+ specification of the AssignmentManager
> -------------------------------------------
>
> Key: HBASE-29975
> URL: https://issues.apache.org/jira/browse/HBASE-29975
> Project: HBase
> Issue Type: Task
> Components: master, proc-v2, Region Assignment
> Reporter: Andrew Kyle Purtell
> Assignee: Andrew Kyle Purtell
> Priority: Major
>
> TLA+ is a formal specification language created by Leslie Lamport for
> designing and verifying concurrent and distributed systems. The name stands
> for the Temporal Logic of Actions, a mathematical framework that combines
> first-order logic with temporal operators to reason about how system state
> evolves over time. TLA+ does not produce executable code. Instead it produces
> a precise, machine-checkable mathematical model of a system's behavior that
> can be exhaustively verified against safety and liveness properties. When the
> model is a high-fidelity representation of the real system, proposed design
> and architectural changes can be checked against the full space of possible
> executions, surfacing critical logic bugs at design time, before any code is
> written. This can save weeks or months of development effort that would
> otherwise be spent discovering and debugging subtle concurrency issues in a
> running system.
> The HBase AssignmentManager is a core component of the HBase master that
> manages the lifecycle of regions across a cluster of RegionServers. It
> coordinates region assignment, unassignment, moves, and reopens; handles
> RegionServer crashes through the ServerCrashProcedure (SCP); and recovers its
> own state after a master crash through a durable procedure store. The
> correctness of these interactions is critical. Bugs can cause data loss
> (double assignment / split-brain writes), data unavailability (lost or stuck
> regions), or cluster hangs (deadlocked procedures).
> This TLA+ specification models the AssignmentManager as a state machine with
> 20 state variables capturing:
> * Region lifecycle: In-memory master state (regionState) and persistent
> hbase:meta state (metaTable), tracking regions through OFFLINE → OPENING →
> OPEN → CLOSING → CLOSED (and the failure states FAILED_OPEN and
> ABNORMALLY_CLOSED).
> * Asynchronous RPC channels: master-to-RS commands (dispatchedOps) and
> RS-to-master transition reports (pendingReports), modeled as sets of records
> with non-deterministic delivery.
> * Procedure state: Inlined into region state records (type, step, target
> server, retry count), with a durable procedure store (procStore) that
> survives master crashes.
> * Server liveness: Per-server online/crashed state, ZooKeeper ephemeral
> nodes, and WAL fencing state.
> * Crash recovery: Multi-step ServerCrashProcedure (detect → assign meta →
> snapshot regions → fence WALs → reassign) and master crash/recovery (volatile
> state lost, durable state replayed).
> * PEWorker thread pool: Available worker count (availableWorkers), async
> suspension (suspendedOnMeta), and sync blocking (blockedOnMeta) when
> hbase:meta is unavailable during SCP meta-reassignment.
> * Keyspace infrastructure: Per-region key range (regionKeyRange) mapping each
> region to a [startKey, endKey) interval or NoRange for unused identifiers.
> DeployedRegions tile [0, MaxKey) at Init; unused identifiers are available
> for future split/merge.
> * Parent-child procedure framework: Per-region parentProc record tracking
> parent procedure type and step (split or merge), persists across child TRSP
> lifecycles and survives master crash.
> * Table identity infrastructure: Per-region regionTable tracking which table
> a region belongs to, with guard predicates (NoTableExclusiveLock,
> TableLockFree) and an invariant (TableLockExclusivity) to ensure exclusive
> table-level locks prevent concurrent region-level operations on the same
> table.
> * Split procedure: Modeled through the parent-child framework.
> * Merge procedure: Modeled through the parent-child framework.
> * CreateTable procedure: Modeled through the parent-child framework.
> * DeleteTable procedure: DeleteTableProcedure modeled with two actions:
> DeleteTablePrepare(t) acquires exclusive table lock on all regions and
> prepares the delete; DeleteTableDone(t) atomically executes it.
> * TruncateTable procedure: TruncateTableProcedure modeled with four actions:
> TruncatePrepare(t) acquires exclusive table lock on all regions and prepares
> the operation; TruncateDeleteMeta(t,r) deletes the old regions from
> hbase:meta; TruncateCreateMeta(t, r) creates a new region and adds it to
> hbase:meta; and TruncateDone(t) completes the operation.
> The specification defines 34 safety invariants verified at every reachable
> state, including the NoDoubleAssignment (no region writable on two servers),
> MetaConsistency (persistent and in-memory state agree), FencingOrder (WALs
> fenced before reassignment), NoLostRegions (no region stuck without a
> procedure after crash recovery), NoPEWorkerDeadlock (thread pool exhaustion
> detection), KeyspaceCoverage (all keys covered by exactly one live region),
> SplitMergeMutualExclusion (split daughters and merged regions cannot have
> active parent procedures), AtMostOneCarryingMeta (at most one server carrying
> meta), MergeCompleteness (completed merge has cleaned-up targets),
> TableLockExclusivity (exclusive table locks prevent concurrent region ops on
> the same table), and others specific to how things are modeled in the spec.
> Three liveness properties verify temporal guarantees: MetaEventuallyAssigned
> (meta eventually reassigned after crash), OfflineEventuallyOpen
> (ASSIGN-bearing OFFLINE region eventually opens), and SCPEventuallyDone
> (started SCP eventually completes). Two action constraints enforce transition
> validity and SCP monotonicity. One state constraint (SplitMergeConstraint)
> bounds concurrent split/merge procedures for TLC tractability.
> The model checker runs in two tiers: fast exhaustive verification at 3
> regions / 2 servers (1 deployed + 2 unused for split daughters), and deep
> random simulation at 9r/3s with extended retries and merge enabled.
> Configurable "quirk" flags allow toggling known implementation bugs to
> correctly adhere to implementation semantics, reproduce failures and validate
> fixes.
> The model is incomplete but under active development and can found at:
> https://github.com/apurtell/hbase/tree/WORK-architecture/src/main/spec
--
This message was sent by Atlassian Jira
(v8.20.10#820010)