GitHub user hubcio added a comment to the discussion: Simulator "Kill Node" 
Operation - Design Discussion

i went through the simulator code in detail after reading this proposal. the 
outbox-only model is the right call - it matches how a real TCP-based message 
bus actually works (or will work when I implement it, haha... send = stage 
outbound, network = deliver), and it solves the `&mut Network` vs `&self 
Replica` borrow conflict cleanly without `Arc<Mutex>`. @numinnex's elaboration 
is exactly the architecture we're building toward: same `IggyShard` in prod and 
sim, difference is only MessageBus impl + storage impl + runtime driver.

the crash semantics table is mostly correct. "FROM crashed, in network -> 
delivered" is right (already on the wire, sender can't recall). "FROM crashed, 
in outbox -> discarded" is right (never left the process). "TO crashed, in 
network -> dropped" is right - `BLOCK_ALL` link filters are checked during 
`PacketSimulator::step()` at delivery time, so setting `BLOCK_ALL` before 
stepping drops all packets to the crashed node correctly.
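the four rows can be summarized in a small decision function. this is a sketch only - `Fate`, `Location`, and `packet_fate` are hypothetical names, not types from the simulator code:

```rust
// Hypothetical enums summarizing the crash-semantics table;
// illustrative names, not types from the simulator code.
#[derive(Debug, PartialEq)]
enum Fate {
    Delivered,
    Discarded,
    Dropped,
}

enum Location {
    Outbox,  // staged by send(), never handed to the network
    Network, // already submitted, in flight
}

/// What happens to a packet when one endpoint has crashed.
fn packet_fate(from_crashed: bool, to_crashed: bool, loc: Location) -> Fate {
    match loc {
        // never left the crashed process
        Location::Outbox if from_crashed => Fate::Discarded,
        // BLOCK_ALL link filter drops it during PacketSimulator::step()
        Location::Network if to_crashed => Fate::Dropped,
        // already on the wire: the sender can't recall it
        _ => Fate::Delivered,
    }
}
```

note the ordering: if both endpoints are crashed and the packet is in flight, the receiver-side `BLOCK_ALL` drop wins.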

the part that needs revision is row 4: "TO restarted node -> delivered, 
consensus rejects stale via view/op checks." this doesn't hold because 
consensus recovery isn't implemented yet. `VsrConsensus::init()` at 
`impls.rs:523` starts at `(view=0, op=0, commit=0)` and immediately sets 
`Status::Normal`. `is_syncing()` is hardcoded to return `false` at 
`impls.rs:569`. `SimJournal` uses `MemStorage` which is `RefCell<Vec<u8>>` - 
drop the replica, lose everything. there is no superblock, no journal replay, 
no state reconstruction path.

what actually happens to a restarted replica depends on whether the cluster has 
changed views since it went down, but both paths are broken.

if the cluster advanced views (say view=3 while the replica comes up at 
view=0): `replicate_preflight` at `plane_helpers.rs:132` checks `header.view > 
consensus.view()` - a prepare from view=3 gets rejected because 3 > 0. the 
replica is dead-on-arrival, rejecting everything. it can learn the current view 
through `handle_start_view_change` at `impls.rs:847` (view advance and 
ViewChange status set at lines 863-865), but that alone doesn't help - 
`replicate_preflight:128` rejects prepares when status != Normal. the replica 
needs a full StartView to resume. `handle_start_view` at `impls.rs:1050` 
transitions to Normal (line 1075), syncs the op counter 
(`sequencer.set_sequence(msg_op)` at line 1083), and clears the pipeline 
(`pipeline.borrow_mut().clear()` at line 1080). but then it does something 
worse: lines 1096-1103 send PrepareOKs for ops `(msg_commit+1)..=msg_op` that 
were never journaled - the pipeline was just wiped and the journal is empty. 
the replica is lying about having replicated data it doesn't have. if its PrepareOK tips a commit quorum, the cluster
commits an op that only f replicas actually hold - one more failure away from 
data loss.
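the arithmetic behind "one more failure away", worked for an illustrative 5-replica cluster (n = 2f+1 = 5, so f = 2 and commit quorum = f+1 = 3):

```rust
// Illustrative quorum arithmetic for the phantom-PrepareOK hazard.
// n = 2f + 1 replicas; commit quorum is a majority, f + 1.
fn max_faults(n: usize) -> usize {
    (n - 1) / 2 // f
}

fn commit_quorum(n: usize) -> usize {
    n / 2 + 1 // f + 1
}

/// Real copies of a "committed" op when `phantom` of the quorum acks
/// were lies (PrepareOKs sent for ops that were never journaled).
fn real_copies(n: usize, phantom: usize) -> usize {
    commit_quorum(n) - phantom
}
```

with one phantom ack, only f = 2 replicas actually hold the committed op - exactly the number of failures the cluster is supposed to tolerate, so f further failures lose committed data.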

if the cluster is still in the same view (view=0, no view changes occurred, 
just ops advanced to 51): this is the more direct failure. prepare(view=0, 
op=52, commit=50) passes the view check (0 > 0 is false), 
`advance_commit_number(header.commit)` at `replicate_preflight:137` jumps the 
commit counter to 50 - past 50 ops that were never journaled, and then 
`debug_assert_eq!(header.op, current_op + 1)` at `iggy_partitions.rs:382` fires 
because 52 != 1. debug builds panic, release builds silently create a 50-op gap 
in the log. the phantom commits corrupt DVC quorum decisions in any subsequent 
view change.
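a simplified reconstruction of that path (not the actual iggy code - the commit jump and the op continuity check collapsed into one function, with the `debug_assert_eq!` rewritten as a checked error so the ordering is visible):

```rust
// Simplified reconstruction of the same-view failure path: the commit
// counter jumps past unjournaled ops *before* the op continuity check
// fires. Names and signature are illustrative.
fn apply_prepare(
    current_op: u64,
    current_commit: u64,
    header_op: u64,
    header_commit: u64,
) -> Result<(u64, u64), String> {
    // advance_commit_number: phantom jump past ops this replica never held
    let new_commit = current_commit.max(header_commit);
    // the debug_assert_eq! at iggy_partitions.rs:382, as a checked error
    if header_op != current_op + 1 {
        return Err(format!(
            "op gap: commit already jumped to {}, expected op {}, got {}",
            new_commit,
            current_op + 1,
            header_op
        ));
    }
    Ok((header_op, new_commit))
}
```

the key point: even when the gap is detected, `new_commit` has already moved - in the real code the commit advance happens in `replicate_preflight` before the partition-level assert ever runs.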

either way, a restarted replica with empty journal cannot safely participate.

this isn't a flaw in your proposal - it's a known gap in our consensus layer. 
`replica_crash` (without restart) is feasible right now and independently 
useful. a crashed node that never comes back is equivalent to a permanent 
partition - we can test view change behavior, quorum loss, leader election 
under f failures, all without restart support.

for `replica_restart` to work we need at minimum: (1) a `SimSuperblock` struct 
held by the simulator (not the replica) that persists `(view, log_view, commit, 
op)` across crashes - the simulator plays the role of the disk controller here, 
(2) journal survival - either the simulator holds `HashMap<u8, SimJournal>` and 
swaps them in/out, or we keep the replica object alive and just reset volatile 
consensus state, (3) `VsrConsensus::init_from_durable()` that restores state 
and enters `Status::Recovering` instead of blindly going to `Normal` - the 
replica must refuse to participate until caught up.
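a sketch of prerequisite (1) plus the restore hook for (3) - all names here are proposals, nothing below exists in the codebase today:

```rust
// Proposed durable consensus state owned by the simulator, surviving
// replica crashes. Hypothetical names; nothing here exists today.
use std::collections::HashMap;

#[derive(Clone, Copy, Default)]
struct SimSuperblock {
    view: u64,
    log_view: u64,
    commit: u64,
    op: u64,
}

/// The simulator plays the disk controller: superblocks (and, for
/// prerequisite (2), journals) live here keyed by replica id, not
/// inside the replica object that gets dropped on crash.
#[derive(Default)]
struct DurableState {
    superblocks: HashMap<u8, SimSuperblock>,
}

impl DurableState {
    fn checkpoint(&mut self, id: u8, sb: SimSuperblock) {
        self.superblocks.insert(id, sb);
    }

    /// On restart: feed this into an init_from_durable() that enters
    /// Status::Recovering, never straight to Status::Normal.
    fn restore(&self, id: u8) -> SimSuperblock {
        self.superblocks.get(&id).copied().unwrap_or_default()
    }
}
```

keeping the map in the simulator (rather than the replica) is what makes crash-then-restart expressible at all: dropping the `Replica` no longer drops the "disk".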

i think the right approach is phased PRs:

**PR 1: wire Network into Simulator.** per-replica outboxes replacing the 
shared `MemBus` (rename to `SimOutbox` or similar), phase-separated stepping 
where each phase borrows exactly one resource (drain outboxes -> 
`network.submit()` -> `network.step()` -> dispatch to replicas -> 
`network.tick()`), `step()` becomes `&mut self`. also clean up the dead inbox 
channel at `replica.rs:88` - the sender is dropped immediately, receiver is 
never polled. the `Network` wrapper already has a clean API with `submit()`, 
`step()`, `tick()`, and link filter management - it's ready to be plugged in. 
this is ~300-400 lines and independently valuable for deterministic fault 
injection testing.
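the phase-separated `step()` can be sketched like this - `Packet`, `Network`, and `Replica` are stand-ins (the real APIs differ), the point is purely that each phase borrows one resource at a time, so plain `&mut self` works with no `Arc<Mutex>`:

```rust
// Sketch of PR 1's phase-separated step(). Stand-in types; the point
// is the borrow discipline: each phase touches exactly one resource.
struct Packet { to: u8, payload: Vec<u8> }

#[derive(Default)]
struct Network { in_flight: Vec<Packet> }

impl Network {
    fn submit(&mut self, p: Packet) { self.in_flight.push(p); }
    // stand-in for PacketSimulator::step(): link filters like BLOCK_ALL
    // would be checked here before a packet is released for delivery
    fn step(&mut self) -> Vec<Packet> { std::mem::take(&mut self.in_flight) }
    fn tick(&mut self) {}
}

#[derive(Default)]
struct Replica { outbox: Vec<Packet>, delivered: Vec<Vec<u8>> }

impl Replica {
    fn on_message(&mut self, payload: Vec<u8>) { self.delivered.push(payload); }
}

struct Simulator { network: Network, replicas: Vec<Replica> }

impl Simulator {
    fn step(&mut self) {
        // phase 1: drain per-replica outboxes into the network
        // (disjoint field borrows: replicas and network never conflict)
        for r in &mut self.replicas {
            for p in r.outbox.drain(..) {
                self.network.submit(p);
            }
        }
        // phase 2: advance the network, collecting deliverable packets
        let deliveries = self.network.step();
        // phase 3: dispatch deliveries to replicas
        for p in deliveries {
            if let Some(r) = self.replicas.get_mut(p.to as usize) {
                r.on_message(p.payload);
            }
        }
        // phase 4: advance network time
        self.network.tick();
    }
}
```

because outbox drain and dispatch are separate phases, a message a replica sends in step N is never delivered in step N - one step of latency minimum, which is also what a real network gives you.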

**PR 2: add `crash_replica` only (no restart).** `crash_replica(id)` on 
Simulator - `BLOCK_ALL` all links, discard outbox, track crashed replicas in a 
`HashSet<u8>` (not `Option<Replica>` - that pollutes every iterator with 
`.flatten()`), skip dispatch for crashed replicas in the step loop. ~150-200 
lines.
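the shape of `crash_replica` under that design, again with hypothetical names and the link-filter call elided since it depends on the `Network` API:

```rust
// Sketch of PR 2's crash_replica on a pared-down Simulator.
// Hypothetical names; the BLOCK_ALL call is elided (Network API detail).
use std::collections::HashSet;

#[derive(Default)]
struct Simulator {
    crashed: HashSet<u8>,          // not Option<Replica> per slot
    outboxes: Vec<Vec<Vec<u8>>>,   // per-replica staged packets (as bytes)
}

impl Simulator {
    fn crash_replica(&mut self, id: u8) {
        // 1. here: set BLOCK_ALL on every link touching `id`
        //    so in-flight packets TO it get dropped at delivery time
        // 2. discard anything staged but never handed to the network
        if let Some(outbox) = self.outboxes.get_mut(id as usize) {
            outbox.clear();
        }
        // 3. mark crashed; the step loop skips dispatch for these ids
        self.crashed.insert(id);
    }

    fn is_crashed(&self, id: u8) -> bool {
        self.crashed.contains(&id)
    }
}
```

the `HashSet` keeps `for r in &mut self.replicas` loops untouched - crash status is a side lookup, not a hole in the replica vector.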

after that, the durability prerequisites and recovery stub are internal work we 
need to do in the consensus layer before `restart_replica` can be implemented. 
once that lands, a PR 3 for restart becomes straightforward.

on the "Paused" state - i'd defer it. it's functionally equivalent to a full 
network partition (`BLOCK_ALL` on all links) which `PacketSimulator` already 
supports, and the mid-commit semantics (paused after journal append but before 
PrepareOk) need careful specification. Up/Down is enough for the first 
iteration.

PR 1 is a great starting point. key files: `core/simulator/src/bus.rs` 
(refactor into per-replica outbox), `core/simulator/src/lib.rs` 
(phase-separated step, integrate Network), `core/simulator/src/replica.rs` 
(remove dead inbox, accept per-replica outbox).


GitHub link: 
https://github.com/apache/iggy/discussions/3017#discussioncomment-16307957
