GitHub user hubcio added a comment to the discussion: Simulator "Kill Node" Operation - Design Discussion
i went through the simulator code in detail after reading this proposal. the outbox-only model is the right call - it matches how a real TCP-based message bus actually works (or will work when I implement it, haha... send = stage outbound, network = deliver), and it solves the `&mut Network` vs `&self Replica` borrow conflict cleanly without `Arc<Mutex>`. @numinnex's elaboration is exactly the architecture we're building toward: the same `IggyShard` in prod and sim, differing only in the MessageBus impl, the storage impl, and the runtime driver.

the crash semantics table is mostly correct:

- "FROM crashed, in network -> delivered" is right (already on the wire, the sender can't recall it).
- "FROM crashed, in outbox -> discarded" is right (never left the process).
- "TO crashed, in network -> dropped" is right - `BLOCK_ALL` link filters are checked during `PacketSimulator::step()` at delivery time, so setting `BLOCK_ALL` before stepping drops all packets to the crashed node correctly.

the part that needs revision is row 4: "TO restarted node -> delivered, consensus rejects stale via view/op checks." this doesn't hold because consensus recovery isn't implemented yet. `VsrConsensus::init()` at `impls.rs:523` starts at `(view=0, op=0, commit=0)` and immediately sets `Status::Normal`. `is_syncing()` is hardcoded to return `false` at `impls.rs:569`. `SimJournal` uses `MemStorage`, which is a `RefCell<Vec<u8>>` - drop the replica, lose everything. there is no superblock, no journal replay, no state reconstruction path. what actually happens to a restarted replica depends on whether the cluster has changed views since it went down, but both paths are broken.

if the cluster advanced views (say to view=3 while the replica comes up at view=0): `replicate_preflight` at `plane_helpers.rs:132` checks `header.view > consensus.view()` - a prepare from view=3 gets rejected because 3 > 0. the replica is dead-on-arrival, rejecting everything.
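the rejection path above can be sketched with a toy model - the names, fields, and signatures here are simplified stand-ins, not the real `replicate_preflight` in `plane_helpers.rs`:

```rust
// Hypothetical simplified model of the preflight checks described above.
#[allow(dead_code)]
#[derive(PartialEq, Debug, Clone, Copy)]
enum Status { Normal, ViewChange, Recovering }

struct Consensus { view: u64, status: Status }

fn replicate_preflight(c: &Consensus, header_view: u64) -> Result<(), &'static str> {
    // Mirrors replicate_preflight:128 - prepares are rejected outside Normal.
    if c.status != Status::Normal {
        return Err("not in Normal status");
    }
    // Mirrors replicate_preflight:132 - a restarted replica boots at view=0,
    // so a prepare from view=3 trips this check and is rejected.
    if header_view > c.view {
        return Err("prepare from a future view");
    }
    Ok(())
}

fn main() {
    // Restarted replica at view=0; the cluster has advanced to view=3.
    let restarted = Consensus { view: 0, status: Status::Normal };
    assert!(replicate_preflight(&restarted, 3).is_err());

    // A replica that actually knows the current view accepts the prepare.
    let caught_up = Consensus { view: 3, status: Status::Normal };
    assert!(replicate_preflight(&caught_up, 3).is_ok());
}
```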
it can learn the current view through `handle_start_view_change` at `impls.rs:847` (the view advance and `ViewChange` status are set at lines 863-865), but that alone doesn't help - `replicate_preflight:128` rejects prepares when status != Normal. the replica needs a full StartView to resume. `handle_start_view` at `impls.rs:1050` transitions to Normal (line 1075), syncs the op counter (`sequencer.set_sequence(msg_op)` at line 1083), and clears the pipeline (`pipeline.borrow_mut().clear()` at line 1080). but then it does something worse: lines 1096-1103 send PrepareOKs for ops `(msg_commit+1)..=msg_op` that were never journaled - the pipeline was just wiped and the journal is empty. the replica is lying about having replicated data it doesn't have. if its PrepareOK tips a commit quorum, the cluster commits an op that only f replicas actually hold - one more failure away from data loss.

if the cluster is still in the same view (view=0, no view changes occurred, just ops advanced to 51): this is the more direct failure. prepare(view=0, op=52, commit=50) passes the view check (0 > 0 is false), `advance_commit_number(header.commit)` at `replicate_preflight:137` jumps the commit counter to 50 - past 50 ops that were never journaled - and then `debug_assert_eq!(header.op, current_op + 1)` at `iggy_partitions.rs:382` fires because 52 != 1. debug builds panic; release builds silently create a 50-op gap in the log. the phantom commits corrupt DVC quorum decisions in any subsequent view change.

either way, a restarted replica with an empty journal cannot safely participate. this isn't a flaw in your proposal - it's a known gap in our consensus layer. `replica_crash` (without restart) is feasible right now and independently useful. a crashed node that never comes back is equivalent to a permanent partition - we can test view change behavior, quorum loss, and leader election under f failures, all without restart support.
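the same-view gap can also be shown with a toy model - again hypothetical types, not the real `iggy_partitions.rs` code, with the `debug_assert_eq!` replaced by an explicit error so the scenario is testable in both build profiles:

```rust
// Hypothetical simplified model of the sequencing invariant described above.
struct Partition { current_op: u64, commit: u64 }

impl Partition {
    // Stand-in for the check at iggy_partitions.rs:382: ops must arrive
    // in sequence with no gaps. In the real code this is a
    // debug_assert_eq!, so debug builds panic and release builds
    // silently leave a gap in the log.
    fn apply_prepare(&mut self, op: u64, commit: u64) -> Result<(), String> {
        if op != self.current_op + 1 {
            return Err(format!("gap: got op {}, expected {}", op, self.current_op + 1));
        }
        self.current_op = op;
        self.commit = commit;
        Ok(())
    }
}

fn main() {
    // Restarted replica with an empty journal: current_op = 0.
    let mut p = Partition { current_op: 0, commit: 0 };

    // Cluster is already at op 51; the next prepare is (op=52, commit=50).
    assert!(p.apply_prepare(52, 50).is_err()); // 52 != 0 + 1 -> invariant fires
    assert_eq!(p.commit, 0); // nothing was committed

    // The only safe input for this replica is op 1.
    assert!(p.apply_prepare(1, 0).is_ok());
}
```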
for `replica_restart` to work we need at minimum:

1. a `SimSuperblock` struct held by the simulator (not the replica) that persists `(view, log_view, commit, op)` across crashes - the simulator plays the role of the disk controller here
2. journal survival - either the simulator holds `HashMap<u8, SimJournal>` and swaps them in/out, or we keep the replica object alive and just reset volatile consensus state
3. a `VsrConsensus::init_from_durable()` that restores state and enters `Status::Recovering` instead of blindly going to `Normal` - the replica must refuse to participate until caught up

i think the right approach is phased PRs:

**PR 1: wire Network into Simulator.** per-replica outboxes replacing the shared `MemBus` (rename to `SimOutbox` or similar); phase-separated stepping where each phase borrows exactly one resource (drain outboxes -> `network.submit()` -> `network.step()` -> dispatch to replicas -> `network.tick()`); `step()` becomes `&mut self`. also clean up the dead inbox channel at `replica.rs:88` - the sender is dropped immediately and the receiver is never polled. the `Network` wrapper already has a clean API with `submit()`, `step()`, `tick()`, and link filter management - it's ready to be plugged in. this is ~300-400 lines and independently valuable for deterministic fault injection testing.

**PR 2: add `crash_replica` only (no restart).** `crash_replica(id)` on Simulator: `BLOCK_ALL` all links, discard the outbox, track crashed replicas in a `HashSet<u8>` (not `Option<Replica>` - that pollutes every iterator with `.flatten()`), and skip dispatch for crashed replicas in the step loop. ~150-200 lines.

after that, the durability prerequisites and recovery stub are internal work we need to do in the consensus layer before `restart_replica` can be implemented. once that lands, a PR 3 for restart becomes straightforward.

on the "Paused" state - i'd defer it.
it's functionally equivalent to a full network partition (`BLOCK_ALL` on all links), which `PacketSimulator` already supports, and the mid-commit semantics (paused after journal append but before PrepareOk) need careful specification. Up/Down is enough for the first iteration.

PR 1 is a great starting point. key files: `core/simulator/src/bus.rs` (refactor into per-replica outbox), `core/simulator/src/lib.rs` (phase-separated step, integrate Network), `core/simulator/src/replica.rs` (remove dead inbox, accept per-replica outbox).

GitHub link: https://github.com/apache/iggy/discussions/3017#discussioncomment-16307957
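to make the PR 1 step phases and the PR 2 `crash_replica` bookkeeping concrete, here's a rough sketch - every type and method name below is a simplified stand-in for the real simulator code, not a definitive implementation:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical stand-ins for the real Packet/Network/Replica types.
#[allow(dead_code)]
struct Packet { from: u8, to: u8, payload: Vec<u8> }

#[derive(Default)]
struct Network { in_flight: VecDeque<Packet> }
impl Network {
    fn submit(&mut self, p: Packet) { self.in_flight.push_back(p); }
    // The real step() applies link filters (BLOCK_ALL etc.) at delivery time.
    fn step(&mut self) -> Vec<Packet> { self.in_flight.drain(..).collect() }
    fn tick(&mut self) {}
}

#[derive(Default)]
struct Replica { outbox: Vec<Packet> }
impl Replica {
    fn drain_outbox(&mut self) -> Vec<Packet> { std::mem::take(&mut self.outbox) }
    fn dispatch(&mut self, _p: Packet) { /* hand the packet to the replica */ }
}

struct Simulator {
    replicas: HashMap<u8, Replica>,
    network: Network,
    crashed: HashSet<u8>, // PR 2: a HashSet, not Option<Replica>
}

impl Simulator {
    // Each phase borrows exactly one resource, so there is no
    // &mut Network vs &self Replica conflict and no Arc<Mutex>.
    fn step(&mut self) {
        // Phase 1: drain per-replica outboxes into the network.
        for (id, r) in self.replicas.iter_mut() {
            if self.crashed.contains(id) { r.outbox.clear(); continue; }
            for p in r.drain_outbox() { self.network.submit(p); }
        }
        // Phase 2: deliver; packets to crashed replicas are dropped here.
        for p in self.network.step() {
            if self.crashed.contains(&p.to) { continue; }
            if let Some(r) = self.replicas.get_mut(&p.to) { r.dispatch(p); }
        }
        // Phase 3: advance simulated time.
        self.network.tick();
    }

    fn crash_replica(&mut self, id: u8) {
        self.crashed.insert(id);
        if let Some(r) = self.replicas.get_mut(&id) { r.outbox.clear(); }
    }
}

fn main() {
    let mut sim = Simulator {
        replicas: (0u8..3).map(|i| (i, Replica::default())).collect(),
        network: Network::default(),
        crashed: HashSet::new(),
    };
    sim.replicas.get_mut(&0).unwrap().outbox
        .push(Packet { from: 0, to: 1, payload: vec![] });
    sim.crash_replica(1);
    sim.step(); // the packet to crashed replica 1 is dropped at delivery
    assert!(sim.crashed.contains(&1));
    assert!(sim.network.in_flight.is_empty());
}
```

the crashed-node case never removes the `Replica` from the map, which keeps the door open for the later restart PR once the durability prerequisites exist.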
