GitHub user krishvishal created a discussion: Simulator "Kill Node" Operation -
Design Discussion
## Problem
The Iggy simulator currently has no mechanism to kill or crash a node. All
replicas are created at startup and live for the entire simulation. To test
consensus correctness under node failures (leader crashes, minority/majority
failures, recovery), we need a `replica_crash` / `replica_restart` operation.
The simulator currently uses a shared MemBus (`Arc<MemBus>`) as a simple FIFO
queue (`VecDeque<Envelope>`) for all inter-node communication. There is no
concept of a node being "up" or "down": `send_to_replica()` enqueues regardless
of the target's state, and `step()` dispatches to any replica.
To implement kill/restart, we need message filtering for dead nodes.
The `PacketSimulator` requires `&mut self` to submit and step packets, but
outbound messages originate deep inside consensus:
```
Simulator::step()
  => replica.on_message(msg)
    => VsrConsensus::on_request(msg)
      => self.message_bus().send_to_replica(next, prepare) // HERE
```
`send_to_replica` is called on `&self` (the `MessageBus` trait methods take
`&self`). This creates a problem: the `Simulator` owns both the `Network` and
the `Replicas`, and dispatching a message to a replica triggers outbound sends
that need `&mut Network` while we're already borrowing `&self.replicas[id]`.
## Proposal
MemBus as outbox-only:
MemBus becomes a per-replica outbox. Consensus still calls `send_to_replica()` /
`send_to_client()`, but the bus no longer acts as the delivery queue; it only
stages the messages. The Simulator drains each outbox after dispatching to a
replica and feeds the messages into `network.submit()`.
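
A minimal sketch of what an outbox-only bus could look like. The `Envelope` fields, the `Outbox` type, and the single-method `MessageBus` trait are simplified stand-ins invented for illustration, not the actual Iggy definitions. Because the discussion says sends take `&self`, the sketch stages messages through a `RefCell`; how the real MemBus stores its queue is not stated here, so treat that as an assumption.

```rust
use std::cell::RefCell;

/// Hypothetical message envelope; the real `Envelope` is not shown in the discussion.
#[derive(Debug, Clone, PartialEq)]
pub struct Envelope {
    pub from: u8,
    pub to: u8,
    pub payload: Vec<u8>,
}

/// Simplified stand-in for the `MessageBus` trait: sends take `&self`.
pub trait MessageBus {
    fn send_to_replica(&self, to: u8, payload: Vec<u8>);
}

/// Per-replica outbox: messages are staged here, never delivered directly.
pub struct Outbox {
    owner: u8,
    // Assumption: interior mutability so a `&self` send can push.
    staged: RefCell<Vec<Envelope>>,
}

impl Outbox {
    pub fn new(owner: u8) -> Self {
        Self { owner, staged: RefCell::new(Vec::new()) }
    }

    /// Hands all staged messages to the caller and leaves the outbox empty.
    pub fn drain(&self) -> Vec<Envelope> {
        std::mem::take(&mut *self.staged.borrow_mut())
    }
}

impl MessageBus for Outbox {
    fn send_to_replica(&self, to: u8, payload: Vec<u8>) {
        // Stage only; the simulator decides later whether this reaches the network.
        self.staged
            .borrow_mut()
            .push(Envelope { from: self.owner, to, payload });
    }
}
```

With this shape the consensus code keeps calling `send_to_replica()` unchanged, and only the simulator decides when (or whether) a staged message enters the network.
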
```
Simulator tick loop:
  1. packets = network.step()              // &mut network
                                           // network borrow ends
  2. for packet in packets:                // &self.replicas[id]
       replica.on_message(packet)
       replica.process_loopback()
         // consensus calls bus.send_to_replica() → stages in outbox
       outbox[id].drain()                  // collect staged messages
         → network.submit(from, to, msg)   // &mut network (replica borrow ended)
  3. network.recycle_buffer(packets)       // &mut network
  4. network.tick()                        // &mut network
```
Each phase borrows either replicas or network, never both simultaneously. The
borrow checker is satisfied without `Arc<Mutex<>>`, `RefCell`, or dynamic
dispatch.
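
The phase separation can be sketched in compilable Rust as follows. Everything here is a toy model under stated assumptions: `Packet`, `Network`, `Replica`, and `Simulator` are hypothetical simplifications, the replica merely forwards each op to the next replica id, and `recycle_buffer()` / `tick()` are omitted. The toy replica keeps its outbox in a `RefCell` only so that `on_message` can take `&self`; that is a shortcut of this sketch, not something the proposal prescribes.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
pub struct Packet {
    pub from: usize,
    pub to: usize,
    pub op: u64,
}

/// Hypothetical network: a FIFO that needs `&mut self` to submit and step.
#[derive(Default)]
pub struct Network {
    in_flight: VecDeque<Packet>,
}

impl Network {
    pub fn submit(&mut self, packet: Packet) {
        self.in_flight.push_back(packet);
    }

    /// Returns every packet currently in flight (no latency in this toy model).
    pub fn step(&mut self) -> Vec<Packet> {
        self.in_flight.drain(..).collect()
    }
}

/// Toy replica: records the op and stages a forward to the next replica id.
pub struct Replica {
    pub id: usize,
    pub replica_count: usize,
    pub outbox: RefCell<Vec<Packet>>,
    pub delivered: RefCell<Vec<u64>>,
}

impl Replica {
    pub fn on_message(&self, packet: &Packet) {
        self.delivered.borrow_mut().push(packet.op);
        let next = self.id + 1;
        if next < self.replica_count {
            let out = Packet { from: self.id, to: next, op: packet.op };
            self.outbox.borrow_mut().push(out); // staged, not delivered
        }
    }
}

pub struct Simulator {
    pub network: Network,
    pub replicas: Vec<Replica>,
}

impl Simulator {
    pub fn new(replica_count: usize) -> Self {
        let replicas = (0..replica_count)
            .map(|id| Replica {
                id,
                replica_count,
                outbox: RefCell::new(Vec::new()),
                delivered: RefCell::new(Vec::new()),
            })
            .collect();
        Self { network: Network::default(), replicas }
    }

    pub fn step(&mut self) {
        // Phase 1: only the network is borrowed (mutably).
        let packets = self.network.step();
        for packet in &packets {
            // Phase 2a: shared borrow of one replica; sends are staged.
            let replica = &self.replicas[packet.to];
            replica.on_message(packet);
            let staged = std::mem::take(&mut *replica.outbox.borrow_mut());
            // Phase 2b: the replica is no longer used; the network is borrowed mutably.
            for out in staged {
                self.network.submit(out);
            }
        }
    }
}
```

Injecting one packet for replica 0 and calling `step()` repeatedly moves the op one hop per step, since anything staged during a step only becomes deliverable on the next `network.step()`.
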
## Behavior of `replica_crash`
| Message type | Behavior | Rationale |
| --- | --- | --- |
| FROM crashed node, already in network | Delivered to live targets | Already on the wire |
| TO crashed node, in network | Dropped at delivery time (link filter = BLOCK_ALL) | Connection refused |
| FROM crashed node, still in outbox | Discarded (`outbox.drain()` in `replica_crash`) | Never reached the wire |
| TO restarted node, queued pre-restart | Delivered after links re-enabled | Consensus rejects stale via view/op checks |

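
The four rows above could be modelled roughly as below. This is a sketch under assumptions, not the proposed implementation: the per-node `inbound_blocked` flag stands in for the `BLOCK_ALL` link filter, packets carry a hypothetical `deliver_at` tick to model latency, and outboxes are reduced to lists of target ids.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Packet {
    pub from: usize,
    pub to: usize,
    pub deliver_at: u64,
}

pub struct Network {
    pub now: u64,
    pub in_flight: Vec<Packet>,
    /// `inbound_blocked[id] == true` models a BLOCK_ALL filter on links into `id`.
    pub inbound_blocked: Vec<bool>,
}

impl Network {
    pub fn new(replica_count: usize) -> Self {
        Self { now: 0, in_flight: Vec::new(), inbound_blocked: vec![false; replica_count] }
    }

    pub fn submit(&mut self, from: usize, to: usize, latency: u64) {
        let deliver_at = self.now + latency;
        self.in_flight.push(Packet { from, to, deliver_at });
    }

    /// Advances one tick. Due packets addressed to a blocked node are dropped;
    /// packets that are not yet due stay in flight untouched.
    pub fn step(&mut self) -> Vec<Packet> {
        self.now += 1;
        let now = self.now;
        let (due, pending): (Vec<Packet>, Vec<Packet>) =
            std::mem::take(&mut self.in_flight)
                .into_iter()
                .partition(|p| p.deliver_at <= now);
        self.in_flight = pending;
        due.into_iter()
            .filter(|p| !self.inbound_blocked[p.to])
            .collect()
    }
}

pub struct Simulator {
    pub network: Network,
    /// Staged target ids per replica (simplified outbox).
    pub outboxes: Vec<Vec<usize>>,
}

impl Simulator {
    pub fn new(replica_count: usize) -> Self {
        Self {
            network: Network::new(replica_count),
            outboxes: vec![Vec::new(); replica_count],
        }
    }

    pub fn replica_crash(&mut self, id: usize) {
        // Staged messages never reached the wire, so they are discarded.
        self.outboxes[id].clear();
        // Messages already sent by `id` stay in flight; deliveries to `id` are dropped.
        self.network.inbound_blocked[id] = true;
    }

    pub fn replica_restart(&mut self, id: usize) {
        // Re-enable links; anything still in flight for `id` is delivered when due.
        self.network.inbound_blocked[id] = false;
    }
}
```

In this model a packet addressed to the crashed node is lost only if it comes due while the node is down; one that is still in flight at restart is delivered afterwards, which matches the last table row and leaves stale-message rejection to consensus.
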
Please let me know your thoughts about this proposal.
GitHub link: https://github.com/apache/iggy/discussions/3017