GitHub user krishvishal created a discussion: Simulator "Kill Node" Operation - 
Design Discussion

## Problem

The Iggy simulator currently has no mechanism to kill or crash a node. All replicas are created at startup and live for the entire simulation. To test consensus correctness under node failures (leader crashes, minority/majority failures, recovery), we need a `replica_crash` / `replica_restart` operation.

The simulator currently uses a shared MemBus (`Arc<MemBus>`) as a simple FIFO queue (`VecDeque<Envelope>`) for all inter-node communication. There is no concept of a node being "up" or "down": `send_to_replica()` enqueues regardless, and `step()` dispatches to any replica.

To implement kill/restart, we need message filtering for dead nodes.

The `PacketSimulator` requires `&mut self` to submit and step packets, but outbound messages originate deep inside consensus:
                  
```
Simulator::step()
  => replica.on_message(msg)
    => VsrConsensus::on_request(msg)
      => self.message_bus().send_to_replica(next, prepare)  // HERE
```

`send_to_replica` is called on `&self` (the `MessageBus` trait methods take `&self`). This creates a borrowing problem: the `Simulator` owns both the `Network` and the `Replicas`, and dispatching a message to a replica triggers outbound sends that need `&mut Network` while we are still borrowing `&self.replicas[id]`.

## Proposal

MemBus as outbox-only:

MemBus becomes a per-replica outbox. Consensus still calls `send_to_replica()` / `send_to_client()`, but MemBus is no longer the delivery queue; it only stages the messages. The Simulator drains each outbox after dispatching to a replica and feeds the messages into `network.submit()`.
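As a rough illustration of the outbox idea, here is a minimal sketch. The `Outbox`, `Target`, and `Message` types are hypothetical stand-ins, not the actual Iggy types. Because the send methods take `&self`, the sketch assumes some interior mutability inside the outbox itself; the `RefCell` here only stands in for whatever the existing MemBus uses and does not wrap the network.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

/// Hypothetical stand-in for the simulator's message type.
#[derive(Debug, Clone, PartialEq)]
struct Message(String);

/// Where a staged message should eventually be delivered.
#[derive(Debug, Clone, PartialEq)]
enum Target {
    Replica(u8),
    Client(u128),
}

/// Per-replica outbox: sends are staged here instead of being delivered.
#[derive(Default)]
struct Outbox {
    // Assumption: interior mutability, because the send methods take `&self`.
    staged: RefCell<VecDeque<(Target, Message)>>,
}

impl Outbox {
    fn send_to_replica(&self, replica: u8, msg: Message) {
        self.staged
            .borrow_mut()
            .push_back((Target::Replica(replica), msg));
    }

    fn send_to_client(&self, client: u128, msg: Message) {
        self.staged
            .borrow_mut()
            .push_back((Target::Client(client), msg));
    }

    /// Called by the simulator after dispatching to the owning replica.
    /// Returns the staged messages in FIFO order and leaves the outbox empty.
    fn drain(&self) -> Vec<(Target, Message)> {
        self.staged.borrow_mut().drain(..).collect()
    }
}
```

The simulator would call `drain()` once per dispatch and pass each `(Target, Message)` pair on to the network, so nothing reaches the wire until the replica borrow has ended.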

```
Simulator tick loop:

    1. packets = network.step()              // &mut network
                                             // network borrow ends

    2. for packet in packets:                // &self.replicas[id]
         replica.on_message(packet)
         replica.process_loopback()
         // consensus calls bus.send_to_replica() → stages in outbox

         outbox[id].drain()                  // collect staged messages
           → network.submit(from, to, msg)   // &mut network (replica borrow ended)

    3. network.recycle_buffer(packets)       // &mut network

    4. network.tick()                        // &mut network
```
                           
Each phase borrows either replicas or network, never both simultaneously. The 
borrow checker is satisfied without `Arc<Mutex<>>`, `RefCell`, or dynamic 
dispatch. 
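The phase separation can be sketched in a toy form as follows. Everything here is hypothetical: `Network`, `Replica`, and `Packet` are simplified stand-ins, the forwarding rule in `on_message` is invented just to generate outbound traffic, and the outbox again assumes a `RefCell` because `on_message` takes `&self`. `recycle_buffer` and `tick` are left out.

```rust
use std::cell::RefCell;

#[derive(Debug, Clone, PartialEq)]
struct Packet {
    from: usize,
    to: usize,
    payload: u32,
}

#[derive(Default)]
struct Network {
    in_flight: Vec<Packet>,
}

impl Network {
    fn submit(&mut self, from: usize, to: usize, payload: u32) {
        self.in_flight.push(Packet { from, to, payload });
    }

    /// Hands over every packet currently in flight (no delay model here).
    fn step(&mut self) -> Vec<Packet> {
        std::mem::take(&mut self.in_flight)
    }
}

struct Replica {
    id: usize,
    outbox: RefCell<Vec<(usize, u32)>>, // staged (target, payload) pairs
    received: RefCell<Vec<u32>>,
}

impl Replica {
    fn new(id: usize) -> Self {
        Replica {
            id,
            outbox: RefCell::new(Vec::new()),
            received: RefCell::new(Vec::new()),
        }
    }

    /// Takes `&self`, like the bus trait. Invented rule: forward
    /// `payload + 1` to the next replica while the payload is below 3.
    fn on_message(&self, packet: &Packet, replica_count: usize) {
        self.received.borrow_mut().push(packet.payload);
        if packet.payload < 3 {
            let next = (self.id + 1) % replica_count;
            self.outbox.borrow_mut().push((next, packet.payload + 1));
        }
    }
}

struct Simulator {
    network: Network,
    replicas: Vec<Replica>,
}

impl Simulator {
    fn step(&mut self) {
        // Phase 1: only the network is borrowed (mutably).
        let packets = self.network.step();
        let count = self.replicas.len();
        for packet in &packets {
            // Phase 2: shared borrow of one replica; sends are only staged.
            let replica = &self.replicas[packet.to];
            replica.on_message(packet, count);
            let staged: Vec<(usize, u32)> =
                replica.outbox.borrow_mut().drain(..).collect();
            // Phase 3: the replica is no longer used; feed the network.
            for (to, payload) in staged {
                self.network.submit(packet.to, to, payload);
            }
        }
    }
}

/// Usage example: inject one packet and run a few ticks.
fn demo() -> Vec<Vec<u32>> {
    let mut sim = Simulator {
        network: Network::default(),
        replicas: (0..3).map(Replica::new).collect(),
    };
    sim.network.submit(0, 1, 1);
    for _ in 0..4 {
        sim.step();
    }
    sim.replicas
        .iter()
        .map(|r| r.received.borrow().clone())
        .collect()
}
```

In this toy run the injected packet hops from replica 1 to 2 to 0, and each hop is submitted to the network only after the receiving replica has finished handling its message.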

## Behavior of `replica_crash`

| Message type | Behavior | Rationale |
| --- | --- | --- |
| FROM crashed node, already in network | Delivered to live targets | Already on the wire |
| TO crashed node, in network | Dropped at delivery time (link filter = BLOCK_ALL) | Connection refused |
| FROM crashed node, still in outbox | Discarded (`outbox.drain()` in `replica_crash`) | Never reached the wire |
| TO restarted node, queued pre-restart | Delivered after links re-enabled | Consensus rejects stale via view/op checks |
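The four rows above can be exercised with a small model. This is only a sketch under simplifying assumptions: the `Sim` type and its fields are hypothetical, the link filter is reduced to one boolean per target node (`false` playing the role of BLOCK_ALL), and delivery is plain FIFO with no delay or reordering.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
struct Packet {
    from: usize,
    to: usize,
    payload: &'static str,
}

/// Hypothetical, heavily simplified simulator state.
struct Sim {
    in_network: VecDeque<Packet>, // packets already on the wire
    outboxes: Vec<Vec<Packet>>,   // staged, not yet submitted
    link_up: Vec<bool>,           // false models link filter = BLOCK_ALL
    delivered: Vec<Packet>,
}

impl Sim {
    fn new(n: usize) -> Self {
        Sim {
            in_network: VecDeque::new(),
            outboxes: vec![Vec::new(); n],
            link_up: vec![true; n],
            delivered: Vec::new(),
        }
    }

    fn replica_crash(&mut self, id: usize) {
        // Staged messages never reached the wire, so they are discarded.
        self.outboxes[id].clear();
        // Block delivery to the crashed node.
        self.link_up[id] = false;
    }

    fn replica_restart(&mut self, id: usize) {
        self.link_up[id] = true;
    }

    /// Delivers the next packet, filtering at delivery time.
    fn deliver_next(&mut self) {
        if let Some(packet) = self.in_network.pop_front() {
            if self.link_up[packet.to] {
                self.delivered.push(packet);
            }
            // Otherwise the packet is dropped ("connection refused").
        }
    }
}

/// Usage example: crash node 0, deliver two packets, restart, deliver one more.
fn demo() -> (Vec<&'static str>, usize) {
    let mut sim = Sim::new(3);
    sim.in_network.push_back(Packet { from: 0, to: 1, payload: "from-crashed" });
    sim.in_network.push_back(Packet { from: 1, to: 0, payload: "to-crashed" });
    sim.in_network.push_back(Packet { from: 2, to: 0, payload: "queued-pre-restart" });
    sim.outboxes[0].push(Packet { from: 0, to: 2, payload: "staged" });

    sim.replica_crash(0);
    sim.deliver_next(); // sent by the crashed node, target is live: delivered
    sim.deliver_next(); // addressed to the crashed node: dropped
    sim.replica_restart(0);
    sim.deliver_next(); // still queued when links come back: delivered

    let payloads = sim.delivered.iter().map(|p| p.payload).collect();
    (payloads, sim.outboxes[0].len())
}
```

The model does not cover the consensus-side rejection of stale messages from the last row; that would still rely on the view/op checks in the replica.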

Please let me know your thoughts about this proposal.

GitHub link: https://github.com/apache/iggy/discussions/3017
